Bring-up of Mistral and Mixtral on Wormhole for single-chip and multi-chip respectively.
Mixtral8x7b
Status
On main, inside models/demos/t3000/mixtral8x7b.
Perf target -> 33 tok/s/u
Perf estimated at around 1.78 ms per layer -> 17.5 tok/s end-to-end (seqlen = 1) and around 3.1 ms per layer -> 10 tok/s (seqlen = 2048).
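The arrow conversion follows from Mixtral8x7b's 32 decoder layers; a minimal sketch of the arithmetic, assuming per-token latency is simply layers × per-layer time and ignoring host overhead:

```python
# Convert per-layer decode latency to end-to-end tokens/s,
# assuming 32 decoder layers (Mixtral8x7b) and no host overhead.
n_layers = 32
for t_layer_ms in (1.78, 3.1):
    t_token_s = n_layers * t_layer_ms / 1000.0   # seconds per token
    print(f"{t_layer_ms} ms/layer -> {1.0 / t_token_s:.1f} tok/s")
# 1.78 ms/layer -> 17.6 tok/s
# 3.1 ms/layer -> 10.1 tok/s
```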
Tasks
[6 Jun 2024]
[x] Correctness validated for hundreds of iterations. Users stop generating at the EoS token.
[ ] Perf optimizations
[x] Device perf above target: 24.5 tok/s/u
[x] e2e perf at 1% of device perf: the main culprit is ttnn multi-device overhead, e.g. ttnn.linear takes 7 ms vs 70 us in ttlib.
[x] Fixed issue with program cache not being properly used.
[ ] Largest bottlenecks are the MLP matmuls, followed by AllGather
[x] Update-cache ops follow right after, partially due to pad and transpose ops.
[ ] Implement update cache with parallelization over batch (see the sketch below)
[x] Implement fused ops for pad + transpose and unpad + transpose.
[ ] Add embedding + argmax to device
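For the open update-cache item above, a minimal PyTorch sketch of the batch-parallel idea (the shapes and names here are illustrative assumptions, not the actual ttnn op): each user's current-token K/V is scattered into its own cache position in one vectorized write instead of a per-user loop.

```python
import torch

def update_kv_cache(cache: torch.Tensor,    # [batch, n_heads, max_seq_len, head_dim]
                    new_kv: torch.Tensor,   # [batch, n_heads, 1, head_dim], current token
                    cur_pos: torch.Tensor): # [batch], write position per user
    # Advanced indexing writes every user's slot in one vectorized op,
    # i.e. the update is parallelized over the batch dimension.
    batch = cache.shape[0]
    cache[torch.arange(batch), :, cur_pos, :] = new_kv.squeeze(2)
    return cache
```

On device, the same idea maps to splitting the update-cache kernel's work across the batch dimension rather than iterating over users serially.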
[25 May 2024]
[x] Port the code to use the ttnn multi-device API
[x] Redo sharded KV cache
[x] Due to a softmax correctness issue (different users would get different softmax values for the same input), the keys and values were reverted to interleaved
[x] Explore sharded softmax (optimal) and report the issue (or convert to interleaved right before and after the op)
[x] Improve the test-model PCC comparison
[x] Instead of relying on a baseline output, do teacher forcing and compare PCC across many iterations; it should grow higher after a short while (see the sketch after this list).
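A minimal sketch of the teacher-forced PCC check above (pure PyTorch; `ref_model`, `tt_model`, and `ref_tokens` are hypothetical stand-ins for the reference model, the device model, and a ground-truth token sequence):

```python
import torch

def pcc(a: torch.Tensor, b: torch.Tensor) -> float:
    """Pearson correlation coefficient between two flattened tensors."""
    a, b = a.flatten().float(), b.flatten().float()
    a, b = a - a.mean(), b - b.mean()
    return (a @ b / (a.norm() * b.norm())).item()

def teacher_forced_pcc(ref_model, tt_model, ref_tokens, n_iters=100):
    """Feed both models the same ground-truth token each step (teacher
    forcing) and track per-iteration logit PCC, instead of comparing a
    single free-running baseline output."""
    pccs = []
    for i in range(n_iters):
        tok = ref_tokens[:, i : i + 1]   # ground-truth token, not argmax
        ref_logits = ref_model(tok)
        tt_logits = tt_model(tok)
        pccs.append(pcc(ref_logits, tt_logits))
    return pccs
```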
Mistral
Status
On main, inside models/demos/mistral.
Device perf: 13.3 tok/s/u
e2e perf: 10.9 tok/s/u
Tasks
[x] Improved weight loading to use a flag system.
[x] Stress test (repeat iteration 100 of the single-decoder test for 1 to 2 hours to test stability; see the sketch after this list)
[x] Runs out of L1 memory after iteration 567
[x] An initial look at the L1 visualizer after 70 iterations didn't show any memory leaks
[x] Batched attention. [MERGED]
[x] Initial e2e perf measured at around 3.2 tok/s/u
[x] Hanging after iteration 31
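A sketch of the time-boxed stress loop behind the stability task above (`run_decoder_iteration` is a hypothetical stand-in for re-running the single-decoder test's iteration):

```python
import time

def stress_test(run_decoder_iteration, hours: float = 2.0):
    """Repeat one decoder iteration for a fixed wall-clock budget and
    report where it fails, to surface issues like the L1 out-of-memory
    at iteration 567 or the hang at iteration 31."""
    deadline = time.time() + hours * 3600
    i = 0
    while time.time() < deadline:
        i += 1
        try:
            run_decoder_iteration()
        except Exception as exc:   # e.g. L1 allocation failure
            print(f"failed at iteration {i}: {exc}")
            raise
    print(f"completed {i} iterations without failure")
```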
Issues
[x] CI decoder test hanging. [RESOLVED]
[x] Stress test again on the latest version and debug the memory-allocation issue if still required