tenstorrent / tt-metal

TT-NN operator library and TT-Metalium low-level kernel programming model.
Apache License 2.0

Optimize Dispatch Time for Mixtral Decode Ops #12282

Open sraizada-tt opened 1 month ago

sraizada-tt commented 1 month ago

Dispatch time can be a limiting factor for perf, especially for decode, where op device latencies are low. Traced Mixtral decode perf is shown below:

[image: traced Mixtral decode perf breakdown]

If dispatch time went to zero, the perf boost would be ~1.058x t/s/u. fyi @yieldthought @uaydonat @mtairum @xuncaiTT
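As a hedged back-of-envelope (not a tt-metal measurement): if dispatch occupies a fraction f of each decode iteration, removing it entirely would speed up throughput by 1 / (1 - f). A dispatch share of roughly 5.5% reproduces the ~1.058x figure above.

```python
# Toy model: throughput speedup from eliminating a dispatch overhead that
# takes up fraction `dispatch_fraction` of each decode iteration.
def speedup_if_dispatch_removed(dispatch_fraction: float) -> float:
    return 1.0 / (1.0 - dispatch_fraction)

# A dispatch share of ~5.48% gives roughly the 1.058x quoted above.
print(round(speedup_if_dispatch_removed(0.0548), 3))  # → 1.058
```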

uaydonat commented 1 month ago

Spreadsheet says device perf is 25.34 t/s/u and e2e perf is 14.39 t/s/u at seq_length=32

If dispatch time is only 5%, where is the remaining e2e perf going? Is it all untilize?

Can we list these?

device perf: X t/s/u
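A sketch of the accounting being asked for here, using only the numbers quoted above (device 25.34 t/s/u vs e2e 14.39 t/s/u at seq_length=32); the per-token time gap is what dispatch, untilize, and host overheads have to account for between them.

```python
# Convert the quoted throughputs to per-token latencies and take the gap.
device_tsu, e2e_tsu = 25.34, 14.39     # t/s/u figures from the spreadsheet
device_ms = 1000.0 / device_tsu        # ms per token, device ops only
e2e_ms = 1000.0 / e2e_tsu              # ms per token, end to end
gap_ms = e2e_ms - device_ms            # dispatch + untilize + host overhead
print(f"{device_ms:.2f} ms device, {e2e_ms:.2f} ms e2e, {gap_ms:.2f} ms gap")
```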

sraizada-tt commented 1 month ago

Mixtral e2e perf issue: https://github.com/tenstorrent/tt-metal/issues/12307

uaydonat commented 1 month ago

Assigning to @pgkeller to optimize the dispatch times

pgkeller commented 1 month ago

See #11796 for current efforts (will create separate issues for subsequent work).

uaydonat commented 1 month ago

> See #11796 for current efforts (will create separate issues for subsequent work).

It would make sense when you guys improve dispatch times, we come back to these issues and measure the improvement and compare against the expectations.

pgkeller commented 1 month ago

> See #11796 for current efforts (will create separate issues for subsequent work).

I realized the issue above is pretty terse; some detail:

1. The current effort is to dispatch kernel n+1 while kernel n is running, to hide much of the cost. The groundwork for this has been going in over literally months; we're getting close, but there is still lots of work to do (including some research).
2. We plan to add a 2nd dispatcher to split the work of writing RTAs. This won't benefit dispatch-on-eth-on-WH though.
3. We plan to duplicate RTAs in certain circumstances to reduce riscv overhead.
4. We plan to add a 2nd processor for reading from DRAM. I suspect this is not generally the bottleneck for dispatch, but it will benefit other use cases (and some kernel dispatch cases).
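The pipelining idea in point (1) can be illustrated with a hypothetical toy timing model (this is not tt-metal code; `total_time` and its parameters are invented for illustration): dispatch of kernel n+1 overlaps kernel n's execution, so a dispatch is fully hidden whenever the previous kernel runs at least as long as the dispatch takes.

```python
# Toy model of serial vs. pipelined kernel dispatch (hypothetical, for
# illustration only). `d` is the per-kernel dispatch cost.
def total_time(kernel_times, d, pipelined):
    if not pipelined:
        # Serial: every kernel pays its dispatch cost up front.
        return sum(d + k for k in kernel_times)
    # Pipelined: only the first dispatch is exposed; each later dispatch
    # overlaps the previous kernel, so a step costs max(kernel, dispatch).
    t = d
    for k in kernel_times[:-1]:
        t += max(k, d)
    return t + kernel_times[-1]

# Long kernels hide dispatch almost entirely...
print(total_time([10, 10, 10], 2, False))  # → 36
print(total_time([10, 10, 10], 2, True))   # → 32
# ...but a run of short kernels still exposes it (cf. the eth-WH caveat below).
print(total_time([1, 1, 1], 2, True))      # → 7 (vs. 9 serial)
```

This also shows why a row of short-duration kernels is the hard case: when kernel time is below dispatch time, overlap can no longer hide the dispatch cost.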

uaydonat commented 1 month ago

> This won't benefit dispatch-on-eth-on-WH though

Does this mean we do not expect zero dispatch (or any benefit) for CCL ops?

pgkeller commented 1 month ago

> This won't benefit dispatch-on-eth-on-WH though
>
> Does this mean we do not expect zero dispatch (or any benefit) for CCL ops?

There are multiple efforts:

1. Hide the cost of dispatch behind running kernels. This applies to eth-WH.
2. Speed up dispatch. Some of these changes will apply to eth-WH, some will not (eth-WH only has 1 riscv, so it won't benefit from, e.g., using 2 dispatchers).

When all is said and done, eth-WH will still be slower than tensix dispatch in some cases though (in particular, multiple short-duration kernels in a row).