Open sraizada-tt opened 1 month ago
Spreadsheet says device perf is 25.34 t/s/u and e2e perf is 14.39 t/s/u at seq_length=32
If dispatch time is only 5%, where is the rest of the e2e perf going? Is it all untilize?
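To make the size of the gap concrete, here is a minimal sketch that converts the two spreadsheet numbers into per-token latencies. It assumes t/s/u means tokens per second per user, so that its reciprocal is a per-token latency, and that "device perf" counts only on-device op time. Both are my reading of the numbers, not something the spreadsheet states.

```python
def per_token_latency_ms(tokens_per_s_per_user: float) -> float:
    """Per-token latency in ms, assuming t/s/u is tokens/second/user."""
    return 1000.0 / tokens_per_s_per_user


device_tsu = 25.34  # device perf from the spreadsheet (seq_length=32)
e2e_tsu = 14.39     # e2e perf from the spreadsheet (seq_length=32)

device_ms = per_token_latency_ms(device_tsu)
e2e_ms = per_token_latency_ms(e2e_tsu)

# Time per token that is not explained by device op time.
gap_ms = e2e_ms - device_ms
gap_fraction = gap_ms / e2e_ms

print(f"device: {device_ms:.2f} ms/token, e2e: {e2e_ms:.2f} ms/token")
print(f"unaccounted: {gap_ms:.2f} ms/token ({gap_fraction:.1%} of e2e)")
```

Under these assumptions, a bit over 40% of the e2e time per token sits outside device op time, which is much more than a 5% dispatch share would explain.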
Can we list these?
device perf: X t/s/u
Mixtral e2e perf issue: https://github.com/tenstorrent/tt-metal/issues/12307
Assigning to @pgkeller to optimize the dispatch times
See #11796 for current efforts (I will create other issues later for the work that follows this effort).
Once you guys improve dispatch times, it would make sense for us to come back to these issues, measure the improvement, and compare it against the expectations.
I realized the issue above is pretty terse, so here is some detail:
1) The current effort is to dispatch kernel n+1 while kernel n is running, to hide much of the cost. The groundwork for this has been going in for months; we're getting close, but there is still lots of work to do (including some research).
2) We plan to add a 2nd dispatcher to split the work of writing RTAs. This won't benefit dispatch-on-eth-on-WH though.
3) We plan to duplicate RTAs in certain circumstances to reduce riscv overhead.
4) We plan to add a 2nd processor for reading from DRAM. I suspect this is not generally the bottleneck for dispatch, but it will benefit other use cases (and some kernel dispatch cases).
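As a rough illustration of item 1, the toy model below compares fully serialized dispatch with dispatching kernel n+1 while kernel n runs. It is a sketch under simplifying assumptions (one kernel in flight, dispatch of the next kernel starts when the current kernel starts, no other overheads), not a description of the actual tt-metal dispatcher. The function names and all timings are hypothetical.

```python
def serial_time(dispatch_us, kernel_us):
    """Each kernel is dispatched, then run, with no overlap."""
    return sum(d + k for d, k in zip(dispatch_us, kernel_us))


def overlapped_time(dispatch_us, kernel_us):
    """Dispatch of kernel i+1 runs concurrently with kernel i.

    Kernel i+1 can start only once kernel i has finished AND its own
    dispatch has finished, so each step costs max(kernel_i, dispatch_{i+1}).
    The first dispatch has nothing to hide behind and is fully exposed.
    """
    total = dispatch_us[0]
    for i, k in enumerate(kernel_us):
        next_dispatch = dispatch_us[i + 1] if i + 1 < len(dispatch_us) else 0
        total += max(k, next_dispatch)
    return total


# Hypothetical timings in microseconds.
dispatch = [5, 5, 5]
long_kernels = [50, 50, 50]
short_kernels = [2, 2, 2]

print(serial_time(dispatch, long_kernels), overlapped_time(dispatch, long_kernels))
print(serial_time(dispatch, short_kernels), overlapped_time(dispatch, short_kernels))
```

In this model, long kernels hide all dispatch except the first one, while kernels shorter than their successor's dispatch time leave most of the dispatch cost exposed. That is why hiding dispatch and speeding up dispatch are complementary.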
This won't benefit dispatch-on-eth-on-WH though
Does this mean we should not expect zero dispatch (or any benefit) for CCL ops?
There are multiple efforts:
1) Hide the cost of dispatch behind running kernels. This applies to eth-WH.
2) Speed up dispatch. Some of these changes will apply to eth-WH, some will not (eth-WH only has 1 riscv, so it won't benefit from, e.g., using 2 dispatchers).
When all is said and done, eth-WH will still be slower than tensix dispatch in some cases though (in particular, multiple short-duration kernels in a row).
Dispatch time can be a limiting factor for perf, especially for decode, where op device latencies are low. We show traced Mixtral decode perf below:
If dispatch went to 0, perf boost would be: 1.057959999442948 X t/s/u
fyi @yieldthought @uaydonat @mtairum @xuncaiTT
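For reference, the quoted boost can be turned back into a dispatch share of total time. This is a minimal sketch assuming the boost is the ratio of throughput with zero dispatch to current throughput, and that dispatch time is purely additive to per-token latency. The helper names are hypothetical.

```python
def dispatch_fraction_from_speedup(speedup: float) -> float:
    """Share of total time spent in dispatch, given the speedup from removing it."""
    return 1.0 - 1.0 / speedup


def speedup_if_dispatch_removed(dispatch_fraction: float) -> float:
    """Inverse relation: throughput ratio if dispatch time went to zero."""
    return 1.0 / (1.0 - dispatch_fraction)


boost = 1.057959999442948  # figure quoted above
frac = dispatch_fraction_from_speedup(boost)

print(f"dispatch share of total time: {frac:.2%}")
print(f"round trip: {speedup_if_dispatch_removed(frac):.6f}x")
```

Under these assumptions the boost corresponds to dispatch taking roughly 5.5% of the traced decode time, in line with the ~5% figure mentioned earlier in the thread.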