Issue
Dispatch time can be a limiting factor for perf, especially for decode, where op device latencies are low. We see that ops generally have 3k to 6k cycles of dispatch time, regardless of the device runtime of the op. Traced Llama decode perf is shown below:
https://docs.google.com/spreadsheets/d/1h8BOeg1dPL9aIum6VLLlQM-KpBM3l7feAWNYji9-YeE/edit?usp=sharing
For smaller ops, dispatch latency can take almost 50% of total runtime, and removing all dispatch latency would improve perf from 15.4 to 17.8 Tok/s/u. There appears to be a roughly constant 4k cycles of dispatch time regardless of the kernel. While there are several outliers that take 8-9k cycles to dispatch, further improving the dispatch latency of those kernels does not have a major effect on perf: we project only about a 0.2 Tok/s/u improvement if they are reduced to 4k. In my ongoing PRs, I have already improved them from 20k+ cycles to 8-9k (https://github.com/tenstorrent/tt-metal/pull/11957, https://github.com/tenstorrent/tt-metal/pull/11967).
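To make the projection above concrete, here is a minimal sketch of the arithmetic: if ops run back-to-back, per-token latency is the sum of device time plus dispatch time over all ops, so removing (or shrinking) the per-op dispatch cost shortens each token step. All concrete numbers in the sketch (ops per decode step, clock rate) are illustrative placeholders, not measured tt-metal data.

```python
# Sketch of the throughput projection used above. Assumes ops execute
# serially, so token latency = sum(device time + dispatch time) per op.
# The op count and clock frequency below are hypothetical examples.

def projected_tok_per_s(baseline_tok_per_s: float,
                        dispatch_cycles_removed: int,
                        ops_per_token: int,
                        clock_hz: float) -> float:
    """Estimate throughput if `dispatch_cycles_removed` cycles of
    dispatch latency were eliminated from every op in a decode step."""
    baseline_s_per_tok = 1.0 / baseline_tok_per_s
    saved_s_per_tok = ops_per_token * dispatch_cycles_removed / clock_hz
    return 1.0 / (baseline_s_per_tok - saved_s_per_tok)

# Illustrative: 15.4 Tok/s/u baseline, removing a constant 4k cycles of
# dispatch from each of a hypothetical 100 ops at a 1 GHz clock.
print(projected_tok_per_s(15.4, 4000, 100, 1e9))
```

The same function explains why shaving the 8-9k-cycle outliers down to 4k buys so little: only a handful of ops save ~4-5k cycles each, so the per-token saving is small relative to total step time.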
If there were a generic optimization that reduced dispatch time for all ops, it would significantly improve decode perf across all models.
FYI: @uaydonat @jvasilje @cglagovichTT