tenstorrent / tt-metal

:metal: TT-NN operator library, and TT-Metalium low level kernel programming model.

Optimize Dispatch Time for TTNN Ops #12074

Open caixunshiren opened 2 months ago

caixunshiren commented 2 months ago

Issue

Dispatch time can be a limiting factor for perf, especially for decode, where op device latencies are low. We see that ops generally take 3k to 6k cycles of dispatch time, regardless of their device runtime. A traced Llama decode perf breakdown is shown below:

[Image: per-op breakdown of traced Llama decode perf]

https://docs.google.com/spreadsheets/d/1h8BOeg1dPL9aIum6VLLlQM-KpBM3l7feAWNYji9-YeE/edit?usp=sharing
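For context on where per-op dispatch numbers like these come from, here is a minimal sketch. It assumes a trace that exposes device-kernel start/end timestamps per op (a hypothetical structure, not the actual tt-metal profiler output) and approximates dispatch time as the gap between consecutive kernels:

```python
# Hypothetical sketch (not the tt-metal profiler API): estimate per-op dispatch
# overhead from a trace of op timestamps. Each trace row is assumed to carry the
# op name plus device-kernel start/end in nanoseconds; the dispatch time for op N
# is approximated as the gap between op N-1 finishing and op N starting.
from dataclasses import dataclass

DEVICE_CLOCK_GHZ = 1.0  # assumed device clock, used to convert ns gaps to cycles

@dataclass
class OpTrace:
    name: str
    kernel_start_ns: int
    kernel_end_ns: int

def dispatch_gaps_cycles(ops: list[OpTrace]) -> list[tuple[str, float]]:
    """Return (op name, dispatch gap in cycles) for each op after the first."""
    gaps = []
    for prev, cur in zip(ops, ops[1:]):
        gap_ns = cur.kernel_start_ns - prev.kernel_end_ns
        gaps.append((cur.name, gap_ns * DEVICE_CLOCK_GHZ))
    return gaps

# Example: two back-to-back ops with a ~4k-cycle gap between their kernels.
trace = [
    OpTrace("matmul", 0, 10_000),
    OpTrace("all_gather", 14_000, 20_000),
]
print(dispatch_gaps_cycles(trace))  # [('all_gather', 4000.0)]
```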

For smaller ops, dispatch latency can take almost 50% of total runtime, and removing all dispatch latency would improve perf from 15.4 to 17.8 Tok/s/u. There appears to be a roughly constant 4k cycles of dispatch time regardless of the kernel. While there are several outliers that take 8-9k cycles to dispatch, improving the dispatch latency of those kernels further does not have a major impact on perf: we project only about a 0.2 Tok/s/u improvement if they are reduced to 4k. In my ongoing PRs, I have already improved them from 20k+ cycles to 8-9k (https://github.com/tenstorrent/tt-metal/pull/11957, https://github.com/tenstorrent/tt-metal/pull/11967).
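As a sanity check on the 15.4 → 17.8 Tok/s/u projection, a small back-of-the-envelope calculation (the ~8.8 ms/token of dispatch overhead here is derived from those two throughput figures, not measured independently):

```python
# Projected throughput if a fixed amount of per-token dispatch overhead were removed.
def projected_tok_per_s(current_tok_per_s: float, dispatch_ms_per_token: float) -> float:
    latency_ms = 1000.0 / current_tok_per_s          # current per-token latency
    return 1000.0 / (latency_ms - dispatch_ms_per_token)

current = 15.4                   # Tok/s/u measured for traced Llama decode
# 1000 / 15.4 ~= 64.9 ms per token; removing ~8.8 ms/token of aggregate dispatch
# overhead reproduces the 17.8 Tok/s/u figure quoted above.
print(projected_tok_per_s(current, 8.8))             # ~17.8
```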

If there were a generic optimization that improved dispatch time for all ops, it would bring a significant change to decode perf across all models.

FYI: @uaydonat @jvasilje @cglagovichTT

sraizada-tt commented 2 months ago

Mixtral dispatch: https://github.com/tenstorrent/tt-metal/issues/12282