tenstorrent / tt-metal

:metal: TT-NN operator library, and TT-Metalium low-level kernel programming model.

TG Llama e2e perf #11472

Open · cglagovichTT opened 2 months ago

cglagovichTT commented 2 months ago

Discuss TG Llama e2e perf in this issue


A first Tracy capture shows that concatenating the output on host is a huge 250 ms bottleneck. I resolved this by using AllGather on device. In addition, single-layer (1L) dispatch on the main thread takes 12 ms, while device FW time per layer is ~860 µs. This indicates that tracing is required to achieve high e2e performance.
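For context, the fix amounts to replacing the host-side concat with the on-device CCL op. A minimal sketch, assuming a multi-device ttnn tensor sharded on the last dim (`sharded_output`, `per_device_outputs`, and the gather dim are illustrative names, not the actual model code):

```python
import ttnn

# Before (hypothetical): read every shard back to host and concatenate
# with torch, which is what showed up as the ~250 ms bottleneck in Tracy.
#   shards = [ttnn.to_torch(t) for t in per_device_outputs]
#   output = torch.cat(shards, dim=-1)

# After: gather across devices with the ttnn CCL op, then read back once.
gathered = ttnn.all_gather(sharded_output, dim=3, num_links=1)
output = ttnn.to_torch(gathered)
```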

cglagovichTT commented 2 months ago

I enabled tracing on a branch and measured 123.7 ms per iteration (8.08 t/s/u) from Python. This is an improvement over the 1 t/s/u we see in eager mode.
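For anyone reproducing this: trace capture records the dispatch commands for one iteration and replays them without the host on the critical path. A minimal sketch using the ttnn trace API (queue and trace-region details elided; treat signatures as illustrative of the API at the time, and `model`/`decode_input` as hypothetical names):

```python
import ttnn

# Warm-up run in eager mode so all kernels are compiled before capture.
out = model(decode_input)

# Capture one decode iteration into a device-side trace buffer.
# (The device must be opened with a nonzero trace_region_size.)
trace_id = ttnn.begin_trace_capture(device, cq_id=0)
out = model(decode_input)
ttnn.end_trace_capture(device, trace_id, cq_id=0)

# Replay: re-runs the captured command stream, skipping the ~12 ms/layer
# of host-side dispatch measured in eager mode.
for _ in range(num_iterations):
    ttnn.execute_trace(device, trace_id, cq_id=0, blocking=False)
ttnn.synchronize_device(device)
```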


The device profile shows that the sum of dispatch + FW times is 115.8 ms on device. This is 7.9 ms short of the e2e measurement, which is easily explained by the time spent untilizing on host (blocked by #11509).
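If the remaining host-side untilize can move on device (the direction implied above, pending #11509), the readback could look roughly like this sketch, reusing the hypothetical `gathered` tensor from earlier:

```python
# Sketch only: convert TILE -> ROW_MAJOR on device so that ttnn.to_torch
# no longer pays the ~8 ms host-side untilize per iteration.
rm = ttnn.to_layout(gathered, ttnn.ROW_MAJOR_LAYOUT)
output = ttnn.to_torch(rm)
```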


Investigating the perf dump gives the following cumulative throughput breakdown:

- device FW time only: 12.14 t/s/u
- plus dispatch: 8.6 t/s/u
- plus untilize on host: 8.08 t/s/u

On TG, dispatch accounts for 28.8% of on-device time, compared to ~15% on T3k. #11398 is also relevant to this issue, since dispatch times for line_all_gather are typically 34 µs, longer even than the ~20 µs for ring AllGather on T3k.
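A quick back-of-envelope check (assuming t/s/u is simply the reciprocal of per-iteration latency) shows the numbers above are self-consistent:

```python
e2e_ms = 123.7          # traced e2e per iteration
on_device_ms = 115.8    # dispatch + FW from the device profile

fw_ms = 1000 / 12.14                 # ~82.4 ms of device FW per iteration
dispatch_ms = on_device_ms - fw_ms   # ~33.4 ms of dispatch

print(1000 / on_device_ms)           # ~8.64 t/s/u once dispatch is included
print(1000 / e2e_ms)                 # ~8.08 t/s/u including host untilize
print(dispatch_ms / on_device_ms)    # ~0.288 -> the 28.8% quoted above
```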

cglagovichTT commented 2 months ago

Note that there is a hang that shows up both with and without tracing; this is tracked as a separate issue.

cglagovichTT commented 2 months ago

FYI @SeanNijjar for LineAllGather dispatch. ~~@tt-asaigal says that we certainly are hitting the CCL stall on TG.~~ FYI @davorchap for dispatch proportion on TG with tracing.

tt-asaigal commented 2 months ago

The CCL stall is being hit in eager mode. This leads to higher-than-usual dispatch overhead for the op that follows a CCL. With trace, this should not be the case. It is interesting that, even with trace and chip-local traffic for dispatch, we're seeing higher overhead on TG.

SeanNijjar commented 2 months ago

> FYI @SeanNijjar for LineAllGather dispatch. @tt-asaigal says that we certainly are hitting the CCL stall on TG.

Just to be clear: by stall, you are referring to the current pipeline flush that we have as a workaround for dispatch deadlock until we have multi-channel support, correct? I had assumed any dispatcher measurements taken while that's enabled would be too noisy to be meaningful due to the stall.

cglagovichTT commented 2 months ago

Spreadsheet. TG Llama has more and smaller ops than T3k, so dispatch contributes more to e2e.
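A toy model of this effect, with purely illustrative numbers (not measurements): holding total FW work fixed, a fixed per-op dispatch cost means that more, smaller ops push dispatch's share of on-device time up.

```python
def dispatch_share(n_ops: int, fw_total_ms: float, dispatch_per_op_us: float) -> float:
    """Fraction of on-device time spent in dispatch under a fixed per-op cost."""
    dispatch_ms = n_ops * dispatch_per_op_us / 1000.0
    return dispatch_ms / (fw_total_ms + dispatch_ms)

# Same total FW work, different op granularity (hypothetical values).
print(f"{dispatch_share(1000, 80.0, 15.0):.1%}")  # fewer, larger ops  -> ~15.8%
print(f"{dispatch_share(2500, 80.0, 15.0):.1%}")  # more, smaller ops -> ~31.9%
```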