Open cglagovichTT opened 2 months ago
I enabled tracing on branch and measured 123.7 ms per iteration (8.08 t/s/u) from Python. This is an improvement over the 1 t/s/u we see in eager mode.
The device profile shows that dispatch + FW time sums to 115.8 ms on device. This is 7.9 ms less than the e2e measurement, a gap easily explained by the time to untilize on host (blocked by #11509).
The perf dump gives the following cumulative throughput as each stage is added: device FW time only: 12.14 t/s/u; + dispatch: 8.6 t/s/u; + untilize on host: 8.08 t/s/u.
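For reference, a minimal sketch of the conversion between per-iteration latency and t/s/u used above. It assumes each iteration produces one token per user (which matches 123.7 ms ↔ 8.08 t/s/u); the function and variable names are made up for illustration.

```python
def tokens_per_second_per_user(iteration_ms: float) -> float:
    """Convert per-iteration latency in ms to t/s/u.

    Assumes one token per user per iteration (decode mode).
    """
    return 1000.0 / iteration_ms


e2e_ms = 123.7     # measured from Python with tracing enabled
device_ms = 115.8  # dispatch + FW time from the device profile

# Time not accounted for on device (attributed above to untilize on host).
host_gap_ms = e2e_ms - device_ms

print(f"e2e:      {tokens_per_second_per_user(e2e_ms):.2f} t/s/u")
print(f"device:   {tokens_per_second_per_user(device_ms):.2f} t/s/u")
print(f"host gap: {host_gap_ms:.1f} ms")
```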
On TG, dispatch accounts for 28.8% of on-device time, compared to 15% on T3k. #11398 is also relevant for this issue, since dispatch times for line_all_gather are typically 34 µs, even longer than the ~20 µs for ring AllGather on T3k.
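As a rough cross-check of the dispatch share, the sketch below backs out per-stage times from the numbers reported above. It assumes the t/s/u figures are cumulative and that one iteration yields one token per user; because the inputs are rounded, the result lands near, not exactly on, the 28.8% taken from the profile. All names are illustrative.

```python
def ms_per_iteration(tokens_per_s_per_user: float) -> float:
    """Convert t/s/u back to per-iteration latency in ms (one token per iteration assumed)."""
    return 1000.0 / tokens_per_s_per_user


device_ms = 115.8                  # dispatch + FW on device, from the device profile
fw_ms = ms_per_iteration(12.14)    # FW-only time implied by 12.14 t/s/u
dispatch_ms = device_ms - fw_ms    # remainder of on-device time is dispatch
dispatch_share = dispatch_ms / device_ms

print(f"FW:       {fw_ms:.1f} ms")
print(f"dispatch: {dispatch_ms:.1f} ms")
print(f"dispatch share of on-device time: {dispatch_share:.1%}")
```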
Note that there is a hang that shows up with and without tracing - this is a separate issue.
FYI @SeanNijjar for LineAllGather dispatch. ~@tt-asaigal says that we certainly are hitting the CCL stall on TG.~ FYI @davorchap for dispatch proportion on TG with tracing
The CCL stall is being hit in eager mode. This makes the dispatch overhead of the op that follows a CCL higher than usual. For trace, this should not be the case. Interesting that, even with trace and chip-local traffic for dispatch, we're seeing higher overhead on TG.
> FYI @SeanNijjar for LineAllGather dispatch. @tt-asaigal says that we certainly are hitting the CCL stall on TG.
Just to be clear: by "stall", you are referring to the current pipeline flush that we have as a workaround for the dispatch deadlock until we have multi-channel support, correct? I assumed any dispatcher measurements taken while that's enabled would be too noisy to be meaningful because of the stall.
spreadsheet. TG Llama has more and smaller ops than T3k, so dispatch contributes a larger share of e2e time.
Discuss TG Llama e2e perf in this issue
The first Tracy capture shows that concatenating the output on host is a huge 250 ms bottleneck. I resolved this by using AllGather on device. In addition, dispatching one layer (1L) on the main thread takes 12 ms, while device FW time per layer is ~860 µs. Since host dispatch is far slower than device execution, tracing is required to achieve high e2e throughput.
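To put the host-bound gap in numbers, here is a small sketch using the two per-layer figures above. It assumes that in eager mode each layer takes at least as long as its host dispatch, so the device can only be busy for the FW portion of that time; variable names are illustrative.

```python
host_dispatch_ms = 12.0  # 1L dispatch on the main thread (eager mode)
device_fw_ms = 0.860     # device FW time per layer (~860 µs)

# How many times longer the host takes to dispatch a layer than the device takes to run it.
host_to_device_ratio = host_dispatch_ms / device_fw_ms

# Upper bound on device utilization if every layer waits on host dispatch.
device_busy_fraction = device_fw_ms / host_dispatch_ms

print(f"host dispatch is ~{host_to_device_ratio:.0f}x slower than device FW per layer")
print(f"device busy at most {device_busy_fraction:.1%} of the time in eager mode")
```

Tracing removes the per-op host dispatch from the critical path, which is why it closes most of this gap.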