Open cglagovichTT opened 2 months ago
fyi @davorchap
I've found a core issue in the dispatch path that really hurts CCL performance (especially for these smaller CCLs). It's here: https://github.com/tenstorrent/tt-metal/issues/12395.
With this change, and everything in master, I was seeing some all-gather instances with cycle counts as low as 21k (several thousand cycles better than the current target).
Updated sheet after applying fused AllGather matmul + CCL team's various optimizations. Summary: device dispatch fully explains the difference between device perf and e2e perf.
Device: 19.3 t/s/u
With dispatch: 15.5 t/s/u
With host <-> device data transfer: 15.1 t/s/u
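To make the gap concrete, the three throughput numbers above can be converted to per-token latency. This is a rough sketch that only assumes t/s/u is the inverse of per-token latency; the variable names are made up for illustration.

```python
# Convert the measured t/s/u figures into per-token latency (ms) and
# attribute the differences to dispatch and to host <-> device transfer.
# Assumption: per-token latency is simply 1 / (t/s/u).

def ms_per_token(tokens_per_sec_per_user: float) -> float:
    """Per-token latency in milliseconds for a given t/s/u."""
    return 1000.0 / tokens_per_sec_per_user

device_ms = ms_per_token(19.3)          # device only
with_dispatch_ms = ms_per_token(15.5)   # device + dispatch
with_transfer_ms = ms_per_token(15.1)   # device + dispatch + data transfer

dispatch_overhead_ms = with_dispatch_ms - device_ms
transfer_overhead_ms = with_transfer_ms - with_dispatch_ms

print(f"device:            {device_ms:.2f} ms/token")
print(f"dispatch overhead: {dispatch_overhead_ms:.2f} ms/token")
print(f"transfer overhead: {transfer_overhead_ms:.2f} ms/token")
```

By this estimate dispatch costs roughly 12.7 ms per token while data transfer costs under 2 ms, which matches the summary that dispatch explains most of the device vs. e2e difference.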
2.5-3 weeks low latency drop on t3k (ring buffer)
Turn off eth dispatch for a win (8x7 grid) -- reduces prefetcher and dispatch buffering, hurts BW to DRAM
Paul merged the worst-case dispatch optimization last night. Aditya is going to merge a change in a few days that undoes the optimization.
Decode 128
We are aiming to hit 20 t/s/u end to end for Llama3 decode on t3k. These are the issues left.
Device Perf
Target: 625µs per layer on device. Today we are at 685µs per layer. To hit the target, we need to resolve three issues:
If we resolve these three issues and achieve the expected speedup for each of them, we will be nicely beyond 20 t/s/u on device.
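As a back-of-the-envelope check on the device target, here is a small sketch of how per-layer latency maps to t/s/u. The layer count of 80 is an assumption (it is not stated in this issue), and the calculation ignores any time spent outside the decoder layers.

```python
# Rough mapping from per-layer device time to decode throughput.
# NUM_LAYERS = 80 is an assumption, not a number from this issue;
# work outside the decoder layers is ignored.

CURRENT_US = 685.0   # per-layer device time today
TARGET_US = 625.0    # per-layer device time target
NUM_LAYERS = 80      # assumed layer count

def tokens_per_sec_per_user(per_layer_us: float, num_layers: int = NUM_LAYERS) -> float:
    """t/s/u if one decode step costs num_layers * per_layer_us microseconds."""
    return 1e6 / (per_layer_us * num_layers)

saving_needed_us = CURRENT_US - TARGET_US          # per-layer time to remove
saving_fraction = saving_needed_us / CURRENT_US    # as a share of today's time

current_tsu = tokens_per_sec_per_user(CURRENT_US)
target_tsu = tokens_per_sec_per_user(TARGET_US)

print(f"need to save {saving_needed_us:.0f} us/layer ({saving_fraction:.1%})")
print(f"today:  {current_tsu:.2f} t/s/u")
print(f"target: {target_tsu:.2f} t/s/u")
```

Under that assumption the 625µs target corresponds to 20 t/s/u, and the three issues together need to recover about 60µs per layer.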
E2E Perf
Dispatch adds up to 162µs per layer. CCL accounts for 84µs of that. We have one item of work which optimizes CCL dispatch to bring the total dispatch time per layer to 108µs.
Beyond that, we need zero-latency dispatch in order to achieve 20 t/s/u end to end. The request is described in https://github.com/tenstorrent/tt-metal/issues/12074.
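To illustrate why the CCL dispatch optimization alone is not enough, here is a sketch of the e2e arithmetic. It assumes dispatch time simply adds to device time per layer, assumes 80 layers (not stated in this issue), and ignores host <-> device transfer.

```python
# Estimate e2e t/s/u from per-layer device time plus per-layer dispatch time.
# Assumptions: dispatch is purely additive per layer, NUM_LAYERS = 80,
# and host <-> device data transfer is ignored.

NUM_LAYERS = 80  # assumed layer count

def e2e_tokens_per_sec_per_user(device_us: float, dispatch_us: float,
                                num_layers: int = NUM_LAYERS) -> float:
    """t/s/u when each layer costs device_us + dispatch_us microseconds."""
    return 1e6 / ((device_us + dispatch_us) * num_layers)

scenarios = {
    "today (685 device + 162 dispatch)":      (685.0, 162.0),
    "CCL dispatch optimized (685 + 108)":     (685.0, 108.0),
    "device target + optimized (625 + 108)":  (625.0, 108.0),
    "device target + zero-latency (625 + 0)": (625.0, 0.0),
}

results = {name: e2e_tokens_per_sec_per_user(dev, disp)
           for name, (dev, disp) in scenarios.items()}

for name, tsu in results.items():
    print(f"{name}: {tsu:.2f} t/s/u")
```

With these assumptions, even hitting the 625µs device target with 108µs of dispatch lands around 17 t/s/u, so only the zero-latency case reaches 20.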