tenstorrent / tt-metal

:metal: TT-NN operator library, and TT-Metalium low level kernel programming model.
https://docs.tenstorrent.com/ttnn/latest/index.html

T3k Llama trace e2e perf #11386

Open cglagovichTT opened 3 months ago

cglagovichTT commented 3 months ago

Update: these are the issues we're tracking for trace e2e perf

Running Llama decode with tracing does not significantly improve e2e perf compared to eager execution. See Ammar's table https://github.com/tenstorrent/tt-metal/issues/10014#issuecomment-2278031527

| Layers | No trace e2e (t/s/u) | Trace e2e (t/s/u) | Avg trace time over 50 iters (s) |
|--------|----------------------|-------------------|----------------------------------|
| 1      | 65.8                 | 88.18724174       | 0.01133950876                    |
| 2      | 60.9                 | 80.84390727       | 0.01236951594                    |
| 10     | 41                   | 48.29122735       | 0.02070769485                    |
| 20     | 28.8                 | 29.86463989       | 0.03348441513                    |
| 30     | 22.2                 | 23.19186934       | 0.04311855958                    |
| 80     | 10.6                 | 11.04005052       | 0.09057929564                    |
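
For context, this is roughly how the traced timings above can be collected (a minimal sketch, assuming the ttnn trace APIs `begin_trace_capture` / `end_trace_capture` / `execute_trace` / `release_trace`; the `decode_forward` call and input handling are placeholders, and in practice the traced inputs must stay resident on device):

```python
import time
import ttnn

def measure_trace_e2e(device, decode_forward, device_inputs, iters=50):
    # Warm-up/compile run in eager mode so the trace only records steady-state ops.
    decode_forward(device_inputs)

    # Capture one decode iteration into a trace.
    trace_id = ttnn.begin_trace_capture(device, cq_id=0)
    output = decode_forward(device_inputs)
    ttnn.end_trace_capture(device, trace_id, cq_id=0)

    # Replay the captured trace and time it end-to-end from the host.
    start = time.time()
    for _ in range(iters):
        ttnn.execute_trace(device, trace_id, cq_id=0, blocking=False)
    ttnn.synchronize_device(device)  # wait for all replays to finish
    avg_s = (time.time() - start) / iters

    ttnn.release_trace(device, trace_id)
    return avg_s  # average seconds per traced decode iteration
```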

I collected a Tracy capture when running with trace (perf governor, Tracy build) and see that each iteration of decode takes 107 ms (screenshot attached).

80ms are spent waiting for read_buffer and 25ms are spent logging (?) the to_torch call on the logits.

To look at this from the dispatch angle, I traced a single layer of execution and looked at the op-to-op latencies (screenshot attached).

Dispatch latencies range from 2 to 22 µs. Summing the dispatch times and adding them to the device FW times gives 76.3 ms spent in dispatch + kernels, which is about the amount of time Tracy shows the host waiting on read_buffer. Dispatch contributes 148231/954959 ≈ 15% of the device latency. Note that device perf for this traced run (806 µs per layer) is worse than untraced device perf (737 µs per layer); each AllGather seems to be ~20 µs slower in this trace compared to my untraced run.
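
A quick sanity check on those numbers (assuming the 148231/954959 figures are per-layer times in ns, with 80 decode layers per iteration):

```python
# Per-layer figures from the op-to-op latency dump (assumed to be in ns), 80 decode layers.
dispatch_ns_per_layer = 148_231
total_ns_per_layer = 954_959  # dispatch + device FW for one traced layer

fw_ns_per_layer = total_ns_per_layer - dispatch_ns_per_layer
print(fw_ns_per_layer / 1e3)                        # ~806 us device FW per layer, as reported
print(dispatch_ns_per_layer / total_ns_per_layer)   # ~0.155 -> ~15% of device latency
print(80 * total_ns_per_layer / 1e6)                # ~76.4 ms of dispatch + kernels per iteration
```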

cglagovichTT commented 3 months ago

FYI @uaydonat please add to Llama3 project

cglagovichTT commented 3 months ago

@arakhmati is the logging in ttnn/cpp/ttnn/deprecated/tt_lib/csrc/tt_lib_bindings_tensor_pytensor.cpp:606 only enabled in tracy builds, or should I expect this to be an issue in release builds as well?

Why would this logging even be enabled in tracy builds?

cglagovichTT commented 3 months ago

fyi @tt-asaigal We are seeing that AllGather op-to-op latency ranges from 20-27 µs. MMIO chips are 20 µs, tunneled chips are 27 µs. There's clearly work to do on the AllGather op to reduce dispatcher pressure, but @SeanNijjar mentioned that the 7 µs difference between L and R chips could be explained by dispatcher flush.

Spreadsheet source

cglagovichTT commented 3 months ago

Compiled some data on AllGather times across devices with trace execution. The device-to-device variation ranges from 14 µs to 52 µs (chart attached).

SeanNijjar commented 3 months ago

Is this a trace with other ops? If other ops are in the mix, we won't know if it's due to speed of different devices (and one chip waiting for the other chip to "catch up") or something else.

cglagovichTT commented 3 months ago

That's true, it's a trace with other ops in the mix... it will be tough to disentangle that.

Here's a section of the profile that spans AllGather -> matmul -> add -> AllGather (screenshot attached).

I assume the first AllGather is a perfect barrier so all devices finish at the same time. Matmul and Add times for dispatch and device FW are very close for all the devices, so I don't see the delay being introduced there. The second AllGather shows the 7 µs difference in dispatch times between L and R devices, but that doesn't explain why dev7 for AllGather2 is 10 µs faster than some other tunneled devices.

cglagovichTT commented 3 months ago

Regarding the 25 ms e2e hit we get from concat + logging on host, I was able to remove that by doing the AllGather on device and reading from device 0. We now see 88 ms per iteration, which is 11.36 t/s/u. Thanks Aditya :) (screenshot attached)
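
For reference, a minimal sketch of that change, assuming `ttnn.all_gather` on the multi-device logits tensor and `ttnn.get_device_tensors` to pick out one device's replica (the tensor name, gather dim, and `num_links` below are illustrative):

```python
import ttnn

# Before: read every chip's logits shard and concatenate/convert on host,
# which is where the ~25 ms of host-side concat + logging showed up.
#
# After: gather the logits across chips on device, then read back only the
# replica that lands on device 0 (the MMIO chip).
logits_gathered = ttnn.all_gather(logits, dim=3, num_links=1)
logits_dev0 = ttnn.get_device_tensors(logits_gathered)[0]
logits_torch = ttnn.to_torch(logits_dev0)
```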

SeanNijjar commented 3 months ago

> The second AllGather shows the 7 µs difference in dispatch times between L and R devices, but that doesn't explain why dev7 for AllGather2 is 10 µs faster than some other tunneled devices.

Unfortunately, I don't think we can even directly compare those numbers, because depending on when, and in which order, the CCL op starts on each chip, large parts of the ring can form (and start executing) while the slower chip is still coming up.

In a way this can "prefetch" some of the work so by the time the slower chip comes up, it's able to just push/pull data without ever stalling.

What I think we really need is a way to measure dispatch perf in isolation from op perf. Especially for CCLs, the two interact with each other in ways they simply don't for single-chip ops. @pgkeller: is there any infra you guys have that I could piggyback off of for CCLs, where I just do pure dispatch and measure that time? If I could run perf regressions against it, that would be even better. I figure even if it's end-to-end time, it should still correlate positively with op dispatch throughput.

tt-asaigal commented 3 months ago

Spoke to Sean offline, and we likely can't simply measure the time the All-Gather kernel spends on actual compute. The kernel/device FW times reported in the spreadsheet include handshaking/barriers across devices, which implicitly also accounts for dispatch time.

@tt-aho @mo-tenstorrent do you think we can accurately measure dispatch perf in isolation if we get rid of the launch-msg write to workers + the dispatcher wait for worker completion? If not, we could just rely on the dispatch latency that's currently reported in the spreadsheet; I don't think kernel execution itself will affect dispatch latency.

Total device time for all-gather can likely be improved, given that it puts extra load on the dispatcher (e.g. unicasting binaries to all workers). I'm currently not sure why all-gather dispatch time for non-MMIO chips is consistently worse: once the trace cmd has been received, all traffic is local to the chip. This needs to be investigated.

mo-tenstorrent commented 3 months ago

Am I correct in saying that op issuance time from host for the all-gather op differs across chips, and that this difference, plus the dispatch overhead (which it should be safe to assume is fairly consistent across chips of the same type, remote vs. local), is what makes the reported all-gather FW/kernel times unrepresentative of the actual kernel time?

In that scenario, wouldn't the smallest kernel time in the group of all_gather ops be a good stand-in for the actual kernel time?

@SeanNijjar Dispatch overhead can be measured in Tracy using dispatch profiling: `TT_METAL_DEVICE_PROFILER_DISPATCH=1` profiles our prefetcher and dispatch cores.
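
For example, something like the following from a test script (a sketch; the assumption that the flag must be set before the device is opened, and that the base `TT_METAL_DEVICE_PROFILER=1` device-profiler flag is also required, should be double-checked):

```python
import os

# Enable device profiling plus dispatch-core (prefetcher/dispatcher) profiling.
# Assumed to take effect only if set before ttnn opens the device.
os.environ["TT_METAL_DEVICE_PROFILER"] = "1"
os.environ["TT_METAL_DEVICE_PROFILER_DISPATCH"] = "1"

import ttnn  # import / open the device only after the environment is configured
```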

mo-tenstorrent commented 3 months ago

One note to share from my conversation with Colman: the dispatch profiler only supports Tensix cores. I forgot that Llama uses 8x8 grids with eth dispatch. Working on eth dispatch profiling at the moment and will give an update soon.

SeanNijjar commented 3 months ago

Thanks @mo-tenstorrent, for my purposes, I can just use the tensix dispatch cores for the sake of characterizing CCL dispatch times.

SeanNijjar commented 3 months ago

> Am I correct in saying that op issuance time from host for the all-gather op differs across chips, and that this difference, plus the dispatch overhead (which it should be safe to assume is fairly consistent across chips of the same type, remote vs. local), is what makes the reported all-gather FW/kernel times unrepresentative of the actual kernel time?

Yes, but also because CCLs on one chip can run and partially complete before they start on another chip; beyond some point, though, the delay in dispatch can block progress globally. So it's related. This is why it's helpful to be able to look at dispatch in isolation.

cglagovichTT commented 3 months ago

I re-measured e2e on branch cglagovich/llama_t3k_trace (commit b954a6de86a858681cbb273e4c2e9b2ce9c9184d), which is rebased on latest main with Austin's dispatch optimizations and Aditya's commit that changes how dispatch cores on remote chips are assigned. This branch also uses AllGather instead of concat-on-host for the output logits.

From a perf dump, I see that dispatch+FW time on device is 76,428,659 ns per iteration (13.08 t/s/u). I also see no difference in AllGather dispatch times between L and R devices!

With a release build I got 0.0857 s per iteration -> 11.65 t/s/u. Note that this measurement does not include pushing new inputs to device, so it is an optimistic view of demo e2e perf.

We have a 9.3 ms gap between the dispatch+FW time measured with the op profiler and the e2e latency measured in Python. That time must account for the device->host read and untilizing on host. Untilizing on device leads to an L1 OOM error, which I'll file.
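
For context, a sketch of the two readback paths being compared (the `logits` tensor name is illustrative, and `ttnn.to_layout` is used here as the assumed way to untilize on device):

```python
import ttnn

# Current path: read the TILE-layout logits and untilize on host as part of
# the conversion to torch (part of the ~9.3 ms host-side gap).
logits_torch = ttnn.to_torch(logits)

# Desired path: untilize on device so the host read is a plain row-major copy.
# On this model the device-side conversion currently hits the L1 OOM mentioned above.
logits_rm = ttnn.to_layout(logits, ttnn.ROW_MAJOR_LAYOUT)
logits_torch = ttnn.to_torch(logits_rm)
```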

Dispatch still accounts for 15% of time spent on device, so that is the most actionable item for improving Llama e2e perf.

uaydonat commented 3 months ago

My executive summary so far:

- 15.8 t/s/u is device perf without tracing
- 15.5 t/s/u because AllGather looks bigger with tracing
- 13.08 t/s/u because of dispatch overheads (AllGather is the biggest culprit at 20-27 µs each)
- 11.65 t/s/u because of untilize on host
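
These throughput figures are just the reciprocal of the per-iteration latency, e.g. using numbers from this thread:

```python
def t_s_u(seconds_per_iteration: float) -> float:
    # One token per user per decode iteration -> tokens/s/user is 1 / latency.
    return 1.0 / seconds_per_iteration

print(t_s_u(0.0764))  # ~13.1 t/s/u for the 76.4 ms dispatch+FW time
print(t_s_u(0.0857))  # ~11.7 t/s/u for the 85.7 ms release-build e2e
```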

cglagovichTT commented 3 months ago

After picking up the AllGather optimizations and applying a few more (DRAM-sharded FF1, fuse SiLU into mul), we have the following:

- 17.1 t/s/u FW time
- 14.4 t/s/u dispatch + FW time
- 12.84 t/s/u e2e with tracing (due to untilize)
- 12.05 t/s/u e2e in eager mode

If AllGather dispatch were 4 µs, the dispatch+FW time would give us 16.06 t/s/u.
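
Back-of-the-envelope for that estimate (assuming ~20 µs of dispatch saved per AllGather, i.e. going from the 20-27 µs measured earlier down to 4 µs):

```python
current_s = 1 / 14.4            # ~69.4 ms dispatch + FW per iteration today
target_s = 1 / 16.06            # ~62.3 ms if AllGather dispatch were ~4 us
saved_per_allgather_s = 20e-6   # assumed ~20 us of dispatch saved per AllGather

# Implied number of AllGather dispatches per iteration that would have to shrink.
print((current_s - target_s) / saved_per_allgather_s)  # ~360 across the 80 layers
```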