cglagovichTT commented 2 months ago

Issue tracking the difference between the expected bfp8_b AllGather perf in Llama (24 µs) and the values measured with tracing (32 µs)

Note: below added by @SeanNijjar

https://github.com/tenstorrent/tt-metal/issues/12395
- Exact perf gain is hard to characterize since it improves a non-deterministic, probabilistic effect, but we save at least a few microseconds on average (sometimes better, sometimes slightly worse)
https://github.com/tenstorrent/tt-metal/issues/12398
- ~1 µs savings for the Llama config

SeanNijjar commented 2 months ago

Started taking a look at this. I'm noticing some huge run-to-run swings (15k+ cycles) in all-gather perf in trace mode. My initial suspicion is that, because I haven't enabled the handshaking optimization (which avoids context switching), we're really vulnerable to context switches here and are seeing large swings because of that. This makes sense because in trace mode all chips start at nearly the same time, so each chip ends up waiting on the handshake.

For non-trace, the last chip to be programmed will not need to wait for the handshake at all, so it can proceed right away without context switching. I'll try putting in the less context-switch-eager handshake mode and see what we get.

SeanNijjar commented 2 months ago

Reproduced the behaviour with a single all-gather op, which seems to confirm that a measurement issue is what got us to the 23k cycles.

For example, in the initial run during trace capture, I see 25k cycles for the op but in the trace I see over 30k cycles (upwards of 45k cycles in some cases).
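To compare these cycle counts against the microsecond figures at the top of the issue, a rough conversion helps. The sketch below assumes a 1 GHz clock (1 cycle = 1 ns); the clock rate is not stated in this thread, and `cycles_to_us` is a hypothetical helper name used only for illustration.

```python
ASSUMED_CLOCK_HZ = 1_000_000_000  # assumption: 1 GHz, not stated in this thread


def cycles_to_us(cycles, clock_hz=ASSUMED_CLOCK_HZ):
    """Convert a cycle count to microseconds at the given clock rate."""
    return cycles * 1_000_000 / clock_hz


# Cycle counts quoted in the comment above.
for label, cycles in [
    ("trace capture run", 25_000),
    ("traced run", 30_000),
    ("worst traced run", 45_000),
]:
    print(f"{label}: {cycles} cycles ~ {cycles_to_us(cycles):.1f} us")
```

Under that assumption, 25k cycles is about 25 µs and 30k cycles is about 30 µs, which lines up roughly with the 24 µs expected and 32 µs measured numbers in the issue description.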

However, one interesting thing I saw was that, even without @xuncaiTT's dispatch optimization, 2 of the 8 chips had a 10k cycle shorter op-to-op latency than all the other chips.

Pinged @tt-asaigal about this offline.

What I suspect is that this delta is caused by variation in the op-to-op latency of the first trace invocation, e.g. 6504983 for the "faster" ones and roughly 6516367 for all the slower ones, a roughly 10-12k gap.

If the above is the cause for the delta, then it could be that this 10-12k delta "persists" across iterations? This doesn't really explain the delta between the traced and non-traced numbers though. That being said, it does indicate that the non-traced version is likely benefitting from dispatch overlaps -> the faster chips are dispatched and waiting for the slower chips and while they are, they can partially "execute" the op for some of the chips while the slower chip is still being programmed with the kernels.

SeanNijjar commented 2 months ago

There are a few things happening here that are all separate but end up feeding into this delta between untraced all-gather kernel time for all-gather standalone vs traced kernel time of all-gather:

1) CCL perf measurement methodology issue
2) Eth dataflow APIs (too) eagerly context switching
3) Inefficient EDM handshaking (mostly due to the above, but also because we actually need to handshake and because of low-level details about when/how the handshake is started)
4) Dispatch time variability between chips in trace mode (the comment above)

1. Perf Eval Methodology

Previous performance numbers were reported without trace enabled. This was done by taking the FW duration times from the last chip to receive the program for the CCL op and using that. The problem with this approach is that it leaves too much time for dispatch on other devices to overlap with compute on chips that start their programs earlier. A diagram below outlines this basic problem. Fundamentally, this problem will always exist with CCLs although to a far smaller degree when tracing is enabled.

In the example above, for the sake of discussion, assume we are running a ring all-gather on 4 chips and that devices receive their "go" signals in the order 0, 1, 3, 2 (picked to show this effect in the worst case).

Device 0 receives the go signal.
- The workers can start executing whatever part of the all-gather they can locally.
- They can start to fill CBs and EDM sender channel buffers while those EDMs wait for handshakes to complete with the other end of the link.
Device 1 receives the go signal.
- Workers start.
- EDM handshakes with device 0 complete.
- Device 0 workers can start forwarding more data from device 1 into CBs in the direction toward device 3.
- Device 1 can "drain" its full input from device 0.
Device 3 receives the go signal.
- Similar events happen for device 0 -> device 3 as for device 0 -> device 1.
- Both device 1 and device 3 can fully "drain" their inputs from device 0.
- Additionally, device 1 may be able to drain the full input from device 3 (through device 0), and similarly, device 3 may be able to drain some or all of the input from device 1.
At this point, the last device (2) receives the go signal, and a non-trivial portion of the all-gather is already "complete".
- The full ring is connected and the rest of the operation completes over time.
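
The sequence above can be approximated with a toy latency model. This is only a sketch under stated assumptions, not how the EDM or the profiler actually works: each chunk takes the shortest ring path (ties go clockwise), every hop costs a fixed `hop_time`, a hop into a device cannot start before that device has received its go signal, and bandwidth contention is ignored. The function names, go times and `hop_time` are all made up for illustration.

```python
def ring_path(src, dst, n):
    """Devices visited after src on the shortest ring path to dst (ties go clockwise)."""
    cw = (dst - src) % n
    ccw = (src - dst) % n
    step = 1 if cw <= ccw else -1
    return [(src + step * k) % n for k in range(1, min(cw, ccw) + 1)]


def measured_durations(go, hop_time):
    """Per-device duration (finish time minus go time) in the toy model."""
    n = len(go)
    durations = []
    for dst in range(n):
        finish = go[dst]
        for src in range(n):
            if src == dst:
                continue
            t = go[src]  # the chunk is available once its source device starts
            for nxt in ring_path(src, dst, n):
                # A hop can only begin once the receiving device is running.
                t = max(t, go[nxt]) + hop_time
            finish = max(finish, t)
        durations.append(finish - go[dst])
    return durations


hop_time = 10
traced = measured_durations([0, 0, 0, 0], hop_time)        # all devices start together
staggered = measured_durations([0, 10, 30, 20], hop_time)  # go order 0, 1, 3, 2
print("traced:   ", traced)
print("staggered:", staggered)
```

In this model the last device to start (device 2) reports half the duration that every device reports when all devices start together, because the data it needs is already staged one hop away by the time it starts. That is the same reason the last-programmed chip's FW duration understates the real op time.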

Moving forward, all CCL op measurement (at least for the purpose of handoff) should be done exclusively in trace mode.

2. Eth Dataflow APIs Are Overly Eager to Context Switch

The default eth dataflow APIs will context switch immediately when eth tx cmd queues are full or messages are not yet available on a channel. The main EDM loop does not use these APIs but its handshake code still does. This noticeably affects performance and variability, both negatively.
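
The difference between an eager wait and a less context-switch-eager wait can be sketched as follows. This is an illustrative Python model, not the actual eth dataflow API: `make_poll`, `wait_eager`, `wait_patient` and `max_idle_polls` are hypothetical names, and a "context switch" is just a counter here.

```python
def make_poll(ready_after):
    """Return a poll function that reports ready on its `ready_after`-th call."""
    calls = 0

    def poll():
        nonlocal calls
        calls += 1
        return calls >= ready_after

    return poll


def wait_eager(poll):
    """Context switch after every failed poll; returns the number of switches."""
    switches = 0
    while not poll():
        switches += 1
    return switches


def wait_patient(poll, max_idle_polls):
    """Spin for up to `max_idle_polls` failed polls before each context switch."""
    switches = 0
    idle = 0
    while not poll():
        idle += 1
        if idle >= max_idle_polls:
            switches += 1
            idle = 0
    return switches


print("eager switches:  ", wait_eager(make_poll(10)))
print("patient switches:", wait_patient(make_poll(10), max_idle_polls=4))
```

The patient variant trades some busy polling for far fewer context switches, which is the kind of trade-off the handshake change is after. The right spin budget would have to be tuned on hardware.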

3. EDM Handshaking Inefficiency

In addition to #2 above, the EDM handshake itself can be implemented more efficiently. For example, the handshake is currently initiated and completed only after EDM channel args are read and channels are initialized. Some of the handshake can be executed before any args are read or channels are initialized (especially for whoever is designated as the "master" in the handshake).
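
A back-of-the-envelope model of why starting the handshake earlier helps: if the handshake is kicked off first and initialization runs while it is in flight, the two latencies overlap instead of adding up. The sketch below assumes the whole handshake can overlap with initialization, which is more optimistic than "some of the handshake"; the function names and cycle counts are made up for illustration.

```python
def ready_after_sequential(init_cycles, handshake_cycles):
    """Handshake starts only after args are read and channels are initialized."""
    return init_cycles + handshake_cycles


def ready_after_overlapped(init_cycles, handshake_cycles):
    """Handshake is started first; initialization runs while it is in flight."""
    return max(init_cycles, handshake_cycles)


init_cycles, handshake_cycles = 1_500, 2_000  # made-up numbers, not measurements
saved = (ready_after_sequential(init_cycles, handshake_cycles)
         - ready_after_overlapped(init_cycles, handshake_cycles))
print(f"cycles saved by overlapping: {saved}")
```

In this model the saving is bounded by the shorter of the two phases, so the benefit depends on how long arg reading and channel initialization actually take relative to the handshake round trip.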

I have put in an initial handshake improvement that addresses point 2 above, along with a mostly improved implementation of the handshake itself, and am seeing a 3-4.2k cycle improvement for this particular all-gather in trace mode. There is still a further tweak I can apply on top of this that will slightly improve performance consistency (it will benefit ~1/3 of handshakes if the tracy diagram is an accurate representation), but it is only expected to save maybe 200-300 ns in those cases, so nothing earth-shattering there.

4. Dispatch Time Variability In Trace

As mentioned in the earlier comment, when running this all-gather on t3000 in trace, I noticed that 2 of the 8 chips have a dispatch time that is, on average, around 10k-12k shorter than the others. It's definitely worth understanding the cause of this, as it may point to something we can leverage for improved perf across the board. Note that the numbers above are from before @xuncaiTT's dispatch optimization, so it's surprising that for 2 chips, without the optimization, we're seeing nearly the same dispatch time as what Jack was seeing with the runtime arg/kernel creation optimization for all-gather.

SeanNijjar commented 2 months ago

Won't include as dependency but listing for sharing: https://github.com/tenstorrent/tt-metal/issues/12485

tenstorrent / tt-metal

Bfp8 AllGather Llama optimizations #12107
