tenstorrent / tt-metal

:metal: TT-NN operator library, and TT-Metalium low level kernel programming model.
Apache License 2.0

Bfp8 AllGather Llama optimizations #12107

Open cglagovichTT opened 2 weeks ago

cglagovichTT commented 2 weeks ago

Issue tracking the difference between the expected bfp8_b AllGather perf in Llama (24 µs) and the values measured with tracing (32 µs)

Note: below added by @SeanNijjar

SeanNijjar commented 1 week ago

Started taking a look at this. I'm noticing some huge swings run-to-run with the all-gather perf in trace mode (15k+ cycles). My initial suspicion is that because I haven't enabled the handshaking optimization (which avoids context switching), we're really vulnerable to context switches here and are seeing large swings because of that. This would make sense in trace mode, where all chips start at nearly the same time.

For non-trace, the last chip to be programmed will not need to wait for the handshake at all and so will be able to proceed right away without context switching. I'll try putting in the less context-switch-eager handshake mode and see what we get.

SeanNijjar commented 1 week ago

Reproduced the behaviour with a single all-gather op, which seems to confirm that it was a measurement issue that got us to the 23k cycles.

For example, in the initial run during trace capture, I see 25k cycles for the op but in the trace I see over 30k cycles (upwards of 45k cycles in some cases).

However, one interesting thing I saw was that even without @xuncaiTT's dispatch optimization, 2 of the 8 chips had a 10k-cycle shorter op-to-op latency than all the other chips:

image

Pinged @tt-asaigal about this offline.

What I suspect is this delta is caused by the variation in op-to-op latency of the first trace invocation - e.g. 6504983 for the "faster" ones and roughly 6516367 for all the slower ones - a roughly 10-12k gap.
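As a quick check of the arithmetic, the two op-to-op latencies quoted above can be subtracted directly. The variable names are only for illustration, and the values are in whatever units the profiler reported them.

```python
# Op-to-op latencies of the first trace invocation, as quoted above.
fast_chip_latency = 6504983  # one of the 2 "faster" chips
slow_chip_latency = 6516367  # roughly what the other chips showed

delta = slow_chip_latency - fast_chip_latency
print(delta)  # 11384, inside the 10-12k range
```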

If the above is the cause for the delta, then it could be that this 10-12k delta "persists" across iterations? This doesn't really explain the delta between the traced and non-traced numbers though. That being said, it does indicate that the non-traced version is likely benefitting from dispatch overlaps -> the faster chips are dispatched and waiting for the slower chips and while they are, they can partially "execute" the op for some of the chips while the slower chip is still being programmed with the kernels.

SeanNijjar commented 1 week ago

There are a few things happening here that are all separate but end up feeding into this delta between untraced all-gather kernel time for all-gather standalone vs traced kernel time of all-gather:

1) CCL perf measurement methodology issue
2) Eth dataflow APIs (too) eagerly context switching
3) Inefficient EDM handshaking (mostly due to the above, but also because we actually need to handshake, and because of low-level details about when/how the handshake is started)
4) Dispatch time variability between chips in trace mode (the comment above)

1. Perf Eval Methodology

Previous performance numbers were reported without trace enabled. This was done by taking the FW duration times from the last chip to receive the program for the CCL op and using that. The problem with this approach is that it leaves too much time for dispatch on other devices to overlap with compute on chips that start their programs earlier. A diagram below outlines this basic problem. Fundamentally, this problem will always exist with CCLs although to a far smaller degree when tracing is enabled.

image

In the example above, for sake of discussion, assume we are running ring all-gather on 4 chips and the order that devices receive "go" signals in is 0, 1, 3, 2 (picked this order to show this effect in the worst case).
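To make the effect concrete, here is a toy model, not the real dispatch or EDM behaviour. It assumes a unidirectional ring where chip i can only receive its next chunk once both it and its upstream neighbour hold what they need, and every hop costs the same hypothetical `hop` time. The function name, the start times, and the hop cost are all made up for illustration.

```python
def ring_all_gather_durations(start, hop):
    """Per-chip op duration (finish - start) for a toy unidirectional ring.

    start[i] is when chip i receives its "go" signal; hop is the assumed
    cost of moving one chunk across one link.
    """
    n = len(start)
    # have[i]: time at which chip i holds the chunk from d hops upstream.
    have = list(start)  # d = 0: a chip holds its own chunk once it starts
    for _ in range(1, n):
        # The upstream neighbour must hold the chunk, and chip i must have
        # finished receiving the previous one (or have started at all).
        have = [max(have[(i - 1) % n], have[i]) + hop for i in range(n)]
    return [have[i] - start[i] for i in range(n)]


# Go-signal order 0, 1, 3, 2 with 5 time units between devices.
skewed = ring_all_gather_durations([0, 5, 15, 10], hop=10)
aligned = ring_all_gather_durations([0, 0, 0, 0], hop=10)
print(skewed)   # [45, 40, 30, 35]
print(aligned)  # [30, 30, 30, 30]
```

In this model the last chip to start (chip 2) reports the shortest duration, because all the time the other chips spent waiting is excluded from its measurement. The model does not capture earlier chips partially executing the op while later chips are still being programmed, which would make the last chip's number look better still.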

Moving forward, all CCL op measurement (at least for the purpose of handoff) should be done exclusively in trace mode.

2. Eth Dataflow APIs Are Overly Eager to Context Switch

The default eth dataflow APIs will context switch immediately when the eth tx cmd queues are full or messages are not yet available on a channel. The main EDM loop does not use these APIs, but its handshake code still does. This noticeably hurts both performance and variability.

3. EDM Handshaking Inefficiency
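A toy timing model of why eager switching hurts, assuming a context switch has a fixed round-trip cost before the waiter can check again. The function names and all cycle counts are hypothetical and are not taken from the eth dataflow APIs.

```python
def wait_eager(ready_at, switch_cost):
    """Yield on every failed check; each yield costs switch_cost."""
    now = 0
    while now < ready_at:
        now += switch_cost  # switched out, can only re-check after returning
    return now


def wait_spin_then_yield(ready_at, poll_cost, spin_budget, switch_cost):
    """Poll locally for up to spin_budget, then fall back to yielding."""
    now = 0
    while now < ready_at:
        now += poll_cost if now < spin_budget else switch_cost
    return now


# Hypothetical numbers: the peer becomes ready after 300 cycles.
eager = wait_eager(ready_at=300, switch_cost=2000)
spin = wait_spin_then_yield(ready_at=300, poll_cost=10,
                            spin_budget=1000, switch_cost=2000)
print(eager, spin)  # 2000 300
```

With these made-up numbers, the eager waiter observes the event only after a full context-switch round trip, while the spinning waiter sees it almost as soon as it happens. The eager result also depends on how the ready time lines up with the switch interval, which is one way run-to-run variability can appear.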

In addition to #2 above, the EDM handshake itself can be implemented more efficiently. For example, the handshake is currently both initiated and completed only after the EDM channel args are read and the channels are initialized. Some of the handshake can be executed before any args are read or channels initialized (especially for whoever is designated as the "master" in the handshake).
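The potential saving can be sketched with a simple latency model. It assumes the handshake splits into an early part that can run concurrently with arg reading and channel init, and a late part that must follow both. The names and numbers are hypothetical and are not measured EDM costs.

```python
def startup_sequential(setup, handshake):
    """Handshake starts only after args are read and channels initialised."""
    return setup + handshake


def startup_overlapped(setup, handshake, early_part):
    """early_part of the handshake runs concurrently with setup;
    the remainder runs once both setup and the early part are done."""
    assert 0 <= early_part <= handshake
    return max(setup, early_part) + (handshake - early_part)


# Hypothetical cycle counts.
seq = startup_sequential(setup=1500, handshake=2000)
ovl = startup_overlapped(setup=1500, handshake=2000, early_part=1200)
print(seq, ovl, seq - ovl)  # 3500 2300 1200
```

In this model the saving is at most the smaller of the setup time and the part of the handshake that can be issued early.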

I have put in an initial improvement of the handshake addressing point 2 above as well as a mostly improved implementation and am seeing a 3-4.2k cycle improvement for this particular all-gather, when in trace mode. There is still a further tweak I can apply on top of this which will slightly improve consistency in performance here (will benefit ~1/3 of handshakes if the tracy diagram is an accurate representation) but is only expected to save maybe 200-300 ns in those cases - nothing earth shattering with that one.

4. Dispatch Time Variability In Trace

As mentioned in the earlier comment, when running this all-gather on t3000 in trace, I noticed that 2 of the 8 chips have, on average, a dispatch time around 10k-12k shorter than the others. It's definitely worth understanding the cause of this, as it may point to something we can leverage for improved perf across the board. Note that the numbers above are from before @xuncaiTT's dispatch optimization, so it's surprising that for 2 chips, without the optimization, we're seeing nearly the same dispatch time as what Jack was seeing with the runtime arg/kernel creation optimization for all-gather.

SeanNijjar commented 4 days ago

Won't include as dependency but listing for sharing: https://github.com/tenstorrent/tt-metal/issues/12485