SeanNijjar commented 3 months ago

High level issue to track the initial path-finding/clearing effort to build an all-gather + matmul fused operation that overlaps compute + data movement.

[x] #10416
- so the matmul can get full chip input chunks at a time
[x] Support all-gather targeting sub-region of worker grid with offset
[x] Add new test mode for fusable-all-gather (all-gather portion)
- All gather fuses with datacopy op.
- Datacopy op expects 1/ring_size input each timestep. Compare output tensor for datacopy to ensure it alternates ring_index inputs
[x] Add new test mode for fusable-all-gather (Matmul portion)
- matmul fuses with datacopy op
- datacopy op reads input tensor in chunk order that will match all-gather (alternating chip indices)
- These are the activations fed in to matmul - matmul must update weight read order accordingly and produce correct output
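To make the "alternating chip indices" order concrete, here is a minimal sketch of what one chip's chunk order could look like. It assumes a bidirectional ring in which the local chunk is available first and remote chunks then arrive alternately from the two neighbors, one hop further each time. The function name `chunk_arrival_order` and the choice of which direction goes first are hypothetical, not taken from the all-gather implementation.

```python
def chunk_arrival_order(ring_index, ring_size):
    """Hypothetical order in which all-gather chunks become available on one chip."""
    order = [ring_index]  # the chip's own chunk needs no data movement
    for hop in range(1, ring_size // 2 + 1):
        # Assumption: the "backward" neighbor's chunk lands before the "forward" one.
        for source in ((ring_index - hop) % ring_size, (ring_index + hop) % ring_size):
            if source not in order:  # for even ring sizes the last hop is the same chip
                order.append(source)
    return order


# Each chip sees every ring_index exactly once, i.e. 1/ring_size of the input per timestep.
for chip in range(4):
    print(chip, chunk_arrival_order(chip, 4))
```

A datacopy-based test could compare the chunk order it observes against a list like this, and the matmul side could use the same list to decide which weight slice to read at each timestep.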

Adding notes from a duplicate issue that @xuncaiTT opened

At each timestep t, matmul can be performed on the received chunks on idle cores, which can significantly improve device utilization and latency. Note that the order does not matter as long as the output accumulates. See the proposed scheme below:

Pseudocode is shown below:
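As a rough illustration of why the chunk order does not matter (this is a sketch, not the scheme from the duplicate issue), the snippet below assumes the all-gather concatenates activations along the inner (K) dimension, so each chip's chunk pairs with one K-slice of the weights and the partial products are accumulated. All names, shapes, and the plain-Python `matmul` helper are illustrative assumptions.

```python
import random


def matmul(a, b):
    """Plain-Python matrix multiply on lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add(a, b):
    """Elementwise sum of two equally shaped matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


ring_size, m, k_per_chip, n = 4, 2, 3, 2
rng = random.Random(0)

# One activation chunk [m, k_per_chip] and one weight slice [k_per_chip, n] per chip.
act_chunks = [[[rng.randint(-3, 3) for _ in range(k_per_chip)] for _ in range(m)]
              for _ in range(ring_size)]
weight_slices = [[[rng.randint(-3, 3) for _ in range(n)] for _ in range(k_per_chip)]
                 for _ in range(ring_size)]


def fused_matmul(arrival_order):
    """Accumulate partial products in whatever order the chunks arrive."""
    out = [[0] * n for _ in range(m)]
    for chip in arrival_order:
        # The weight read order follows the activation chunk order.
        out = add(out, matmul(act_chunks[chip], weight_slices[chip]))
    return out


# Reference: gather everything first, then do one large matmul.
full_act = [sum((act_chunks[c][r] for c in range(ring_size)), []) for r in range(m)]
full_weights = [row for c in range(ring_size) for row in weight_slices[c]]
reference = matmul(full_act, full_weights)

assert fused_matmul([0, 3, 1, 2]) == reference
assert fused_matmul([2, 1, 3, 0]) == reference
```

Integer inputs keep the comparison exact; with floating point the results from different orders would only match up to rounding.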

FYI @cglagovich @xuncaiTT @avoraTT

SeanNijjar commented 3 months ago

@xuncaiTT I assume you are aware of this, but I didn't see it mentioned, so I want to say it explicitly: the chunk schedule will be unique per chip. We will also need to coordinate a little, because the CCL and compute are somewhat coupled. Given how the all-gather works, one direction will end up sending one more chunk than the other, so you would prefer to start computing on the first chunk from a specific direction.
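A small sketch of that asymmetry, assuming a bidirectional ring where the `ring_size - 1` remote chunks are split as evenly as possible between the two directions. The helper names and the `long_step` parameter (which neighbor side delivers the extra chunk) are hypothetical; the actual choice depends on the all-gather implementation.

```python
def chunks_per_direction(ring_size):
    """Remote chunks delivered by the (longer, shorter) direction."""
    remote = ring_size - 1
    return (remote + 1) // 2, remote // 2


def sources_by_direction(ring_index, ring_size, long_step=1):
    """Chip indices whose chunks arrive from each side, nearest neighbor first.

    long_step is +1 or -1: the neighbor offset on the side assumed to deliver
    the extra chunk (an assumption for illustration).
    """
    long_count, short_count = chunks_per_direction(ring_size)
    long_side = [(ring_index + long_step * hop) % ring_size
                 for hop in range(1, long_count + 1)]
    short_side = [(ring_index - long_step * hop) % ring_size
                  for hop in range(1, short_count + 1)]
    return long_side, short_side


print(chunks_per_direction(8))     # one side carries one more chunk for even ring sizes
print(sources_by_direction(0, 8))  # the schedule differs for every ring_index
print(sources_by_direction(1, 8))
```

Under these assumptions the split is uneven only for even ring sizes, and compute would start with the chunk from the side that has more chunks to deliver.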

SeanNijjar commented 1 month ago

We can close this now, right?

tenstorrent / tt-metal

Enable Compute + CCL Data Movement Overlap (Fuse All Gather + Matmul) #10415
