tenstorrent / tt-metal

:metal: TT-NN operator library, and TT-Metalium low level kernel programming model.
Apache License 2.0

Enable Compute + CCL Data Movement Overlap (Fuse All Gather + Matmul) #10415

Closed SeanNijjar closed 1 month ago

SeanNijjar commented 3 months ago

High level issue to track the initial path-finding/clearing effort to build an all-gather + matmul fused operation that overlaps compute + data movement.

Adding notes from a duplicate issue that @xuncaiTT opened

At each step t, matmul can be performed on the received chunks on idle cores, which can significantly improve device utilization and latency. Note that the order in which chunks are processed does not matter as long as the output accumulates. See the proposed scheme below: (image)

Pseudocode is shown below: (image)
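Since the original images are not available here, the following is only a minimal host-side sketch of the idea, not the proposed device implementation. It assumes the all-gather runs along the matmul inner dimension, so each received chunk contributes one partial product that is accumulated into the output. The names `matmul`, `add_into`, and `fused_all_gather_matmul` are hypothetical helpers, not TT-NN or TT-Metalium APIs.

```python
def matmul(a, b):
    """Plain-Python matrix product of a (m x k) and b (k x n)."""
    return [[sum(a[i][p] * b[p][j] for p in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def add_into(acc, partial):
    """Accumulate a partial product into acc in place."""
    for i, row in enumerate(partial):
        for j, value in enumerate(row):
            acc[i][j] += value


def fused_all_gather_matmul(act_chunks, weight_chunks, arrival_order):
    """Accumulate one partial matmul per chunk, in the order chunks arrive.

    act_chunks[c]    : m x k_c slice of the activation (columns owned by chip c)
    weight_chunks[c] : k_c x n slice of the weight (the matching rows)
    arrival_order    : chunk indices in the order they become available
    """
    m, n = len(act_chunks[0]), len(weight_chunks[0][0])
    out = [[0] * n for _ in range(m)]
    for c in arrival_order:
        # On device, this partial product would run while later chunks are still in flight.
        add_into(out, matmul(act_chunks[c], weight_chunks[c]))
    return out


# 2 x 4 activation split column-wise into 4 chunks of width 1 (assumed layout).
A = [[1, 2, 3, 4], [5, 6, 7, 8]]
W = [[1, 0], [0, 1], [2, 3], [4, 5]]
act_chunks = [[[row[c]] for row in A] for c in range(4)]
weight_chunks = [[W[c]] for c in range(4)]

reference = matmul(A, W)
in_order = fused_all_gather_matmul(act_chunks, weight_chunks, [0, 1, 2, 3])
shuffled = fused_all_gather_matmul(act_chunks, weight_chunks, [2, 0, 3, 1])
```

Both `in_order` and `shuffled` equal `reference`, which illustrates why the chunk processing order is free to follow whatever the all-gather delivers first.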

FYI @cglagovich @xuncaiTT @avoraTT

SeanNijjar commented 3 months ago

@xuncaiTT I assume you are aware of this, but I didn't see it mentioned, so I want to state it explicitly: the chunk schedule will be unique per chip. We will also need to coordinate a little, because the CCL and compute are somewhat coupled. Given how the all-gather works, one direction ends up sending one more chunk than the other, so you would prefer to start computing on the first chunk from a specific direction.
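To make the per-chip schedule concrete, here is a rough sketch under stated assumptions: a bidirectional ring all-gather in which each chip receives one chunk per direction per step, and the direction labelled "forward" carries the extra chunk when the remaining `num_chips - 1` chunks do not split evenly. The function name, the direction labels, and the choice of which direction gets the extra chunk are all illustrative, not taken from the CCL implementation.

```python
def chunk_arrival_schedule(chip_id, num_chips):
    """Return [(step, direction, source_chip), ...] for one chip.

    Assumes a bidirectional ring; the "forward" direction is assumed to
    carry the extra chunk when num_chips - 1 is odd.
    """
    remaining = num_chips - 1
    forward_count = (remaining + 1) // 2   # receives the extra chunk, if any
    backward_count = remaining // 2
    schedule = []
    for step in range(1, forward_count + 1):
        # A chunk travelling forward originated `step` hops behind this chip.
        schedule.append((step, "forward", (chip_id - step) % num_chips))
        if step <= backward_count:
            # A chunk travelling backward originated `step` hops ahead.
            schedule.append((step, "backward", (chip_id + step) % num_chips))
    return schedule


schedule_chip0 = chunk_arrival_schedule(0, 4)
schedule_chip1 = chunk_arrival_schedule(1, 4)
```

With four chips, chip 0 sees chunks from chips 3, 1, 2 and chip 1 sees chunks from chips 0, 2, 3, so the compute side would need a different chunk order on each chip, starting from the direction that delivers more chunks.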

SeanNijjar commented 1 month ago

We can close this now, right?