Closed SeanNijjar closed 1 month ago
@xuncaiTT I assume you are aware of this, but I didn't see it mentioned, so I want to state it explicitly: the chunk schedule will be unique per chip. We will also need to coordinate a little, because the CCL and compute are somewhat coupled. Given how the all-gather works, one direction will end up sending one more chunk than the other, so you would prefer to start computing on the first chunk from a specific direction.
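To make the per-chip schedule concrete, here is a minimal sketch of which chunk arrives from which direction at each step. It assumes a bidirectional ring in which chunk `i` originates on chip `i`, and it arbitrarily assigns the extra chunk to the "forward" direction. The function name `chunk_arrival_schedule` and the direction labels are hypothetical, not part of the CCL API.

```python
def chunk_arrival_schedule(chip: int, num_chips: int) -> list[list[tuple[str, int]]]:
    """For each time step, list the (direction, chunk_id) pairs arriving at `chip`.

    Assumptions (not from the issue): chunk i starts on chip i, the ring is
    bidirectional, and the forward link carries the extra chunk when the
    number of remote chunks (num_chips - 1) is odd.
    """
    remote = num_chips - 1
    fwd_count = (remote + 1) // 2  # forward direction delivers ceil((N-1)/2) chunks
    bwd_count = remote // 2        # backward direction delivers floor((N-1)/2) chunks

    steps = []
    for t in range(1, fwd_count + 1):
        # At step t the forward link delivers the chunk that started t hops behind.
        arrivals = [("forward", (chip - t) % num_chips)]
        if t <= bwd_count:
            # The backward link delivers the chunk that started t hops ahead.
            arrivals.append(("backward", (chip + t) % num_chips))
        steps.append(arrivals)
    return steps


if __name__ == "__main__":
    for chip in range(4):
        print(chip, chunk_arrival_schedule(chip, 4))
```

With an even number of chips the last step has an arrival from only one direction, which is why compute would want to know up front which direction carries the extra chunk.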
We can close this now, right?
High-level issue to track the initial path-finding/clearing effort to build a fused all-gather + matmul operation that overlaps compute and data movement.
Adding notes from a duplicate issue that @xuncaiTT opened
At each step t, matmul can be performed on the received chunks on idle cores, which can significantly improve device utilization and latency. Note that the order does not matter as long as the output accumulates. See the proposed scheme below:
Pseudocode is shown below:
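As an illustrative stand-in (not the original pseudocode from the issue), the following host-side sketch simulates the idea in plain Python. It assumes the activation is sharded along the inner (K) dimension, so each arriving chunk contributes a partial product that is accumulated into the output, and it uses a unidirectional ring for simplicity. The names `fused_all_gather_matmul`, `act_chunks`, and `weight_chunks` are hypothetical.

```python
def matmul(a, b):
    """Plain-Python matrix multiply on lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add_into(acc, part):
    """Accumulate `part` into `acc` element-wise, in place."""
    for acc_row, part_row in zip(acc, part):
        for j, value in enumerate(part_row):
            acc_row[j] += value


def fused_all_gather_matmul(chip, act_chunks, weight_chunks):
    """Simulate one chip consuming all-gather chunks as they arrive.

    act_chunks[i]    : M x Kc activation slice that starts on chip i
    weight_chunks[i] : Kc x N weight rows matching that slice
    Assumption: one chunk arrives per step over a unidirectional ring.
    """
    num_chips = len(act_chunks)
    m, n = len(act_chunks[0]), len(weight_chunks[0][0])
    out = [[0] * n for _ in range(m)]

    # t = 0: the local chunk is already resident, so compute on it immediately.
    add_into(out, matmul(act_chunks[chip], weight_chunks[chip]))

    # t >= 1: compute on each chunk as it arrives; arrival order does not
    # change the result because the partial products are summed.
    for t in range(1, num_chips):
        src = (chip - t) % num_chips
        add_into(out, matmul(act_chunks[src], weight_chunks[src]))
    return out
```

Every chip visits the chunks in a different order but ends with the same accumulated output (exactly so for integers, and up to rounding for floating point). On real hardware the transfer of chunk t+1 would overlap with the matmul on chunk t, which this sequential simulation does not model.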
FYI @cglagovich @xuncaiTT @avoraTT