Closed caixunshiren closed 3 months ago
Pseudocode looks good. Looks like your diagram shows a somewhat random ordering of timesteps - this might be showing that the op can support whatever arbitrary order the core receives remote data in?
Closing this as duplicate of https://github.com/tenstorrent/tt-metal/issues/10415
Description
We propose overlapping the all-gather op and the matmul op by computing a partial product of the matmul whenever a chunk is received during the ring all-gather. A ring all-gather is shown below (beautiful diagram created by @SeanNijjar):
At each timestep t, idle cores can perform the matmul on the received chunks, which can significantly improve device utilization and latency. Note that the order in which chunks arrive does not matter as long as the partial results accumulate into the output. See the proposed scheme below:
Pseudocode is shown below:
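The idea can also be sketched as a small host-side NumPy simulation. This is only an illustration under stated assumptions, not tt-metal code: it assumes the activation is sharded along the inner (K) dimension across the ring, that every device holds the full weight matrix, and that each hop moves exactly one chunk to the next neighbor. The function name `ring_all_gather_matmul` and its arguments are hypothetical.

```python
import numpy as np

def ring_all_gather_matmul(a_shards, b):
    """Simulate D ring devices (hypothetical helper, not a tt-metal API).

    a_shards[d] is the (M, K/D) slice of the activation owned by device d.
    b is the full (K, N) weight, assumed to be replicated on every device.
    Returns one (M, N) output per device.
    """
    num_devices = len(a_shards)
    k_chunk = a_shards[0].shape[1]
    m, n = a_shards[0].shape[0], b.shape[1]
    outputs = [np.zeros((m, n)) for _ in range(num_devices)]

    # in_flight[d] = (index of the owning device, chunk) currently on device d.
    # At t = 0 every device starts with its own local chunk.
    in_flight = [(d, a_shards[d]) for d in range(num_devices)]

    for t in range(num_devices):
        # Compute phase: each device multiplies the chunk it currently holds
        # by the matching rows of b and accumulates into its output.
        for d in range(num_devices):
            src, chunk = in_flight[d]
            rows = slice(src * k_chunk, (src + 1) * k_chunk)
            outputs[d] += chunk @ b[rows, :]

        # Communication phase: device d forwards its chunk to device d + 1.
        # On hardware this hop would overlap with the compute phase above.
        if t < num_devices - 1:
            in_flight = [in_flight[(d - 1) % num_devices]
                         for d in range(num_devices)]

    return outputs
```

Because each step only adds a partial product into the output, the sum is the same regardless of the order in which chunks arrive, which is the property the proposal relies on. In this sequential simulation compute and communication alternate; the benefit on a device would come from running them concurrently.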
FYI: @cglagovichTT @SeanNijjar