tenstorrent / tt-metal

:metal: TT-NN operator library, and TT-Metalium low level kernel programming model.
Apache License 2.0

New Op: All Gather Matmul #10460

Closed caixunshiren closed 3 months ago

caixunshiren commented 3 months ago

Description

We propose overlapping the all gather op and the matmul op by computing a partial product of the matmul whenever a chunk is received during the ring all gather. A ring all gather is shown below (beautiful diagram created by @SeanNijjar):

image

At each timestep t, the matmul can be performed on the received chunks using idle cores, which can significantly improve device utilization and reduce latency. Note that the order in which chunks are processed does not matter as long as the partial results accumulate into the output. See the proposed scheme below:

image

Pseudocode is shown below:

image
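
For readers without access to the image, here is a minimal sketch of the same idea in plain Python. It is a simulation under stated assumptions, not the op's implementation. It assumes the activation `A` is sharded along the matmul's inner (K) dimension across the ring and that every device holds the full weight `B`. The function names (`all_gather_matmul`, `matmul`, `add_inplace`) are hypothetical and are not TT-NN APIs. The loop runs sequentially here; on device, the matmul for step t would overlap with the transfer for step t+1.

```python
# Simulation only: plain Python lists stand in for device tensors.
# Assumption: A (M x K) is sharded along K across the ring, and every
# device holds the full B (K x N). All names here are hypothetical.

def matmul(a, b):
    """Naive dense matmul on lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def add_inplace(acc, part):
    """Accumulate a partial product into the output buffer."""
    for acc_row, part_row in zip(acc, part):
        for j, value in enumerate(part_row):
            acc_row[j] += value

def all_gather_matmul(a_shards, b, device):
    """Return A @ B as computed by `device`, one shard per ring step."""
    num_devices = len(a_shards)
    k_per_shard = len(a_shards[0][0])
    m, n = len(a_shards[0]), len(b[0])
    out = [[0] * n for _ in range(m)]
    for t in range(num_devices):
        # At step t the device holds the shard that started t hops upstream
        # (t = 0 is its own local shard).
        src = (device - t) % num_devices
        chunk = a_shards[src]                                  # M x k_per_shard
        b_rows = b[src * k_per_shard:(src + 1) * k_per_shard]  # matching rows of B
        add_inplace(out, matmul(chunk, b_rows))
    return out

# Four devices, each owning a 2 x 2 slice of A; B is 8 x 3.
a_shards = [[[d + 1, d + 2], [d + 3, d + 4]] for d in range(4)]
b = [[r + c for c in range(3)] for r in range(8)]

# Reference: concatenate the shards along K and do a single matmul.
full_a = [sum((shard[i] for shard in a_shards), []) for i in range(2)]
reference = matmul(full_a, b)

# Every device visits the shards in a different order.
results = [all_gather_matmul(a_shards, b, dev) for dev in range(4)]
```

Each device consumes the shards in a different order, yet every entry of `results` matches `reference`, because the partial products are simply summed into the output.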

FYI: @cglagovichTT @SeanNijjar

cglagovichTT commented 3 months ago

Pseudocode looks good. Looks like your diagram shows a somewhat random ordering of timesteps - this might be showing that the op can support whatever arbitrary order the core receives remote data in?

SeanNijjar commented 3 months ago

Closing this as duplicate of https://github.com/tenstorrent/tt-metal/issues/10415