tenstorrent / tt-metal

:metal: TT-NN operator library, and TT-Metalium low level kernel programming model.
Apache License 2.0

All Gather Support #6039

Closed SeanNijjar closed 1 month ago

SeanNijjar commented 7 months ago

Ring All Gather

Targeted Test Cases:

The following test cases were identified as needed by certain priority models. Each item above is considered done only when these cases pass, in addition to any other test cases in the test suite.

Interpreting this table: the all-gather currently runs on a 4-chip ring. For each row, divide the canonical shape by 4 along the gather dim; a tensor chunk of that size lives on each chip at test start. The all-gather then collects the chunks along the gather dim into the canonical shape.

Canonical Shape, Gather Dim
[1, 1, 32, 32768], 3
[1, 1, 32, 32768], 3
[1, 1, 32, 16384], 3
[1, 1, 32, 16384], 3
[1, 1, 32, 8192], 3
[1, 1, 32, 8192], 3
[1, 1, 32, 4096], 3
[1, 1, 32, 4096], 3
[1, 1, 2048, 8192], 3
[1, 1, 2048, 8192], 3
[1, 1, 2048, 4096], 3
[1, 1, 2048, 4096], 3
[1, 1, 2048, 32768], 3
[1, 1, 2048, 32768], 3
[1, 1, 2048, 16384], 3
[1, 1, 2048, 16384], 3
[1, 1, 32768, 32768], 3
[1, 1, 32768, 32768], 3
[1, 1, 32768, 16384], 3
[1, 1, 32768, 16384], 3
[1, 1, 128, 1024], 2
[1, 1, 128, 1024], 2
[1, 1, 128, 4096], 2
[1, 1, 128, 4096], 2
[1, 1, 8192, 32], 2
[1, 1, 8192, 32], 2
[1, 1, 1024, 128], 3
[1, 1, 1024, 128], 3
[1, 1, 16384, 32], 2
[1, 1, 16384, 32], 2
[1, 1, 4096, 128], 3
[1, 1, 4096, 128], 3
[1, 1, 128, 2048], 2
[1, 1, 128, 2048], 2
[1, 1, 128, 8192], 2
[1, 1, 128, 8192], 2
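
The interpretation rule given above the table can be sketched in a few lines of Python. This is only an illustration of the shape arithmetic, assuming a 4-chip ring as stated; the helper names `per_chip_shape` and `gathered_shape` are hypothetical and are not part of tt-metal.

```python
def per_chip_shape(canonical_shape, gather_dim, num_chips=4):
    """Shape of the chunk that lives on each chip at test start."""
    size = canonical_shape[gather_dim]
    if size % num_chips != 0:
        raise ValueError(
            f"dim {gather_dim} of size {size} does not divide evenly by {num_chips}"
        )
    chunk = list(canonical_shape)
    chunk[gather_dim] = size // num_chips
    return chunk


def gathered_shape(chunk_shape, gather_dim, num_chips=4):
    """Shape after the all-gather concatenates one chunk per chip."""
    out = list(chunk_shape)
    out[gather_dim] *= num_chips
    return out


# Two rows taken from the table: (canonical shape, gather dim).
for canonical, dim in [([1, 1, 32, 32768], 3), ([1, 1, 8192, 32], 2)]:
    chunk = per_chip_shape(canonical, dim)
    # Gathering the per-chip chunks must reproduce the canonical shape.
    assert gathered_shape(chunk, dim) == canonical
    print(canonical, "gather dim", dim, "-> per-chip chunk", chunk)
```

For example, the row `[1, 1, 32, 32768], 3` starts as a `[1, 1, 32, 8192]` chunk on each of the 4 chips.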

Linear All Gather

Linear all-gather ops are needed for some configurations (especially Falcon40B and Llama 2 on Galaxy) where matmuls may be parallelized along the inner dim and the partial outputs must be accumulated. Those accumulations run only along rows or columns, not around a ring or across a mesh. In that case we must support a linear-topology all-gather, because a single Galaxy (or T3000) doesn't implement a torus.

This will be used as a stepping stone toward linear all-reduce.
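
To make the difference from the ring case concrete, here is a minimal pure-Python simulation of an all-gather over chips arranged in a line with no wraparound link. It is a sketch under assumed behavior (each chip forwards chunks to both neighbors, one hop per step), not the tt-metal implementation; `linear_all_gather` is a hypothetical name.

```python
def linear_all_gather(chunks):
    """Simulate an all-gather on a line of chips (no wraparound link).

    chunks[i] is the data chip i holds at the start. On every step each chip
    forwards, in each direction, the chunk it most recently received from the
    opposite side (its own chunk on the first step).
    """
    n = len(chunks)
    # gathered[i] maps source chip index -> chunk, as seen by chip i.
    gathered = [{i: chunks[i]} for i in range(n)]
    # Source index of the chunk each chip sends next, per direction.
    send_right = list(range(n))
    send_left = list(range(n))

    # The end chips are n - 1 hops apart, so n - 1 steps are needed.
    for _ in range(n - 1):
        next_right = [None] * n
        next_left = [None] * n
        for i in range(n):
            src = send_right[i]
            if src is not None and i + 1 < n:  # the last chip has no right neighbor
                gathered[i + 1][src] = chunks[src]
                next_right[i + 1] = src
            src = send_left[i]
            if src is not None and i - 1 >= 0:  # the first chip has no left neighbor
                gathered[i - 1][src] = chunks[src]
                next_left[i - 1] = src
        send_right, send_left = next_right, next_left

    # Every chip ends with all chunks, ordered by source chip index.
    return [[g[j] for j in range(n)] for g in gathered]


result = linear_all_gather(["c0", "c1", "c2", "c3"])
assert all(row == ["c0", "c1", "c2", "c3"] for row in result)
```

Because there is no link between the two end chips, a chunk from one end needs `n - 1` hops to reach the other end, and chunks must flow in both directions for every chip to see all of them.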

SeanNijjar commented 7 months ago

Progress is currently being tracked on snijjar/aho/all-gather-v4. This'll change when the PRs start coming.

Currently I'm blocked from getting my train of commits in until the tunneling changes are mainlined. Our changes conflict in places, and I don't want to risk delaying tunneling from mainlining; my changes are lower priority.

Changes on branch: