Closed by SeanNijjar 1 month ago
Progress is currently being tracked on snijjar/aho/all-gather-v4. This will change when the PRs start coming.
Currently I'm blocked from getting my train of commits in until the tunneling changes are main-lined: there are conflicts between our respective changes, and I don't want to risk delaying tunneling from main-lining, since my changes are lower priority.
Changes on branch:
- Ring All Gather
Targeted Test Cases:
The following test cases were identified as needed by certain priority models and must pass for each item above to be considered done (in addition to any other test cases in the test suite).
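As a hedged sketch of the semantics these tests exercise, the following numpy model shows the input/output relationship of a 4-chip ring all-gather: each chip starts with a 1/4 slice along the gather dim, and ends with the canonical shape. The shapes, the `gather_dim` value, and the function names here are illustrative assumptions, not the actual op API.

```python
import numpy as np

NUM_CHIPS = 4  # ring size used by the targeted tests

def split_across_chips(full, gather_dim):
    """Shard the canonical tensor: each chip holds a 1/4 slice along gather_dim."""
    return np.split(full, NUM_CHIPS, axis=gather_dim)

def ring_all_gather(shards, gather_dim):
    """Functional model of all-gather: every chip reassembles the canonical shape.

    A real ring implementation forwards chunks hop by hop around the ring;
    this only models the input/output relationship the tests check.
    """
    gathered = np.concatenate(shards, axis=gather_dim)
    return [gathered.copy() for _ in range(NUM_CHIPS)]

# Illustrative canonical shape (1, 1, 32, 128), gather dim 3:
# each chip starts with a (1, 1, 32, 32) chunk.
full = np.arange(1 * 1 * 32 * 128, dtype=np.float32).reshape(1, 1, 32, 128)
shards = split_across_chips(full, gather_dim=3)
outputs = ring_all_gather(shards, gather_dim=3)
assert all(out.shape == full.shape for out in outputs)
assert all(np.array_equal(out, full) for out in outputs)
```

The per-row test cases in the table below follow the same pattern with different canonical shapes and gather dims.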
Interpreting this table: currently the all-gather is run on a 4-chip ring. For each row, split the tensor shape by 4 along the gather dim; a tensor chunk of that size will live on each chip at test start. The all-gather will collect the chunks along the gather dim back into the canonical shape.

Linear Allgather
Linear allgather ops are needed for some configurations (especially Falcon40B and Llama 2 on galaxy) where matmuls may be inner-dim parallelized and the partial outputs must be accumulated. Those accumulations will run only along rows/columns, not around a ring or across the mesh. In that case, we must support a linear-topology all-gather, because a single galaxy (or t3000) doesn't implement a torus.
This will be used as a stepping stone toward linear all-reduce.
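To show why a linear all-gather is a stepping stone to linear all-reduce, here is a hedged numpy sketch of the inner-dim-parallel matmul case described above: each chip computes a partial output over a 1/4 slice of the inner (K) dim, a linear all-gather distributes all partials along the row/column, and a local sum then recovers the full matmul on every chip. All names and shapes are illustrative; this is a functional model, not the device implementation.

```python
import numpy as np

NUM_CHIPS = 4  # chips along one row/column of the machine

def matmul_inner_dim_parallel(a, b):
    """Each chip multiplies a 1/4 slice of the inner (K) dim -> partial [M, N] output."""
    a_slices = np.split(a, NUM_CHIPS, axis=1)   # [M, K/4] per chip
    b_slices = np.split(b, NUM_CHIPS, axis=0)   # [K/4, N] per chip
    return [ai @ bi for ai, bi in zip(a_slices, b_slices)]

def linear_all_gather(partials):
    """Functional model of a linear-topology all-gather along the row/column:
    every chip ends up with all partial outputs (stacked on a new axis)."""
    stacked = np.stack(partials)                # [NUM_CHIPS, M, N]
    return [stacked.copy() for _ in range(NUM_CHIPS)]

rng = np.random.default_rng(0)
a = rng.standard_normal((32, 128)).astype(np.float32)
b = rng.standard_normal((128, 64)).astype(np.float32)

partials = matmul_inner_dim_parallel(a, b)
gathered = linear_all_gather(partials)
# All-gather followed by a local sum is an all-reduce:
# every chip recovers the full matmul result.
reduced = [g.sum(axis=0) for g in gathered]
assert all(np.allclose(r, a @ b, atol=1e-3) for r in reduced)
```

A fused linear all-reduce would accumulate chunks as they pass along the line instead of gathering then summing, but the accumulation pattern it must produce is the one modeled here.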