Closed by SeanNijjar 1 month ago
Progress is currently being tracked on snijjar/aho/all-gather-v4. This will change when the PRs start coming.
Currently I'm blocked from getting my train of commits in until the tunneling changes are main-lined: there are conflicts between our respective changes, and I don't want to risk delaying tunneling from main-lining, since my changes are lower priority.
Changes on branch:
- Ring All Gather
Targeted Test Cases:
The following test cases were identified as needed by certain priority models and must pass for each item above to be considered done (in addition to any other test cases in the test suite).
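As a hedged sketch of the semantics these tests exercise, the following numpy model shows the input/output relationship of a 4-chip ring all-gather: each chip starts with a 1/4 slice along the gather dim, and ends with the canonical shape. The shapes, the `gather_dim` value, and the function names here are illustrative assumptions, not the actual op API.

```python
import numpy as np

NUM_CHIPS = 4  # ring size used by the targeted tests

def split_across_chips(full, gather_dim):
    """Shard the canonical tensor: each chip holds a 1/4 slice along gather_dim."""
    return np.split(full, NUM_CHIPS, axis=gather_dim)

def ring_all_gather(shards, gather_dim):
    """Functional model of all-gather: every chip reassembles the canonical shape.

    A real ring implementation forwards chunks hop by hop around the ring;
    this only models the input/output relationship the tests check.
    """
    gathered = np.concatenate(shards, axis=gather_dim)
    return [gathered.copy() for _ in range(NUM_CHIPS)]

# Illustrative canonical shape (1, 1, 32, 128), gather dim 3:
# each chip starts with a (1, 1, 32, 32) chunk.
full = np.arange(1 * 1 * 32 * 128, dtype=np.float32).reshape(1, 1, 32, 128)
shards = split_across_chips(full, gather_dim=3)
outputs = ring_all_gather(shards, gather_dim=3)
assert all(out.shape == full.shape for out in outputs)
assert all(np.array_equal(out, full) for out in outputs)
```

The per-row test cases in the table below follow the same pattern with different canonical shapes and gather dims.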
Interpreting this table: currently the all-gather is run on a 4-chip ring. For each row, split the tensor shape by 4 along the gather dim; a tensor chunk of that size will live on each chip at test start. The all-gather will collect the chunks along the gather dim back into the canonical shape.

Linear Allgather
Linear allgather ops are needed for some configurations (especially Falcon40B and Llama 2 on galaxy) where matmuls may be inner-dim parallelized and the partial outputs must be accumulated. Those accumulations will run only along rows/columns, not around a ring or across the mesh. In that case, we must support a linear-topology all-gather, because a single galaxy (or t3000) doesn't implement a torus.
This will be used as a stepping stone toward linear all-reduce.
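To show why a linear all-gather is a stepping stone to linear all-reduce, here is a hedged numpy sketch of the inner-dim-parallel matmul case described above: each chip computes a partial output over a 1/4 slice of the inner (K) dim, a linear all-gather distributes all partials along the row/column, and a local sum then recovers the full matmul on every chip. All names and shapes are illustrative; this is a functional model, not the device implementation.

```python
import numpy as np

NUM_CHIPS = 4  # chips along one row/column of the machine

def matmul_inner_dim_parallel(a, b):
    """Each chip multiplies a 1/4 slice of the inner (K) dim -> partial [M, N] output."""
    a_slices = np.split(a, NUM_CHIPS, axis=1)   # [M, K/4] per chip
    b_slices = np.split(b, NUM_CHIPS, axis=0)   # [K/4, N] per chip
    return [ai @ bi for ai, bi in zip(a_slices, b_slices)]

def linear_all_gather(partials):
    """Functional model of a linear-topology all-gather along the row/column:
    every chip ends up with all partial outputs (stacked on a new axis)."""
    stacked = np.stack(partials)                # [NUM_CHIPS, M, N]
    return [stacked.copy() for _ in range(NUM_CHIPS)]

rng = np.random.default_rng(0)
a = rng.standard_normal((32, 128)).astype(np.float32)
b = rng.standard_normal((128, 64)).astype(np.float32)

partials = matmul_inner_dim_parallel(a, b)
gathered = linear_all_gather(partials)
# All-gather followed by a local sum is an all-reduce:
# every chip recovers the full matmul result.
reduced = [g.sum(axis=0) for g in gathered]
assert all(np.allclose(r, a @ b, atol=1e-3) for r in reduced)
```

A fused linear all-reduce would accumulate chunks as they pass along the line instead of gathering then summing, but the accumulation pattern it must produce is the one modeled here.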