Ring all gather for in0 in matmul 1D

tenstorrent / tt-metal

TT-NN operator library, and TT-Metalium low level kernel programming model.

Apache License 2.0


Ring all gather for in0 in matmul 1D #12995

Open johanna-rock-tt opened 1 month ago

johanna-rock-tt commented 1 month ago

Problem description: Multicasting (mcasting) in0 for matmul 1D is a bottleneck due to NOC congestion.

Proposed solution: Ring all-gather in0. In each step of the matmul computation, each core processes its current local chunk with the corresponding weight chunk and, in parallel, sends that chunk to the next node in the ring.

johanna-rock-tt commented 1 month ago

1. Initially: the activation (in0) is sharded across 24 cores.
2. Compute on the local chunk and the corresponding weight chunk (this requires knowing the step id), and accumulate into the output.
3. Send the local chunk point-to-point along the ring (of 24 cores) to the next core.
4. Repeat from step 2.
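
To make the steps above concrete, here is a minimal host-side sketch in plain Python. It is not tt-metal kernel code, and it rests on assumptions not stated in the thread: in0 is sharded along the inner (K) dimension, each core owns one column block of the output, and each core can already see the in1 rows it needs. The names `ring_matmul_1d` and `matmul` are hypothetical. Compute and send run sequentially here, whereas the proposal overlaps them.

```python
def matmul(a, b):
    """Plain matrix product of two lists-of-rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def ring_matmul_1d(in0, in1, num_cores):
    """Simulate the ring scheme; returns one output column block per core."""
    m, k, n = len(in0), len(in1), len(in1[0])
    assert k % num_cores == 0 and n % num_cores == 0
    kc, nc = k // num_cores, n // num_cores

    # Assumption: core c starts with K-chunk c of in0 (the sharded activation).
    local = [[row[c * kc:(c + 1) * kc] for row in in0] for c in range(num_cores)]
    # Assumption: core c accumulates output columns [c*nc, (c+1)*nc).
    out = [[[0] * nc for _ in range(m)] for _ in range(num_cores)]

    for step in range(num_cores):
        for c in range(num_cores):
            # The step id tells core c which chunk it holds: the one that
            # started on core (c - step) mod num_cores.
            src = (c - step) % num_cores
            # Matching weight chunk: in1 rows of that K-chunk, columns of core c.
            w = [r[c * nc:(c + 1) * nc] for r in in1[src * kc:(src + 1) * kc]]
            partial = matmul(local[c], w)
            for i in range(m):
                for j in range(nc):
                    out[c][i][j] += partial[i][j]
        # Point-to-point send: core c passes its chunk to core (c + 1) mod num_cores,
        # so core c now holds what core c - 1 held.
        local = [local[(c - 1) % num_cores] for c in range(num_cores)]
    return out
```

After `num_cores` steps every core has seen every in0 chunk exactly once, so concatenating the per-core blocks column-wise should reproduce the full `in0 @ in1` product.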

avoraTT commented 1 month ago

Let's assume in1 is interleaved in DRAM at the moment?

johanna-rock-tt commented 1 month ago

Yes, let's assume in1 is in DRAM. For measuring perf, we can comment out the in1 read so that we mimic in1 already being local on the cores in the correct format/order.

johanna-rock-tt commented 1 month ago

Let's collect all information, assumptions, and plans for the implementation here.