Open johanna-rock-tt opened 1 month ago
Initially: activation sharded across 24 cores
Repeat at 2.
Let's assume in1 is interleaved in DRAM at the moment?
Yes, let's assume in1 is in DRAM. For measuring perf we can comment out in1 to mimic it already being local on the cores in the correct format/order.
Let's collect all information, assumptions, and plans for the implementation here.
Problem description: Mcasting in0 for matmul1D is a bottleneck due to NOC congestion.
Proposed solution: Ring all-gather in0. In each step of the matmul computation, each core processes its current local chunk of in0 with the corresponding weight chunk and, in parallel, sends that chunk to the next node in the ring.
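To make the data flow of the proposal concrete, here is a minimal host-side simulation in plain Python. It is only a sketch under assumptions that are not stated in the issue: in0 is split along K into one chunk per core, each core holds a column slice of in1 and produces the matching output columns, and the "send" is modeled as a list rotation after each step rather than overlapping with compute. The names `ring_all_gather_matmul` and `matmul` are hypothetical and do not correspond to any existing op or kernel API.

```python
def matmul(a, b):
    """Plain reference matmul on lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def ring_all_gather_matmul(in0, in1, num_cores):
    """Simulate a ring all-gather of in0 fused with a 1D matmul.

    Assumed layout (illustrative only):
      - core i starts with in0[:, i*kc:(i+1)*kc] (activation split along K)
      - core i holds in1[:, i*nc:(i+1)*nc] and owns those output columns
    """
    m, k, n = len(in0), len(in1), len(in1[0])
    assert len(in0[0]) == k
    assert k % num_cores == 0 and n % num_cores == 0
    kc, nc = k // num_cores, n // num_cores

    # Per-core state: current in0 chunk, which K chunk it is, weights, accumulator.
    local = [[row[i * kc:(i + 1) * kc] for row in in0] for i in range(num_cores)]
    chunk_id = list(range(num_cores))
    weights = [[row[i * nc:(i + 1) * nc] for row in in1] for i in range(num_cores)]
    acc = [[[0] * nc for _ in range(m)] for _ in range(num_cores)]

    for _ in range(num_cores):
        # Compute: every core multiplies its current chunk with the matching weight rows.
        for i in range(num_cores):
            c = chunk_id[i]
            partial = matmul(local[i], weights[i][c * kc:(c + 1) * kc])
            for r in range(m):
                for j in range(nc):
                    acc[i][r][j] += partial[r][j]
        # Send: core i passes its chunk to core i+1 (on hardware this would overlap
        # with the compute above; the rotation after the last step is redundant).
        local = [local[(i - 1) % num_cores] for i in range(num_cores)]
        chunk_id = [chunk_id[(i - 1) % num_cores] for i in range(num_cores)]

    # Concatenate the per-core output column slices.
    return [sum((acc[i][r] for i in range(num_cores)), []) for r in range(m)]
```

After `num_cores` steps every core has seen every in0 chunk exactly once, so the concatenated result should match a plain `matmul(in0, in1)`, while each step only moves data between ring neighbours instead of multicasting in0 to all cores.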