newling opened 5 months ago
After speaking with @yzhang93 it sounds like the reason to retain both local buffers for C is that they might have different shapes if, for example, the elementwise operation (a linalg.generic) contains a transpose. Example:
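(A minimal hypothetical sketch of such a case, not the example from the original discussion; the tile shapes and function name are made up. The elementwise consumer adds a bias through a transposed output indexing map, so its result buffer has shape 64x32 while the matmul accumulator has shape 32x64, and the two buffers cannot trivially be shared.)

```mlir
// Hypothetical example (shapes are made up): the elementwise consumer adds a
// bias and transposes, so its result shape (64x32) differs from the matmul
// accumulator shape (32x64).
#id    = affine_map<(d0, d1) -> (d0, d1)>
#bcast = affine_map<(d0, d1) -> (d1)>
#tr    = affine_map<(d0, d1) -> (d1, d0)>
func.func @bias_add_transposed(%acc: tensor<32x64xf32>,
                               %bias: tensor<64xf32>) -> tensor<64x32xf32> {
  %init = tensor.empty() : tensor<64x32xf32>
  %res = linalg.generic {indexing_maps = [#id, #bcast, #tr],
                         iterator_types = ["parallel", "parallel"]}
      ins(%acc, %bias : tensor<32x64xf32>, tensor<64xf32>)
      outs(%init : tensor<64x32xf32>) {
  ^bb0(%a: f32, %b: f32, %out: f32):
    %sum = arith.addf %a, %b : f32
    linalg.yield %sum : f32
  } -> tensor<64x32xf32>
  return %res : tensor<64x32xf32>
}
```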
I guess long-term we can try other solutions. It's also possible that something lower in the stack will already optimize this away.
(pack-peel pipeline)
The final allocations for a matmul (M=N=1024, K=512) with a bias (a 1-D vector of 1024 elements) are:
There are 2 allocations in local memory (memory space '2' above) for the result tensor. The first (%alloc) is used to accumulate the matmul, and the second (%alloc_7) is used to store the result of adding the bias (%alloc_1) to the matmul accumulation (%alloc).
Ideally the addition of the bias to the matmul accumulator could be done in place, i.e. %alloc_7 would reuse %alloc.
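To make the pattern concrete, here is a minimal sketch (the tile shapes, function name, and surrounding ops are assumptions for illustration; this is not the actual IR produced by the pack-peel pipeline). The two result buffers in memory space 2 mirror the %alloc / %alloc_7 pair described above:

```mlir
// Hypothetical sketch (tile shapes and names are illustrative, not the
// actual IR): two local (memory space 2) buffers are kept for the result.
#id    = affine_map<(d0, d1) -> (d0, d1)>
#bcast = affine_map<(d0, d1) -> (d1)>
func.func @two_result_buffers(%lhs: memref<64x32xf32, 2>,
                              %rhs: memref<32x64xf32, 2>,
                              %bias: memref<64xf32, 2>) {
  %cst = arith.constant 0.0 : f32
  // First local result buffer: the matmul accumulator.
  %alloc = memref.alloc() : memref<64x64xf32, 2>
  linalg.fill ins(%cst : f32) outs(%alloc : memref<64x64xf32, 2>)
  linalg.matmul ins(%lhs, %rhs : memref<64x32xf32, 2>, memref<32x64xf32, 2>)
                outs(%alloc : memref<64x64xf32, 2>)
  // Local bias buffer (%alloc_1 in the issue), filled with a copy here.
  %alloc_1 = memref.alloc() : memref<64xf32, 2>
  linalg.copy ins(%bias : memref<64xf32, 2>) outs(%alloc_1 : memref<64xf32, 2>)
  // Second local result buffer (%alloc_7 in the issue): holds accumulator
  // plus bias. Ideally this allocation would be elided and the addition
  // would write back into %alloc instead.
  %alloc_7 = memref.alloc() : memref<64x64xf32, 2>
  linalg.generic {indexing_maps = [#id, #bcast, #id],
                  iterator_types = ["parallel", "parallel"]}
      ins(%alloc, %alloc_1 : memref<64x64xf32, 2>, memref<64xf32, 2>)
      outs(%alloc_7 : memref<64x64xf32, 2>) {
  ^bb0(%a: f32, %b: f32, %out: f32):
    %sum = arith.addf %a, %b : f32
    linalg.yield %sum : f32
  }
  memref.dealloc %alloc : memref<64x64xf32, 2>
  memref.dealloc %alloc_1 : memref<64xf32, 2>
  memref.dealloc %alloc_7 : memref<64x64xf32, 2>
  return
}
```

In the ideal in-place form, the linalg.generic would take only the bias buffer as an input and use %alloc as its outs operand, so %alloc_7 would never be allocated.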