newling opened 5 months ago
After speaking with @yzhang93 it sounds like the reason to retain both local buffers for C is that they might have different shapes if, for example, the elementwise operation (a linalg.generic) contains a transpose. Example:
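(A minimal hypothetical sketch of such a case, not the example from the original discussion; the tile shapes and function name are made up. The elementwise consumer adds a bias through a transposed output indexing map, so its result buffer has shape 64x32 while the matmul accumulator has shape 32x64, and the two buffers cannot trivially be shared.)

```mlir
// Hypothetical example (shapes are made up): the elementwise consumer adds a
// bias and transposes, so its result shape (64x32) differs from the matmul
// accumulator shape (32x64).
#id    = affine_map<(d0, d1) -> (d0, d1)>
#bcast = affine_map<(d0, d1) -> (d1)>
#tr    = affine_map<(d0, d1) -> (d1, d0)>
func.func @bias_add_transposed(%acc: tensor<32x64xf32>,
                               %bias: tensor<64xf32>) -> tensor<64x32xf32> {
  %init = tensor.empty() : tensor<64x32xf32>
  %res = linalg.generic {indexing_maps = [#id, #bcast, #tr],
                         iterator_types = ["parallel", "parallel"]}
      ins(%acc, %bias : tensor<32x64xf32>, tensor<64xf32>)
      outs(%init : tensor<64x32xf32>) {
  ^bb0(%a: f32, %b: f32, %out: f32):
    %sum = arith.addf %a, %b : f32
    linalg.yield %sum : f32
  } -> tensor<64x32xf32>
  return %res : tensor<64x32xf32>
}
```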
I guess long-term we can try other solutions. It's also possible that something lower in the stack will already optimize this away.
(pack-peel pipeline)
The final allocations for a matmul (M=N=1024, K=512) with a bias (a 1-D vector of 1024 elements) are:
There are 2 allocations in local memory (memory space '2' above) for the result tensor. The first (%alloc) is used to accumulate the matmul, and the second (%alloc_7) is used to store the result of adding the bias (%alloc_1) to the matmul accumulation (%alloc).
Ideally the addition of the bias to the matmul accumulator could be done in place, i.e. %alloc_7 would reuse %alloc.
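To make the pattern concrete, here is a minimal sketch (the tile shapes, function name, and surrounding ops are assumptions for illustration; this is not the actual IR produced by the pack-peel pipeline). The two result buffers in memory space 2 mirror the %alloc / %alloc_7 pair described above:

```mlir
// Hypothetical sketch (tile shapes and names are illustrative, not the
// actual IR): two local (memory space 2) buffers are kept for the result.
#id    = affine_map<(d0, d1) -> (d0, d1)>
#bcast = affine_map<(d0, d1) -> (d1)>
func.func @two_result_buffers(%lhs: memref<64x32xf32, 2>,
                              %rhs: memref<32x64xf32, 2>,
                              %bias: memref<64xf32, 2>) {
  %cst = arith.constant 0.0 : f32
  // First local result buffer: the matmul accumulator.
  %alloc = memref.alloc() : memref<64x64xf32, 2>
  linalg.fill ins(%cst : f32) outs(%alloc : memref<64x64xf32, 2>)
  linalg.matmul ins(%lhs, %rhs : memref<64x32xf32, 2>, memref<32x64xf32, 2>)
                outs(%alloc : memref<64x64xf32, 2>)
  // Local bias buffer (%alloc_1 in the issue), filled with a copy here.
  %alloc_1 = memref.alloc() : memref<64xf32, 2>
  linalg.copy ins(%bias : memref<64xf32, 2>) outs(%alloc_1 : memref<64xf32, 2>)
  // Second local result buffer (%alloc_7 in the issue): holds accumulator
  // plus bias. Ideally this allocation would be elided and the addition
  // would write back into %alloc instead.
  %alloc_7 = memref.alloc() : memref<64x64xf32, 2>
  linalg.generic {indexing_maps = [#id, #bcast, #id],
                  iterator_types = ["parallel", "parallel"]}
      ins(%alloc, %alloc_1 : memref<64x64xf32, 2>, memref<64xf32, 2>)
      outs(%alloc_7 : memref<64x64xf32, 2>) {
  ^bb0(%a: f32, %b: f32, %out: f32):
    %sum = arith.addf %a, %b : f32
    linalg.yield %sum : f32
  }
  memref.dealloc %alloc : memref<64x64xf32, 2>
  memref.dealloc %alloc_1 : memref<64xf32, 2>
  memref.dealloc %alloc_7 : memref<64x64xf32, 2>
  return
}
```

In the ideal in-place form, the linalg.generic would take only the bias buffer as an input and use %alloc as its outs operand, so %alloc_7 would never be allocated.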