nod-ai / iree-amd-aie

IREE plugin repository for the AMD AIE accelerator
Apache License 2.0

Privatize local memory for tensors #448

Open newling opened 1 week ago


With the pack-peel pipeline, a matmul (m=n=1024, k=512) followed by the addition of a bias (a 1-d vector with 1024 values) results in the following final allocations (I've renamed the SSA values for clarity):

```mlir
%bias_local = memref.alloc() : memref<1x16x4xf32, 2 : i32>
%bias_shared = memref.alloc() : memref<2x64xf32, 1 : i32>
%B_local = memref.alloc() : memref<1x1x16x8x8x4xbf16, 2 : i32>
%A_local = memref.alloc() : memref<1x1x8x16x4x8xbf16, 2 : i32>
%B_shared = memref.alloc() : memref<1x2x64x64xbf16, 1 : i32>
%A_shared = memref.alloc() : memref<2x1x64x64xbf16, 1 : i32>
%C_local = memref.alloc() : memref<2x2x16x16x4x4xf32, 2 : i32>
%C_shared = memref.alloc() : memref<2x2x64x64xf32, 1 : i32>
```

The above is for a design using a 2x2 array of AIE cores. The IR contains a loop over the 2x2 cores, indexing into these buffers as follows:

```mlir
scf.forall (%arg2, %arg3) in (2, 2) {

  // Copy from a slice of shared-memory A into the local-memory buffer for A
  %subview_14 = memref.subview %A_shared[%arg2, 0, 0, 0] [1, 1, 64, 64] [1, 1, 1, 1]...
  iree_linalg_ext.pack %subview_14 outer_dims_perm = [0, 1, 3, 2] inner_dims_pos = [2, 3] inner_tiles = [4, 8] into %A_local ...

  // Likewise for B
  %subview_15 = memref.subview %B_shared[0, %arg3, 0, 0] [1, 1, 64, 64] [1, 1, 1, 1]...
  iree_linalg_ext.pack %subview_15 outer_dims_perm = [0, 1, 3, 2] inner_dims_pos = [2, 3] inner_tiles = [8, 4] into %B_local ...

  // For C, a slice of the local buffer itself is taken
  %subview_16 = memref.subview %C_local[%arg2, %arg3, 0, 0, 0, 0] [1, 1, 16, 16, 4, 4] [1, 1, 1, 1, 1, 1]...

  ...
}
```

For A and B, a subview of the shared-memory buffer is copied into the entire local buffer. For C it is the reverse: a subview is taken of the local buffer.

I find this very confusing, and think it would be much better if C were already 'privatized' per core, so that instead of

```mlir
%C_local = memref.alloc() : memref<2x2x16x16x4x4xf32, 2 : i32>
```

the allocation was

```mlir
%C_local = memref.alloc() : memref<1x1x16x16x4x4xf32, 2 : i32>
```

and then it would effectively just be

```mlir
%subview_16 = %C_local
```

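For illustration, here is a sketch of what the forall body might then look like. This is hypothetical IR, assuming the C allocation shrinks to the per-core tile (the `[1, 1, 16, 16, 4, 4]` sizes of the subview above):

```mlir
scf.forall (%arg2, %arg3) in (2, 2) {
  // Hypothetical: each core's C tile is its own private allocation,
  // with no outer 2x2 dimensions indexed by (%arg2, %arg3).
  %C_local = memref.alloc() : memref<1x1x16x16x4x4xf32, 2 : i32>
  ...
}
```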
This seems like it would be more in line with how GPU abstractions work (I'm thinking of OpenCL kernels, where each work-group declares its own local memory). There shouldn't ever be a single contiguous block of memory representing all of the cores' data memories, IMO.