Abhishek-Varma opened 1 month ago
Regarding Inference.2: What I've been able to find is that, since we remove the Matmul argument due to canonicalization, the bufferization pass ends up creating the corresponding copy op before the Epilogue computation starts. This is not the case when canonicalization is switched off: the copy op is then created after the computation, gets canonicalized away, and we get correct IR.
CC: @MaheshRavishankar @yzhang93
Thanks @Abhishek-Varma! The CSE issue is introduced after createAMDAIEPeelForLoopPass, and the canonicalization issue is introduced after createAMDAIEFuseConsumerIntoLoopPass.
A summary of what is observed by running CSE. Before CSE:
%6 = scf.forall (%arg0, %arg1) = (0, 0) to (128, 128) step (64, 64) shared_outs(%arg2 = %5) -> (tensor<128x128xf32>) {
// Prologue
%extracted_slice_7 = tensor.extract_slice %extracted_slice[0, 0] [64, 32] [1, 1] : tensor<64x256xbf16> to tensor<64x32xbf16>
%9 = bufferization.to_tensor %alloc_2 restrict writable : memref<2x1x32x32xbf16, 1 : i32>
%pack = tensor.pack %extracted_slice_7 inner_dims_pos = [0, 1] inner_tiles = [32, 32] into %9 : tensor<64x32xbf16> -> tensor<2x1x32x32xbf16>
... ...
// Epilogue
%extracted_slice_10 = tensor.extract_slice %extracted_slice[0, 224] [64, 32] [1, 1] : tensor<64x256xbf16> to tensor<64x32xbf16>
%13 = bufferization.to_tensor %alloc_2 restrict writable : memref<2x1x32x32xbf16, 1 : i32>
%pack_11 = tensor.pack %extracted_slice_10 inner_dims_pos = [0, 1] inner_tiles = [32, 32] into %13 : tensor<64x32xbf16> -> tensor<2x1x32x32xbf16>
... ...
After CSE, it removes %13 = bufferization.to_tensor %alloc_2 restrict writable : memref<2x1x32x32xbf16, 1 : i32>, and packs into %9 instead:
%6 = scf.forall (%arg0, %arg1) = (0, 0) to (128, 128) step (64, 64) shared_outs(%arg2 = %5) -> (tensor<128x128xf32>) {
// Prologue
%extracted_slice_7 = tensor.extract_slice %extracted_slice[0, 0] [64, 32] [1, 1] : tensor<64x256xbf16> to tensor<64x32xbf16>
%9 = bufferization.to_tensor %alloc_2 restrict writable : memref<2x1x32x32xbf16, 1 : i32>
%pack = tensor.pack %extracted_slice_7 inner_dims_pos = [0, 1] inner_tiles = [32, 32] into %9 : tensor<64x32xbf16> -> tensor<2x1x32x32xbf16>
... ...
// Epilogue
%extracted_slice_10 = tensor.extract_slice %extracted_slice[0, 224] [64, 32] [1, 1] : tensor<64x256xbf16> to tensor<64x32xbf16>
%pack_11 = tensor.pack %extracted_slice_10 inner_dims_pos = [0, 1] inner_tiles = [32, 32] into %9 : tensor<64x32xbf16> -> tensor<2x1x32x32xbf16>
... ...
So, after bufferization, it creates a new alloc and copies for memref<2x1x32x32xbf16, 1 : i32> as:
%alloc_0 = memref.alloc() : memref<2x1x32x32xbf16, 1 : i32>
linalg.generic {indexing_maps = [affine_map<(d0, d1, d2, d3) -> (d0, d1, d2, d3)>, affine_map<(d0, d1, d2, d3) -> (d0, d1, d2, d3)>], iterator_types = ["parallel", "parallel", "parallel", "parallel"]} ins(%alloc_5 : memref<2x1x32x32xbf16, 1 : i32>) outs(%alloc_0 : memref<2x1x32x32xbf16, 1 : i32>) {
^bb0(%in: bf16, %out: bf16):
linalg.yield %in : bf16
}
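To make explicit why the copy appears, here is a minimal sketch (not an exact pass dump) of the CSE step and its consequence:

// Before CSE: two textually identical ops, each meant to give a fresh tensor
// view of the same buffer.
%9 = bufferization.to_tensor %alloc_2 restrict writable : memref<2x1x32x32xbf16, 1 : i32>
... ...
%13 = bufferization.to_tensor %alloc_2 restrict writable : memref<2x1x32x32xbf16, 1 : i32>
// CSE treats the two ops as identical and keeps only %9, so the epilogue pack
// now writes into the same tensor value as the prologue pack. Bufferization
// then has to keep those two writes apart, which is where the new memref.alloc
// and the linalg.generic copy above come from.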
A summary of what is observed by running the canonicalization pattern (ForallOpIterArgsFolder). Before canonicalization:
%16:2 = scf.forall (%arg3, %arg4) in (2, 2) shared_outs(%arg5 = %13, %arg6 = %7) -> (tensor<2x2x8x8x4x4xf32>, tensor<2x2x32x32xf32>) {
... ...
%19 = linalg.generic {indexing_maps = [affine_map<(d0, d1, d2, d3, d4, d5, d6, d7, d8) -> (d0, d2, d5, d3, d6, d8)>, affine_map<(d0, d1, d2, d3, d4, d5, d6, d7, d8) -> (d2, d1, d4, d5, d8, d7)>, affine_map<(d0, d1, d2, d3, d4, d5, d6, d7, d8) -> (d0, d1, d4, d3, d6, d7)>], iterator_types = ["parallel", "parallel", "reduction", "parallel", "parallel", "reduction", "parallel", "parallel", "reduction"]} ins(%pack_15, %pack_17 : tensor<1x1x4x8x4x8xbf16>, tensor<1x1x8x4x8x4xbf16>) outs(%extracted_slice_18 : tensor<1x1x8x8x4x4xf32>) attrs = {lowering_config = #iree_codegen.lowering_config<tile_sizes = [[64, 64], [0, 0, 1], [1, 1, 0, 0, 0, 0]]>, packing_config = #amdaie.packing_config<packing_config = [{packedSizes = [32, 32, 32], transposePackIndices = [1], unpackEmpty = [false], innerPerm = [[1, 0]], outerPerm = [[0, 1]]}, {packedSizes = [0, 0, 0, 4, 4, 8], transposePackIndices = [0, 1, 2], unpackEmpty = [false, false, true], innerPerm = [[0, 1], [1, 0], [0, 1]], outerPerm = [[0, 1, 3, 2], [0, 1, 3, 2], [0, 1, 3, 2]]}]>} {
^bb0(%in: bf16, %in_22: bf16, %out: f32):
%20 = arith.extf %in : bf16 to f32
%21 = arith.extf %in_22 : bf16 to f32
%22 = arith.mulf %20, %21 : f32
%23 = arith.addf %out, %22 : f32
linalg.yield %23 : f32
} -> tensor<1x1x8x8x4x4xf32>
%inserted_slice = tensor.insert_slice %19 into %arg5[%arg3, %arg4, 0, 0, 0, 0] [1, 1, 8, 8, 4, 4] [1, 1, 1, 1, 1, 1] : tensor<1x1x8x8x4x4xf32> into tensor<2x2x8x8x4x4xf32>
%extracted_slice_19 = tensor.extract_slice %arg6[%arg3, %arg4, 0, 0] [1, 1, 32, 32] [1, 1, 1, 1] : tensor<2x2x32x32xf32> to tensor<1x1x32x32xf32>
%extracted_slice_20 = tensor.extract_slice %inserted_slice[%arg3, %arg4, 0, 0, 0, 0] [1, 1, 8, 8, 4, 4] [1, 1, 1, 1, 1, 1] : tensor<2x2x8x8x4x4xf32> to tensor<1x1x8x8x4x4xf32>
%unpack_21 = tensor.unpack %19 outer_dims_perm = [0, 1, 3, 2] inner_dims_pos = [2, 3] inner_tiles = [4, 4] into %extracted_slice_19 : tensor<1x1x8x8x4x4xf32> -> tensor<1x1x32x32xf32>
scf.forall.in_parallel {
tensor.parallel_insert_slice %unpack_21 into %arg6[%arg3, %arg4, 0, 0] [1, 1, 32, 32] [1, 1, 1, 1] : tensor<1x1x32x32xf32> into tensor<2x2x32x32xf32>
tensor.parallel_insert_slice %19 into %arg5[%arg3, %arg4, 0, 0, 0, 0] [1, 1, 8, 8, 4, 4] [1, 1, 1, 1, 1, 1] : tensor<1x1x8x8x4x4xf32> into tensor<2x2x8x8x4x4xf32>
}
}
After canonicalization, it removes a shared_outs argument:
%15 = scf.forall (%arg3, %arg4) in (2, 2) shared_outs(%arg5 = %7) -> (tensor<2x2x32x32xf32>) {
... ...
%18 = linalg.generic {indexing_maps = [affine_map<(d0, d1, d2, d3, d4, d5, d6, d7, d8) -> (d0, d2, d5, d3, d6, d8)>, affine_map<(d0, d1, d2, d3, d4, d5, d6, d7, d8) -> (d2, d1, d4, d5, d8, d7)>, affine_map<(d0, d1, d2, d3, d4, d5, d6, d7, d8) -> (d0, d1, d4, d3, d6, d7)>], iterator_types = ["parallel", "parallel", "reduction", "parallel", "parallel", "reduction", "parallel", "parallel", "reduction"]} ins(%pack_15, %pack_17 : tensor<1x1x4x8x4x8xbf16>, tensor<1x1x8x4x8x4xbf16>) outs(%extracted_slice_18 : tensor<1x1x8x8x4x4xf32>) attrs = {lowering_config = #iree_codegen.lowering_config<tile_sizes = [[64, 64], [0, 0, 1], [1, 1, 0, 0, 0, 0]]>, packing_config = #amdaie.packing_config<packing_config = [{packedSizes = [32, 32, 32], transposePackIndices = [1], unpackEmpty = [false], innerPerm = [[1, 0]], outerPerm = [[0, 1]]}, {packedSizes = [0, 0, 0, 4, 4, 8], transposePackIndices = [0, 1, 2], unpackEmpty = [false, false, true], innerPerm = [[0, 1], [1, 0], [0, 1]], outerPerm = [[0, 1, 3, 2], [0, 1, 3, 2], [0, 1, 3, 2]]}]>} {
^bb0(%in: bf16, %in_21: bf16, %out: f32):
%19 = arith.extf %in : bf16 to f32
%20 = arith.extf %in_21 : bf16 to f32
%21 = arith.mulf %19, %20 : f32
%22 = arith.addf %out, %21 : f32
linalg.yield %22 : f32
} -> tensor<1x1x8x8x4x4xf32>
%extracted_slice_19 = tensor.extract_slice %arg5[%arg3, %arg4, 0, 0] [1, 1, 32, 32] [1, 1, 1, 1] : tensor<2x2x32x32xf32> to tensor<1x1x32x32xf32>
%unpack_20 = tensor.unpack %18 outer_dims_perm = [0, 1, 3, 2] inner_dims_pos = [2, 3] inner_tiles = [4, 4] into %extracted_slice_19 : tensor<1x1x8x8x4x4xf32> -> tensor<1x1x32x32xf32>
scf.forall.in_parallel {
tensor.parallel_insert_slice %unpack_20 into %arg5[%arg3, %arg4, 0, 0] [1, 1, 32, 32] [1, 1, 1, 1] : tensor<1x1x32x32xf32> into tensor<2x2x32x32xf32>
}
}
After bufferization, it creates a new alloc and copies for the output memref<1x1x8x8x4x4xf32, 2 : i32>:
%alloc = memref.alloc() : memref<1x1x8x8x4x4xf32, 2 : i32>
%subview_15 = memref.subview %alloc_5[%arg2, %arg3, 0, 0, 0, 0] [1, 1, 8, 8, 4, 4] [1, 1, 1, 1, 1, 1] : memref<2x2x8x8x4x4xf32, 2 : i32> to memref<1x1x8x8x4x4xf32, strided<[2048, 1024, 128, 16, 4, 1], offset: ?>, 2 : i32>
linalg.generic {indexing_maps = [affine_map<(d0, d1, d2, d3, d4, d5) -> (d0, d1, d2, d3, d4, d5)>, affine_map<(d0, d1, d2, d3, d4, d5) -> (d0, d1, d2, d3, d4, d5)>], iterator_types = ["parallel", "parallel", "parallel", "parallel", "parallel", "parallel"]} ins(%subview_15 : memref<1x1x8x8x4x4xf32, strided<[2048, 1024, 128, 16, 4, 1], offset: ?>, 2 : i32>) outs(%alloc : memref<1x1x8x8x4x4xf32, 2 : i32>) {
^bb0(%in: f32, %out: f32):
linalg.yield %in : f32
}
Let me leave the CSE issue aside for now; the canonicalization looks OK to me and I don't see the link from that to the bufferization allocation. Taking this:
%16:2 = scf.forall (%arg3, %arg4) in (2, 2) shared_outs(%arg5 = %13, %arg6 = %7) -> (tensor<2x2x8x8x4x4xf32>, tensor<2x2x32x32xf32>) {
... ...
%19 = linalg.generic {indexing_maps = [affine_map<(d0, d1, d2, d3, d4, d5, d6, d7, d8) -> (d0, d2, d5, d3, d6, d8)>, affine_map<(d0, d1, d2, d3, d4, d5, d6, d7, d8) -> (d2, d1, d4, d5, d8, d7)>, affine_map<(d0, d1, d2, d3, d4, d5, d6, d7, d8) -> (d0, d1, d4, d3, d6, d7)>], iterator_types = ["parallel", "parallel", "reduction", "parallel", "parallel", "reduction", "parallel", "parallel", "reduction"]} ins(%pack_15, %pack_17 : tensor<1x1x4x8x4x8xbf16>, tensor<1x1x8x4x8x4xbf16>) outs(%extracted_slice_18 : tensor<1x1x8x8x4x4xf32>) attrs = {lowering_config = #iree_codegen.lowering_config<tile_sizes = [[64, 64], [0, 0, 1], [1, 1, 0, 0, 0, 0]]>, packing_config = #amdaie.packing_config<packing_config = [{packedSizes = [32, 32, 32], transposePackIndices = [1], unpackEmpty = [false], innerPerm = [[1, 0]], outerPerm = [[0, 1]]}, {packedSizes = [0, 0, 0, 4, 4, 8], transposePackIndices = [0, 1, 2], unpackEmpty = [false, false, true], innerPerm = [[0, 1], [1, 0], [0, 1]], outerPerm = [[0, 1, 3, 2], [0, 1, 3, 2], [0, 1, 3, 2]]}]>} {
^bb0(%in: bf16, %in_22: bf16, %out: f32):
%20 = arith.extf %in : bf16 to f32
%21 = arith.extf %in_22 : bf16 to f32
%22 = arith.mulf %20, %21 : f32
%23 = arith.addf %out, %22 : f32
linalg.yield %23 : f32
} -> tensor<1x1x8x8x4x4xf32>
%inserted_slice = tensor.insert_slice %19 into %arg5[%arg3, %arg4, 0, 0, 0, 0] [1, 1, 8, 8, 4, 4] [1, 1, 1, 1, 1, 1] : tensor<1x1x8x8x4x4xf32> into tensor<2x2x8x8x4x4xf32>
%extracted_slice_19 = tensor.extract_slice %arg6[%arg3, %arg4, 0, 0] [1, 1, 32, 32] [1, 1, 1, 1] : tensor<2x2x32x32xf32> to tensor<1x1x32x32xf32>
%extracted_slice_20 = tensor.extract_slice %inserted_slice[%arg3, %arg4, 0, 0, 0, 0] [1, 1, 8, 8, 4, 4] [1, 1, 1, 1, 1, 1] : tensor<2x2x8x8x4x4xf32> to tensor<1x1x8x8x4x4xf32>
%unpack_21 = tensor.unpack %19 outer_dims_perm = [0, 1, 3, 2] inner_dims_pos = [2, 3] inner_tiles = [4, 4] into %extracted_slice_19 : tensor<1x1x8x8x4x4xf32> -> tensor<1x1x32x32xf32>
scf.forall.in_parallel {
tensor.parallel_insert_slice %unpack_21 into %arg6[%arg3, %arg4, 0, 0] [1, 1, 32, 32] [1, 1, 1, 1] : tensor<1x1x32x32xf32> into tensor<2x2x32x32xf32>
tensor.parallel_insert_slice %19 into %arg5[%arg3, %arg4, 0, 0, 0, 0] [1, 1, 8, 8, 4, 4] [1, 1, 1, 1, 1, 1] : tensor<1x1x8x8x4x4xf32> into tensor<2x2x8x8x4x4xf32>
}
}
if %16#0 is dead, then that value is never used, and the use of %arg5 is also dead, so there is no reason to keep it around. The canonicalization does not seem to be the root cause here; something else is going off the rails.
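For reference, a minimal sketch (hypothetical shapes, unrelated to the IR above) of what this canonicalization amounts to: when a forall result is unused, the corresponding shared_outs operand and its parallel_insert_slice can simply be dropped.

// Before: %r#0 is never used later, so it is dead.
%r:2 = scf.forall (%i) in (2) shared_outs(%o0 = %a, %o1 = %b) -> (tensor<2x4xf32>, tensor<2x4xf32>) {
  %s = tensor.empty() : tensor<1x4xf32>
  scf.forall.in_parallel {
    tensor.parallel_insert_slice %s into %o0[%i, 0] [1, 4] [1, 1] : tensor<1x4xf32> into tensor<2x4xf32>
    tensor.parallel_insert_slice %s into %o1[%i, 0] [1, 4] [1, 1] : tensor<1x4xf32> into tensor<2x4xf32>
  }
}
// After canonicalization: only the live result and its shared_outs remain.
%r = scf.forall (%i) in (2) shared_outs(%o1 = %b) -> (tensor<2x4xf32>) {
  %s = tensor.empty() : tensor<1x4xf32>
  scf.forall.in_parallel {
    tensor.parallel_insert_slice %s into %o1[%i, 0] [1, 4] [1, 1] : tensor<1x4xf32> into tensor<2x4xf32>
  }
}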
I think the issue is again related to vectorization. If we bufferize the following IR directly
%15 = scf.forall (%arg3, %arg4) in (2, 2) shared_outs(%arg5 = %7) -> (tensor<2x2x32x32xf32>) {
... ...
%18 = linalg.generic {indexing_maps = [affine_map<(d0, d1, d2, d3, d4, d5, d6, d7, d8) -> (d0, d2, d5, d3, d6, d8)>, affine_map<(d0, d1, d2, d3, d4, d5, d6, d7, d8) -> (d2, d1, d4, d5, d8, d7)>, affine_map<(d0, d1, d2, d3, d4, d5, d6, d7, d8) -> (d0, d1, d4, d3, d6, d7)>], iterator_types = ["parallel", "parallel", "reduction", "parallel", "parallel", "reduction", "parallel", "parallel", "reduction"]} ins(%pack_15, %pack_17 : tensor<1x1x4x8x4x8xbf16>, tensor<1x1x8x4x8x4xbf16>) outs(%extracted_slice_18 : tensor<1x1x8x8x4x4xf32>) attrs = {lowering_config = #iree_codegen.lowering_config<tile_sizes = [[64, 64], [0, 0, 1], [1, 1, 0, 0, 0, 0]]>, packing_config = #amdaie.packing_config<packing_config = [{packedSizes = [32, 32, 32], transposePackIndices = [1], unpackEmpty = [false], innerPerm = [[1, 0]], outerPerm = [[0, 1]]}, {packedSizes = [0, 0, 0, 4, 4, 8], transposePackIndices = [0, 1, 2], unpackEmpty = [false, false, true], innerPerm = [[0, 1], [1, 0], [0, 1]], outerPerm = [[0, 1, 3, 2], [0, 1, 3, 2], [0, 1, 3, 2]]}]>} {
^bb0(%in: bf16, %in_21: bf16, %out: f32):
%19 = arith.extf %in : bf16 to f32
%20 = arith.extf %in_21 : bf16 to f32
%21 = arith.mulf %19, %20 : f32
%22 = arith.addf %out, %21 : f32
linalg.yield %22 : f32
} -> tensor<1x1x8x8x4x4xf32>
%extracted_slice_19 = tensor.extract_slice %arg5[%arg3, %arg4, 0, 0] [1, 1, 32, 32] [1, 1, 1, 1] : tensor<2x2x32x32xf32> to tensor<1x1x32x32xf32>
%unpack_20 = tensor.unpack %18 outer_dims_perm = [0, 1, 3, 2] inner_dims_pos = [2, 3] inner_tiles = [4, 4] into %extracted_slice_19 : tensor<1x1x8x8x4x4xf32> -> tensor<1x1x32x32xf32>
scf.forall.in_parallel {
tensor.parallel_insert_slice %unpack_20 into %arg5[%arg3, %arg4, 0, 0] [1, 1, 32, 32] [1, 1, 1, 1] : tensor<1x1x32x32xf32> into tensor<2x2x32x32xf32>
}
}
it is bound to create a buffer; something needs to hold the value for %18. This is again an issue of us not vectorizing these operations. AFAIK we have vectorization support for tensor.unpack operations as well. Vectorization is the only way we can avoid the allocation and the copies.
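Roughly, what vectorizing the epilogue (including the tensor.unpack) would buy is something like the sketch below; the permutation and types are illustrative, not the output of the actual vectorizer:

%c0 = arith.constant 0 : index
%cst = arith.constant 0.000000e+00 : f32
// The accumulator lives in a vector instead of a tensor result that
// bufferization has to back with a fresh memref.
%acc = vector.transfer_read %extracted_slice_18[%c0, %c0, %c0, %c0, %c0, %c0], %cst : tensor<1x1x8x8x4x4xf32>, vector<1x1x8x8x4x4xf32>
// ... vectorized contraction producing %res : vector<1x1x8x8x4x4xf32> ...
// The unpack then becomes a transpose + shape_cast + write straight into the
// destination slice, so no intermediate 1x1x8x8x4x4 allocation is needed.
%t = vector.transpose %res, [0, 1, 3, 4, 2, 5] : vector<1x1x8x8x4x4xf32> to vector<1x1x8x4x8x4xf32>
%sc = vector.shape_cast %t : vector<1x1x8x4x8x4xf32> to vector<1x1x32x32xf32>
%w = vector.transfer_write %sc, %extracted_slice_19[%c0, %c0, %c0, %c0] : vector<1x1x32x32xf32>, tensor<1x1x32x32xf32>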
As discussed offline, I've tabled the tensor.unpack vectorization point as per this thread.
Now, with the new vectorization branch (along with the consumer fusion fix upstream) for Matmul + Relu, here's what I found :-
- It fails at the insert-cores pass, but that should be easy to fix - I can do that once we discuss the points below (because I need confirmation on whether we'd really go ahead with the new vectorization method or not).
- f32 Relu: it fails at the VectorToAIEVecConversions pass - element bit width = 32 and lane size is 16, because we have vector<1x1x1x1x4x4xf32>, and this marks it as an illegal op and fails the lowering.
- bf16 Relu: it fails at AIEVecToLLVM because it wants Relu to be a 1-D vector. @MaheshRavishankar - I recall you mentioning something along these lines - I'll try to add flattening of Relu; from what I see I could perhaps add this right before bufferization (@MaheshRavishankar correct me if I'm wrong). A sketch of such flattening is given after the notes below.

NOTE: The branch linked above switches off CANONICALIZE + CSE - but we still see the temporary buffer issue for all the cases above.
NOTE: The consumer fusion fix that I mention is needed because I've switched off CANONICALIZE.
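A minimal sketch of the kind of flattening I have in mind for the bf16 Relu case (the shapes and the exact insertion point are assumptions on my side, not something the current passes do):

// Flatten the multi-dimensional elementwise vector to 1-D so the AIEVec
// lowering only ever sees rank-1 vectors, then restore the original shape.
%flat = vector.shape_cast %v : vector<1x1x1x1x4x4xbf16> to vector<16xbf16>
%zero = arith.constant dense<0.000000e+00> : vector<16xbf16>
%relu = arith.maximumf %flat, %zero : vector<16xbf16>
%back = vector.shape_cast %relu : vector<16xbf16> to vector<1x1x1x1x4x4xbf16>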
Regarding the temporary buffer issue - I tried inspecting the IR and here's what I've inferred :-
%a = tensor<...xf32>
%b = tensor<...xbf16>
// Prologue
vectorized loop for parallel dims
READ vector from tensor %a
vectorized loop for reduction dims
WRITE vector to tensor %a
// Main
scf.for
vectorized loop for parallel dims
READ vector from tensor %a
vectorized loop for reduction dims
WRITE vector to tensor %a
// Epilogue
vectorized loop for parallel dims
READ vector from tensor %a
vectorized loop for reduction dims
Compute Relu/Elementwise
WRITE vector to tensor %b
WRITE vector to tensor %a
Since we have 3 sections whose boundaries only deal with TENSORS, we get this extra buffer issue.
This would not be solved even if %a and %b were of the SAME data type, because the 3 sections would still only exchange TENSORS at their boundaries.
Therefore we essentially need a temporary buffer to write the output of each section into.
(tensor) -> Prologue -> (tensor) -> Main -> (tensor) -> Epilogue -> (tensor)
(NOTE: Prologue, Main and Epilogue each internally work on vector.)
I speculate that if we propagate the vector type and make each section communicate its data to the next via vector values, we might not need this temporary buffer.
We should therefore have the following :-
(tensor) -> Prologue -> (vector) -> Main -> (vector) -> Epilogue -> (tensor)
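As a rough sketch of what that could look like (purely illustrative - the ops are standard MLIR, but nothing in the current pipeline produces this), the main loop would carry the accumulator as an scf.for iter_args of vector type, and only the Epilogue would write back to a tensor:

%c0 = arith.constant 0 : index
%c1 = arith.constant 1 : index
%c7 = arith.constant 7 : index
%cst = arith.constant 0.000000e+00 : f32
// Prologue: read the accumulator into a vector once.
%init = vector.transfer_read %a[%c0, %c0, %c0, %c0, %c0, %c0], %cst : tensor<1x1x8x8x4x4xf32>, vector<1x1x8x8x4x4xf32>
// Main: the vector is threaded through iter_args instead of being written back
// to tensor %a (and re-read) at every section boundary.
%acc = scf.for %k = %c1 to %c7 step %c1 iter_args(%v = %init) -> (vector<1x1x8x8x4x4xf32>) {
  // ... vectorized matmul contribution for iteration %k producing %contrib ...
  %sum = arith.addf %v, %contrib : vector<1x1x8x8x4x4xf32>
  scf.yield %sum : vector<1x1x8x8x4x4xf32>
}
// Epilogue: apply Relu/elementwise on the vector and only now touch tensors.
%zero = arith.constant dense<0.000000e+00> : vector<1x1x8x8x4x4xf32>
%relu = arith.maximumf %acc, %zero : vector<1x1x8x8x4x4xf32>
%relu_bf16 = arith.truncf %relu : vector<1x1x8x8x4x4xf32> to vector<1x1x8x8x4x4xbf16>
%res_b = vector.transfer_write %relu_bf16, %b[%c0, %c0, %c0, %c0, %c0, %c0] : vector<1x1x8x8x4x4xbf16>, tensor<1x1x8x8x4x4xbf16>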
We clearly have quite a few issues.
But to possibly make Relu work as a stop-gap solution by switching off CANONICALIZE + CSE (like in the new vectorization branch I linked above), I propose doing the following :-
- Propagate vector to the sections' boundaries - I currently don't know what can of worms this itself would entail.

@MaheshRavishankar please let me know your thoughts.
GOAL: Triage and solve the issue of redundant linalg.generic copies AND the temporary buffer issue.

EXPERIMENT: The following experiment was done on the main branch and for PURE MATMUL, for which you may find the input IR here :-

OBSERVATION:
- linalg.generic copies for LHS and RHS (experiment number 2).
- linalg.generic copy for the OUTPUT of Matmul - and this in turn gets us the temporary buffer issue (experiment number 3).
- linalg.generic copies for LHS, RHS and OUTPUT of Matmul, plus the temporary buffer issue (experiment number 4).

INFERENCE (so far): We have two orthogonal issues =>
- cse.

NOTE: We now need to take a look at how bufferization (or some other pass) is working in tandem with CSE/ForallOpIterArgsFolder separately - it might be that the issue is actually in bufferization, or vice-versa. Therefore this needs further triaging (and an update in this same issue thread).

GLOSSARY:
- linalg.generic copies: the linalg.generic ops that merely yield their input operand (i.e. element-wise copies), as seen in the bufferization output snippets above.