Abhishek-Varma opened 1 month ago
Regarding Inference.2: What I've been able to find is that, since we remove the Matmul argument due to canonicalization, the bufferization pass ends up creating the corresponding copy op before the Epilogue computation starts. This is not the case when canonicalization is switched off: the copy op is then created after the computation, gets canonicalized away, and we get correct IR.
CC: @MaheshRavishankar @yzhang93
Thanks @Abhishek-Varma! The CSE issue is introduced after createAMDAIEPeelForLoopPass, and the canonicalization issue is introduced after createAMDAIEFuseConsumerIntoLoopPass.
A summary of what is observed by running CSE. Before CSE:
%6 = scf.forall (%arg0, %arg1) = (0, 0) to (128, 128) step (64, 64) shared_outs(%arg2 = %5) -> (tensor<128x128xf32>) {
// Prologue
%extracted_slice_7 = tensor.extract_slice %extracted_slice[0, 0] [64, 32] [1, 1] : tensor<64x256xbf16> to tensor<64x32xbf16>
%9 = bufferization.to_tensor %alloc_2 restrict writable : memref<2x1x32x32xbf16, 1 : i32>
%pack = tensor.pack %extracted_slice_7 inner_dims_pos = [0, 1] inner_tiles = [32, 32] into %9 : tensor<64x32xbf16> -> tensor<2x1x32x32xbf16>
... ...
// Epilogue
%extracted_slice_10 = tensor.extract_slice %extracted_slice[0, 224] [64, 32] [1, 1] : tensor<64x256xbf16> to tensor<64x32xbf16>
%13 = bufferization.to_tensor %alloc_2 restrict writable : memref<2x1x32x32xbf16, 1 : i32>
%pack_11 = tensor.pack %extracted_slice_10 inner_dims_pos = [0, 1] inner_tiles = [32, 32] into %13 : tensor<64x32xbf16> -> tensor<2x1x32x32xbf16>
... ...
After CSE, it removes %13 = bufferization.to_tensor %alloc_2 restrict writable : memref<2x1x32x32xbf16, 1 : i32>, and packs into %9 instead:
%6 = scf.forall (%arg0, %arg1) = (0, 0) to (128, 128) step (64, 64) shared_outs(%arg2 = %5) -> (tensor<128x128xf32>) {
// Prologue
%extracted_slice_7 = tensor.extract_slice %extracted_slice[0, 0] [64, 32] [1, 1] : tensor<64x256xbf16> to tensor<64x32xbf16>
%9 = bufferization.to_tensor %alloc_2 restrict writable : memref<2x1x32x32xbf16, 1 : i32>
%pack = tensor.pack %extracted_slice_7 inner_dims_pos = [0, 1] inner_tiles = [32, 32] into %9 : tensor<64x32xbf16> -> tensor<2x1x32x32xbf16>
... ...
// Epilogue
%extracted_slice_10 = tensor.extract_slice %extracted_slice[0, 224] [64, 32] [1, 1] : tensor<64x256xbf16> to tensor<64x32xbf16>
%pack_11 = tensor.pack %extracted_slice_10 inner_dims_pos = [0, 1] inner_tiles = [32, 32] into %9 : tensor<64x32xbf16> -> tensor<2x1x32x32xbf16>
... ...
So, after bufferization, it creates a new alloc and copies for memref<2x1x32x32xbf16, 1 : i32> as:
%alloc_0 = memref.alloc() : memref<2x1x32x32xbf16, 1 : i32>
linalg.generic {indexing_maps = [affine_map<(d0, d1, d2, d3) -> (d0, d1, d2, d3)>, affine_map<(d0, d1, d2, d3) -> (d0, d1, d2, d3)>], iterator_types = ["parallel", "parallel", "parallel", "parallel"]} ins(%alloc_5 : memref<2x1x32x32xbf16, 1 : i32>) outs(%alloc_0 : memref<2x1x32x32xbf16, 1 : i32>) {
^bb0(%in: bf16, %out: bf16):
linalg.yield %in : bf16
}
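To make explicit why the copy appears, here is a minimal sketch (not an exact pass dump) of the CSE step and its consequence:

// Before CSE: two textually identical ops, each meant to give a fresh tensor
// view of the same buffer.
%9 = bufferization.to_tensor %alloc_2 restrict writable : memref<2x1x32x32xbf16, 1 : i32>
... ...
%13 = bufferization.to_tensor %alloc_2 restrict writable : memref<2x1x32x32xbf16, 1 : i32>
// CSE treats the two ops as identical and keeps only %9, so the epilogue pack
// now writes into the same tensor value as the prologue pack. Bufferization
// then has to keep those two writes apart, which is where the new memref.alloc
// and the linalg.generic copy above come from.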
A summary of what is observed by running the canonicalization pattern (ForallOpIterArgsFolder). Before canonicalization:
%16:2 = scf.forall (%arg3, %arg4) in (2, 2) shared_outs(%arg5 = %13, %arg6 = %7) -> (tensor<2x2x8x8x4x4xf32>, tensor<2x2x32x32xf32>) {
... ...
%19 = linalg.generic {indexing_maps = [affine_map<(d0, d1, d2, d3, d4, d5, d6, d7, d8) -> (d0, d2, d5, d3, d6, d8)>, affine_map<(d0, d1, d2, d3, d4, d5, d6, d7, d8) -> (d2, d1, d4, d5, d8, d7)>, affine_map<(d0, d1, d2, d3, d4, d5, d6, d7, d8) -> (d0, d1, d4, d3, d6, d7)>], iterator_types = ["parallel", "parallel", "reduction", "parallel", "parallel", "reduction", "parallel", "parallel", "reduction"]} ins(%pack_15, %pack_17 : tensor<1x1x4x8x4x8xbf16>, tensor<1x1x8x4x8x4xbf16>) outs(%extracted_slice_18 : tensor<1x1x8x8x4x4xf32>) attrs = {lowering_config = #iree_codegen.lowering_config<tile_sizes = [[64, 64], [0, 0, 1], [1, 1, 0, 0, 0, 0]]>, packing_config = #amdaie.packing_config<packing_config = [{packedSizes = [32, 32, 32], transposePackIndices = [1], unpackEmpty = [false], innerPerm = [[1, 0]], outerPerm = [[0, 1]]}, {packedSizes = [0, 0, 0, 4, 4, 8], transposePackIndices = [0, 1, 2], unpackEmpty = [false, false, true], innerPerm = [[0, 1], [1, 0], [0, 1]], outerPerm = [[0, 1, 3, 2], [0, 1, 3, 2], [0, 1, 3, 2]]}]>} {
^bb0(%in: bf16, %in_22: bf16, %out: f32):
%20 = arith.extf %in : bf16 to f32
%21 = arith.extf %in_22 : bf16 to f32
%22 = arith.mulf %20, %21 : f32
%23 = arith.addf %out, %22 : f32
linalg.yield %23 : f32
} -> tensor<1x1x8x8x4x4xf32>
%inserted_slice = tensor.insert_slice %19 into %arg5[%arg3, %arg4, 0, 0, 0, 0] [1, 1, 8, 8, 4, 4] [1, 1, 1, 1, 1, 1] : tensor<1x1x8x8x4x4xf32> into tensor<2x2x8x8x4x4xf32>
%extracted_slice_19 = tensor.extract_slice %arg6[%arg3, %arg4, 0, 0] [1, 1, 32, 32] [1, 1, 1, 1] : tensor<2x2x32x32xf32> to tensor<1x1x32x32xf32>
%extracted_slice_20 = tensor.extract_slice %inserted_slice[%arg3, %arg4, 0, 0, 0, 0] [1, 1, 8, 8, 4, 4] [1, 1, 1, 1, 1, 1] : tensor<2x2x8x8x4x4xf32> to tensor<1x1x8x8x4x4xf32>
%unpack_21 = tensor.unpack %19 outer_dims_perm = [0, 1, 3, 2] inner_dims_pos = [2, 3] inner_tiles = [4, 4] into %extracted_slice_19 : tensor<1x1x8x8x4x4xf32> -> tensor<1x1x32x32xf32>
scf.forall.in_parallel {
tensor.parallel_insert_slice %unpack_21 into %arg6[%arg3, %arg4, 0, 0] [1, 1, 32, 32] [1, 1, 1, 1] : tensor<1x1x32x32xf32> into tensor<2x2x32x32xf32>
tensor.parallel_insert_slice %19 into %arg5[%arg3, %arg4, 0, 0, 0, 0] [1, 1, 8, 8, 4, 4] [1, 1, 1, 1, 1, 1] : tensor<1x1x8x8x4x4xf32> into tensor<2x2x8x8x4x4xf32>
}
}
After canonicalization, it removes a shared_outs argument:
%15 = scf.forall (%arg3, %arg4) in (2, 2) shared_outs(%arg5 = %7) -> (tensor<2x2x32x32xf32>) {
... ...
%18 = linalg.generic {indexing_maps = [affine_map<(d0, d1, d2, d3, d4, d5, d6, d7, d8) -> (d0, d2, d5, d3, d6, d8)>, affine_map<(d0, d1, d2, d3, d4, d5, d6, d7, d8) -> (d2, d1, d4, d5, d8, d7)>, affine_map<(d0, d1, d2, d3, d4, d5, d6, d7, d8) -> (d0, d1, d4, d3, d6, d7)>], iterator_types = ["parallel", "parallel", "reduction", "parallel", "parallel", "reduction", "parallel", "parallel", "reduction"]} ins(%pack_15, %pack_17 : tensor<1x1x4x8x4x8xbf16>, tensor<1x1x8x4x8x4xbf16>) outs(%extracted_slice_18 : tensor<1x1x8x8x4x4xf32>) attrs = {lowering_config = #iree_codegen.lowering_config<tile_sizes = [[64, 64], [0, 0, 1], [1, 1, 0, 0, 0, 0]]>, packing_config = #amdaie.packing_config<packing_config = [{packedSizes = [32, 32, 32], transposePackIndices = [1], unpackEmpty = [false], innerPerm = [[1, 0]], outerPerm = [[0, 1]]}, {packedSizes = [0, 0, 0, 4, 4, 8], transposePackIndices = [0, 1, 2], unpackEmpty = [false, false, true], innerPerm = [[0, 1], [1, 0], [0, 1]], outerPerm = [[0, 1, 3, 2], [0, 1, 3, 2], [0, 1, 3, 2]]}]>} {
^bb0(%in: bf16, %in_21: bf16, %out: f32):
%19 = arith.extf %in : bf16 to f32
%20 = arith.extf %in_21 : bf16 to f32
%21 = arith.mulf %19, %20 : f32
%22 = arith.addf %out, %21 : f32
linalg.yield %22 : f32
} -> tensor<1x1x8x8x4x4xf32>
%extracted_slice_19 = tensor.extract_slice %arg5[%arg3, %arg4, 0, 0] [1, 1, 32, 32] [1, 1, 1, 1] : tensor<2x2x32x32xf32> to tensor<1x1x32x32xf32>
%unpack_20 = tensor.unpack %18 outer_dims_perm = [0, 1, 3, 2] inner_dims_pos = [2, 3] inner_tiles = [4, 4] into %extracted_slice_19 : tensor<1x1x8x8x4x4xf32> -> tensor<1x1x32x32xf32>
scf.forall.in_parallel {
tensor.parallel_insert_slice %unpack_20 into %arg5[%arg3, %arg4, 0, 0] [1, 1, 32, 32] [1, 1, 1, 1] : tensor<1x1x32x32xf32> into tensor<2x2x32x32xf32>
}
}
After bufferization, it creates a new alloc and copies for the output memref<1x1x8x8x4x4xf32, 2 : i32>:
%alloc = memref.alloc() : memref<1x1x8x8x4x4xf32, 2 : i32>
%subview_15 = memref.subview %alloc_5[%arg2, %arg3, 0, 0, 0, 0] [1, 1, 8, 8, 4, 4] [1, 1, 1, 1, 1, 1] : memref<2x2x8x8x4x4xf32, 2 : i32> to memref<1x1x8x8x4x4xf32, strided<[2048, 1024, 128, 16, 4, 1], offset: ?>, 2 : i32>
linalg.generic {indexing_maps = [affine_map<(d0, d1, d2, d3, d4, d5) -> (d0, d1, d2, d3, d4, d5)>, affine_map<(d0, d1, d2, d3, d4, d5) -> (d0, d1, d2, d3, d4, d5)>], iterator_types = ["parallel", "parallel", "parallel", "parallel", "parallel", "parallel"]} ins(%subview_15 : memref<1x1x8x8x4x4xf32, strided<[2048, 1024, 128, 16, 4, 1], offset: ?>, 2 : i32>) outs(%alloc : memref<1x1x8x8x4x4xf32, 2 : i32>) {
^bb0(%in: f32, %out: f32):
linalg.yield %in : f32
}
Let me leave the CSE issue aside for now; the canonicalization looks OK to me and I don't see the link from that to the bufferization allocation. Taking this:
%16:2 = scf.forall (%arg3, %arg4) in (2, 2) shared_outs(%arg5 = %13, %arg6 = %7) -> (tensor<2x2x8x8x4x4xf32>, tensor<2x2x32x32xf32>) {
... ...
%19 = linalg.generic {indexing_maps = [affine_map<(d0, d1, d2, d3, d4, d5, d6, d7, d8) -> (d0, d2, d5, d3, d6, d8)>, affine_map<(d0, d1, d2, d3, d4, d5, d6, d7, d8) -> (d2, d1, d4, d5, d8, d7)>, affine_map<(d0, d1, d2, d3, d4, d5, d6, d7, d8) -> (d0, d1, d4, d3, d6, d7)>], iterator_types = ["parallel", "parallel", "reduction", "parallel", "parallel", "reduction", "parallel", "parallel", "reduction"]} ins(%pack_15, %pack_17 : tensor<1x1x4x8x4x8xbf16>, tensor<1x1x8x4x8x4xbf16>) outs(%extracted_slice_18 : tensor<1x1x8x8x4x4xf32>) attrs = {lowering_config = #iree_codegen.lowering_config<tile_sizes = [[64, 64], [0, 0, 1], [1, 1, 0, 0, 0, 0]]>, packing_config = #amdaie.packing_config<packing_config = [{packedSizes = [32, 32, 32], transposePackIndices = [1], unpackEmpty = [false], innerPerm = [[1, 0]], outerPerm = [[0, 1]]}, {packedSizes = [0, 0, 0, 4, 4, 8], transposePackIndices = [0, 1, 2], unpackEmpty = [false, false, true], innerPerm = [[0, 1], [1, 0], [0, 1]], outerPerm = [[0, 1, 3, 2], [0, 1, 3, 2], [0, 1, 3, 2]]}]>} {
^bb0(%in: bf16, %in_22: bf16, %out: f32):
%20 = arith.extf %in : bf16 to f32
%21 = arith.extf %in_22 : bf16 to f32
%22 = arith.mulf %20, %21 : f32
%23 = arith.addf %out, %22 : f32
linalg.yield %23 : f32
} -> tensor<1x1x8x8x4x4xf32>
%inserted_slice = tensor.insert_slice %19 into %arg5[%arg3, %arg4, 0, 0, 0, 0] [1, 1, 8, 8, 4, 4] [1, 1, 1, 1, 1, 1] : tensor<1x1x8x8x4x4xf32> into tensor<2x2x8x8x4x4xf32>
%extracted_slice_19 = tensor.extract_slice %arg6[%arg3, %arg4, 0, 0] [1, 1, 32, 32] [1, 1, 1, 1] : tensor<2x2x32x32xf32> to tensor<1x1x32x32xf32>
%extracted_slice_20 = tensor.extract_slice %inserted_slice[%arg3, %arg4, 0, 0, 0, 0] [1, 1, 8, 8, 4, 4] [1, 1, 1, 1, 1, 1] : tensor<2x2x8x8x4x4xf32> to tensor<1x1x8x8x4x4xf32>
%unpack_21 = tensor.unpack %19 outer_dims_perm = [0, 1, 3, 2] inner_dims_pos = [2, 3] inner_tiles = [4, 4] into %extracted_slice_19 : tensor<1x1x8x8x4x4xf32> -> tensor<1x1x32x32xf32>
scf.forall.in_parallel {
tensor.parallel_insert_slice %unpack_21 into %arg6[%arg3, %arg4, 0, 0] [1, 1, 32, 32] [1, 1, 1, 1] : tensor<1x1x32x32xf32> into tensor<2x2x32x32xf32>
tensor.parallel_insert_slice %19 into %arg5[%arg3, %arg4, 0, 0, 0, 0] [1, 1, 8, 8, 4, 4] [1, 1, 1, 1, 1, 1] : tensor<1x1x8x8x4x4xf32> into tensor<2x2x8x8x4x4xf32>
}
}
if %16#0 is dead, then that value is never used, and the use of %arg5 is also dead, so there is no reason to keep it around. The canonicalization does not seem to be the root cause here; something else is going off the rails.
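For reference, a minimal sketch (hypothetical shapes, unrelated to the IR above) of what this canonicalization amounts to: when a forall result is unused, the corresponding shared_outs operand and its parallel_insert_slice can simply be dropped.

// Before: %r#0 is never used later, so it is dead.
%r:2 = scf.forall (%i) in (2) shared_outs(%o0 = %a, %o1 = %b) -> (tensor<2x4xf32>, tensor<2x4xf32>) {
  %s = tensor.empty() : tensor<1x4xf32>
  scf.forall.in_parallel {
    tensor.parallel_insert_slice %s into %o0[%i, 0] [1, 4] [1, 1] : tensor<1x4xf32> into tensor<2x4xf32>
    tensor.parallel_insert_slice %s into %o1[%i, 0] [1, 4] [1, 1] : tensor<1x4xf32> into tensor<2x4xf32>
  }
}
// After canonicalization: only the live result and its shared_outs remain.
%r = scf.forall (%i) in (2) shared_outs(%o1 = %b) -> (tensor<2x4xf32>) {
  %s = tensor.empty() : tensor<1x4xf32>
  scf.forall.in_parallel {
    tensor.parallel_insert_slice %s into %o1[%i, 0] [1, 4] [1, 1] : tensor<1x4xf32> into tensor<2x4xf32>
  }
}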
I think the issue is again related to vectorization. If we bufferize the following IR directly
%15 = scf.forall (%arg3, %arg4) in (2, 2) shared_outs(%arg5 = %7) -> (tensor<2x2x32x32xf32>) {
... ...
%18 = linalg.generic {indexing_maps = [affine_map<(d0, d1, d2, d3, d4, d5, d6, d7, d8) -> (d0, d2, d5, d3, d6, d8)>, affine_map<(d0, d1, d2, d3, d4, d5, d6, d7, d8) -> (d2, d1, d4, d5, d8, d7)>, affine_map<(d0, d1, d2, d3, d4, d5, d6, d7, d8) -> (d0, d1, d4, d3, d6, d7)>], iterator_types = ["parallel", "parallel", "reduction", "parallel", "parallel", "reduction", "parallel", "parallel", "reduction"]} ins(%pack_15, %pack_17 : tensor<1x1x4x8x4x8xbf16>, tensor<1x1x8x4x8x4xbf16>) outs(%extracted_slice_18 : tensor<1x1x8x8x4x4xf32>) attrs = {lowering_config = #iree_codegen.lowering_config<tile_sizes = [[64, 64], [0, 0, 1], [1, 1, 0, 0, 0, 0]]>, packing_config = #amdaie.packing_config<packing_config = [{packedSizes = [32, 32, 32], transposePackIndices = [1], unpackEmpty = [false], innerPerm = [[1, 0]], outerPerm = [[0, 1]]}, {packedSizes = [0, 0, 0, 4, 4, 8], transposePackIndices = [0, 1, 2], unpackEmpty = [false, false, true], innerPerm = [[0, 1], [1, 0], [0, 1]], outerPerm = [[0, 1, 3, 2], [0, 1, 3, 2], [0, 1, 3, 2]]}]>} {
^bb0(%in: bf16, %in_21: bf16, %out: f32):
%19 = arith.extf %in : bf16 to f32
%20 = arith.extf %in_21 : bf16 to f32
%21 = arith.mulf %19, %20 : f32
%22 = arith.addf %out, %21 : f32
linalg.yield %22 : f32
} -> tensor<1x1x8x8x4x4xf32>
%extracted_slice_19 = tensor.extract_slice %arg5[%arg3, %arg4, 0, 0] [1, 1, 32, 32] [1, 1, 1, 1] : tensor<2x2x32x32xf32> to tensor<1x1x32x32xf32>
%unpack_20 = tensor.unpack %18 outer_dims_perm = [0, 1, 3, 2] inner_dims_pos = [2, 3] inner_tiles = [4, 4] into %extracted_slice_19 : tensor<1x1x8x8x4x4xf32> -> tensor<1x1x32x32xf32>
scf.forall.in_parallel {
tensor.parallel_insert_slice %unpack_20 into %arg5[%arg3, %arg4, 0, 0] [1, 1, 32, 32] [1, 1, 1, 1] : tensor<1x1x32x32xf32> into tensor<2x2x32x32xf32>
}
}
it is bound to create a buffer; something needs to hold the value for %18. This is again an issue of us not vectorizing these operations. AFAIK we have vectorization support for tensor.unpack operations as well. Vectorization is the only way we can avoid the allocation and the copies.
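Roughly, what vectorizing the epilogue (including the tensor.unpack) would buy is something like the sketch below; the permutation and types are illustrative, not the output of the actual vectorizer:

%c0 = arith.constant 0 : index
%cst = arith.constant 0.000000e+00 : f32
// The accumulator lives in a vector instead of a tensor result that
// bufferization has to back with a fresh memref.
%acc = vector.transfer_read %extracted_slice_18[%c0, %c0, %c0, %c0, %c0, %c0], %cst : tensor<1x1x8x8x4x4xf32>, vector<1x1x8x8x4x4xf32>
// ... vectorized contraction producing %res : vector<1x1x8x8x4x4xf32> ...
// The unpack then becomes a transpose + shape_cast + write straight into the
// destination slice, so no intermediate 1x1x8x8x4x4 allocation is needed.
%t = vector.transpose %res, [0, 1, 3, 4, 2, 5] : vector<1x1x8x8x4x4xf32> to vector<1x1x8x4x8x4xf32>
%sc = vector.shape_cast %t : vector<1x1x8x4x8x4xf32> to vector<1x1x32x32xf32>
%w = vector.transfer_write %sc, %extracted_slice_19[%c0, %c0, %c0, %c0] : vector<1x1x32x32xf32>, tensor<1x1x32x32xf32>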
As discussed offline, I've tabled the tensor.unpack vectorization point as per this thread.
Now, with the new vectorization branch (along with the consumer fusion fix upstream) for Matmul + Relu, here's what I found :-
- It fails at the insert-cores pass, but that should be easy to fix - I can do that once we discuss the points below (because I need confirmation on whether we'd really go ahead with the new vectorization method or not).
- f32 Relu: it fails at the VectorToAIEVecConversions pass - element bit width = 32 and lane size is 16, because we have vector<1x1x1x1x4x4xf32>, and this marks it as an illegal op and fails the lowering.
- bf16 Relu: it fails at AIEVecToLLVM because it wants Relu to be a 1-D vector. @MaheshRavishankar - I recall you mentioning something along these lines - I'll try to add flattening of Relu; from what I see I could perhaps add this right before bufferization (@MaheshRavishankar correct me if I'm wrong). A sketch of such flattening is given after the notes below.

NOTE: The branch linked above switches off CANONICALIZE + CSE - but we still see the temporary buffer issue for all the cases above.
NOTE: The consumer fusion fix that I mention is needed because I've switched off CANONICALIZE.
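A minimal sketch of the kind of flattening I have in mind for the bf16 Relu case (the shapes and the exact insertion point are assumptions on my side, not something the current passes do):

// Flatten the multi-dimensional elementwise vector to 1-D so the AIEVec
// lowering only ever sees rank-1 vectors, then restore the original shape.
%flat = vector.shape_cast %v : vector<1x1x1x1x4x4xbf16> to vector<16xbf16>
%zero = arith.constant dense<0.000000e+00> : vector<16xbf16>
%relu = arith.maximumf %flat, %zero : vector<16xbf16>
%back = vector.shape_cast %relu : vector<16xbf16> to vector<1x1x1x1x4x4xbf16>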
Regarding the temporary buffer issue - I tried inspecting the IR and here's what I've inferred :-
%a = tensor<...xf32>
%b = tensor<...xbf16>
// Prologue
vectorized loop for parallel dims
READ vector from tensor %a
vectorized loop for reduction dims
WRITE vector to tensor %a
// Main
scf.for
vectorized loop for parallel dims
READ vector from tensor %a
vectorized loop for reduction dims
WRITE vector to tensor %a
// Epilogue
vectorized loop for parallel dims
READ vector from tensor %a
vectorized loop for reduction dims
Compute Relu/Elementwise
WRITE vector to tensor %b
WRITE vector to tensor %a
Since we have 3 sections whose boundaries only deal with TENSORS, we get this extra buffer issue.
This would not be solved even if %a and %b were of the SAME data type, because the 3 sections would still only exchange TENSORS at their boundaries.
Therefore we essentially need a temporary buffer to write the output of each section into.
(tensor) -> Prologue -> (tensor) -> Main -> (tensor) -> Epilogue -> (tensor)
(NOTE: Prologue, Main and Epilogue each internally work on vector.)
I speculate that if we propagate the vector type and make each section communicate its data to the next via vector values, we might not need this temporary buffer.
We should therefore have the following :-
(tensor) -> Prologue -> (vector) -> Main -> (vector) -> Epilogue -> (tensor)
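As a rough sketch of what that could look like (purely illustrative - the ops are standard MLIR, but nothing in the current pipeline produces this), the main loop would carry the accumulator as an scf.for iter_args of vector type, and only the Epilogue would write back to a tensor:

%c0 = arith.constant 0 : index
%c1 = arith.constant 1 : index
%c7 = arith.constant 7 : index
%cst = arith.constant 0.000000e+00 : f32
// Prologue: read the accumulator into a vector once.
%init = vector.transfer_read %a[%c0, %c0, %c0, %c0, %c0, %c0], %cst : tensor<1x1x8x8x4x4xf32>, vector<1x1x8x8x4x4xf32>
// Main: the vector is threaded through iter_args instead of being written back
// to tensor %a (and re-read) at every section boundary.
%acc = scf.for %k = %c1 to %c7 step %c1 iter_args(%v = %init) -> (vector<1x1x8x8x4x4xf32>) {
  // ... vectorized matmul contribution for iteration %k producing %contrib ...
  %sum = arith.addf %v, %contrib : vector<1x1x8x8x4x4xf32>
  scf.yield %sum : vector<1x1x8x8x4x4xf32>
}
// Epilogue: apply Relu/elementwise on the vector and only now touch tensors.
%zero = arith.constant dense<0.000000e+00> : vector<1x1x8x8x4x4xf32>
%relu = arith.maximumf %acc, %zero : vector<1x1x8x8x4x4xf32>
%relu_bf16 = arith.truncf %relu : vector<1x1x8x8x4x4xf32> to vector<1x1x8x8x4x4xbf16>
%res_b = vector.transfer_write %relu_bf16, %b[%c0, %c0, %c0, %c0, %c0, %c0] : vector<1x1x8x8x4x4xbf16>, tensor<1x1x8x8x4x4xbf16>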
We clearly have quite a few issues.
But to possibly make Relu work as a stop-gap solution by switching off CANONICALIZE + CSE (like in the new vectorization branch I linked above), I propose doing the following :-
- Propagate vector to the sections' boundaries - I currently don't know what can of worms this itself would entail.

@MaheshRavishankar please let me know your thoughts.
GOAL: Triage and solve the issue of redundant linalg.generic copies AND the temporary buffer issue.

EXPERIMENT: The following experiment was done on the main branch and for PURE MATMUL, for which you may find the input IR here :-

OBSERVATION:
- linalg.generic copies for LHS and RHS (experiment number 2).
- linalg.generic copy for the OUTPUT of Matmul - and this in turn gets us the temporary buffer issue (experiment number 3).
- linalg.generic copies for LHS, RHS and OUTPUT of Matmul, plus the temporary buffer issue (experiment number 4).

INFERENCE (so far): We have two orthogonal issues =>
- cse.

NOTE: We now need to take a look at how bufferization (or some other pass) is working in tandem with CSE/ForallOpIterArgsFolder separately - it might be that the issue is actually in bufferization, or vice-versa. Therefore this needs further triaging (and an update in this same issue thread).

GLOSSARY:
- linalg.generic copies: the linalg.generic ops that merely yield their input operand (i.e. element-wise copies), as seen in the bufferization output snippets above.