Closed: @jpienaar closed this issue 1 year ago.
Setting vectorizeGatherAccesses to false avoids the crash.
Sorry that you are hitting this. This is most likely due to https://reviews.llvm.org/D143429 (which I authored, so mea culpa), which was required to unblock the vectorisation of `tosa.resize`.
I can take a look at this, just need a repro first :)
> Setting vectorizeGatherAccesses to false avoids the crash.
Yeah, that will basically disable the vectorisation of `tosa.resize` altogether.
Great, let me try to reduce the file a bit (it's currently an MHLO input file, but I could run the pipeline until just before the crashing pass, after which the frontend probably doesn't matter).
We can probably add a flag to dump the executable before codegen lowering. That would give us a small repro starting from codegen, which we can easily reproduce using `iree-opt`. (I should have done this earlier, because it automates some of the triaging process. :p)
The dump before `buildTranslationPassPipeline` can be an option: https://github.com/openxla/iree/blob/2f6808e53500e5ed4c914b8b234b1e3fe298e653/compiler/src/iree/compiler/Dialect/HAL/Transforms/TranslateExecutables.cpp#L67
Small-ish reproducer:

func.func @main_dispatch_1_generic_20x1x24_f32(%workgroup_id_x: index, %workgroup_count_x: index, %workgroup_id_y: index, %workgroup_count_y: index, %9: tensor<20x1x24xf32>, %3: tensor<1x20xi32>, %4: tensor<257x24xf32>) {
  %c1 = arith.constant 1 : index
  %c4 = arith.constant 4 : index
  %c24 = arith.constant 24 : index
  %c20 = arith.constant 20 : index
  %c0 = arith.constant 0 : index
  %c256 = arith.constant 256 : index
  %5 = affine.apply affine_map<()[s0] -> (s0 * 20)>()[%workgroup_id_y]
  %6 = affine.apply affine_map<()[s0] -> (s0 * 20)>()[%workgroup_count_y]
  scf.for %arg0 = %5 to %c20 step %6 {
    %7 = affine.apply affine_map<()[s0] -> (s0 * 24)>()[%workgroup_id_x]
    %8 = affine.apply affine_map<()[s0] -> (s0 * 24)>()[%workgroup_count_x]
    scf.for %arg1 = %7 to %c24 step %8 {
      %10 = scf.for %arg2 = %c0 to %c20 step %c1 iter_args(%arg3 = %9) -> (tensor<20x1x24xf32>) {
        %11 = scf.for %arg4 = %c0 to %c24 step %c4 iter_args(%arg5 = %arg3) -> (tensor<20x1x24xf32>) {
          %extracted_slice = tensor.extract_slice %arg5[%arg2, 0, %arg4] [1, 1, 4] [1, 1, 1] : tensor<20x1x24xf32> to tensor<1x1x4xf32>
          %12 = linalg.generic {indexing_maps = [affine_map<(d0, d1, d2) -> (d0, d1, d2)>], iterator_types = ["parallel", "parallel", "parallel"]} outs(%extracted_slice : tensor<1x1x4xf32>) attrs = {lowering_config = #iree_codegen.lowering_config<tile_sizes = [[20, 0, 24], [1, 1, 4], [0, 0, 0]]>} {
          ^bb0(%out: f32):
            %13 = linalg.index 0 : index
            %14 = affine.apply affine_map<(d0, d1, d2) -> (d0 + d1 + d2)>(%arg0, %13, %arg2)
            %15 = linalg.index 2 : index
            %16 = linalg.index 1 : index
            %17 = affine.apply affine_map<(d0, d1, d2, d3) -> (d0 + d1 * 24 + d2 + d3)>(%arg1, %16, %15, %arg4)
            %extracted = tensor.extract %3[%c0, %14] : tensor<1x20xi32>
            %18 = arith.index_cast %extracted : i32 to index
            %19 = arith.maxsi %18, %c0 : index
            %20 = arith.minsi %19, %c256 : index
            %extracted_0 = tensor.extract %4[%20, %17] : tensor<257x24xf32>
            linalg.yield %extracted_0 : f32
          } -> tensor<1x1x4xf32>
          %inserted_slice = tensor.insert_slice %12 into %arg5[%arg2, 0, %arg4] [1, 1, 4] [1, 1, 1] : tensor<1x1x4xf32> into tensor<20x1x24xf32>
          scf.yield %inserted_slice : tensor<20x1x24xf32>
        }
        scf.yield %11 : tensor<20x1x24xf32>
      }
    }
  }
  return
}
Reproduces with:
iree-opt --pass-pipeline="builtin.module(func.func(iree-llvmcpu-vectorization{enable-vector-masking=true vectorize-gather-accesses=true vectorize-padding=false}))" input.mlir
@banach-space let me know if the cause isn't the one you expect (I made the IR here purely upstream; the pass is an IREE one, but this behavior is probably reproducible via the OSS pass too).
That's a great repro @jpienaar, thanks! And it does look like 100% mea culpa, sorry :(
It's not quite the change I was referring to originally; this is unrelated to the affine maps. Instead, it's a "contiguous vs gather" loads bug.
The vectoriser correctly identifies this `tensor.extract` as a contiguous load:
%extracted = tensor.extract %3[%c0, %14] : tensor<1x20xi32>
However, the 2nd `tensor.extract` should be categorized as "gather" instead of "contiguous":
%extracted_0 = tensor.extract %4[%20, %17] : tensor<257x24xf32>
That's due to these two operations:
%19 = arith.maxsi %18, %c0 : index
%20 = arith.minsi %19, %c256 : index
(annoyingly, it will be contiguous in many cases). This should fix it:
diff --git a/mlir/lib/Dialect/Linalg/Transforms/Vectorization.cpp b/mlir/lib/Dialect/Linalg/Transforms/Vectorization.cpp
index b54eb0fa9a4f..62f21abd98d2 100644
--- a/mlir/lib/Dialect/Linalg/Transforms/Vectorization.cpp
+++ b/mlir/lib/Dialect/Linalg/Transforms/Vectorization.cpp
@@ -773,6 +773,9 @@ static bool isLoopInvariantIdx(LinalgOp &linalgOp, Value &val) {
   if (isa<arith::ConstantOp>(ancestor))
     return true;
 
+  if (isa<arith::MaxSIOp>(ancestor))
+    return false;
+
   bool result = true;
   for (auto op : ancestor->getOperands())
     result &= isLoopInvariantIdx(linalgOp, op);
Are you able to verify? I'll send a PR later today - is that OK? Sorry for the bother :(
-Andrzej
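For background, the distinction the patch above guards is between contiguous and gather loads. A minimal sketch of the two access patterns (plain Python, illustrative only, not the MLIR semantics):

```python
def contiguous_load(buf, base, n):
    # one vector load of n consecutive elements starting at base
    return buf[base:base + n]

def gather_load(buf, indices):
    # per-lane indexed load: each lane reads its own address
    return [buf[i] for i in indices]
```

A contiguous load lowers to a single wide memory access, while a gather needs one address per lane, which is why misclassifying one as the other matters for both correctness and performance.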
That fixes this, and yes, later today SGTM (I have disabled it for now, so we shouldn't hit segfaults, and we'll just get better performance once your fix is integrated :))
Right, so I was slightly wrong. It's the first `tensor.extract`:

%extracted = tensor.extract %3[%c0, %14] : tensor<1x20xi32>

that makes the 2nd `tensor.extract` a gather load (it happens even before `arith.maxsi` and `arith.minsi`). I've just uploaded https://reviews.llvm.org/D148265.
Now, this case is even more interesting than I originally thought. Here's my reduced version:
func.func @main_dispatch_1_generic_20x1x24_f32(%input_1: tensor<1x20xi32>, %input_2: tensor<257x24xf32>, %arg0 : index, %arg1 : index, %arg2 : index, %arg3 : index) -> tensor<1x1x4xf32> {
  %c0 = arith.constant 0 : index
  %c256 = arith.constant 256 : index
  %output = tensor.empty() : tensor<1x1x4xf32>
  %1 = linalg.generic {indexing_maps = [affine_map<(d0, d1, d2) -> (d0, d1, d2)>], iterator_types = ["parallel", "parallel", "parallel"]} outs(%output : tensor<1x1x4xf32>) {
  ^bb0(%out: f32):
    %13 = linalg.index 0 : index
    %14 = affine.apply affine_map<(d0, d1, d2) -> (d0 + d1 + d2)>(%arg0, %13, %arg2)
    %15 = linalg.index 2 : index
    %16 = linalg.index 1 : index
    %17 = affine.apply affine_map<(d0, d1, d2, d3) -> (d0 + d1 * 24 + d2 + d3)>(%arg1, %16, %15, %arg3)
    %extracted_0 = tensor.extract %input_1[%c0, %14] : tensor<1x20xi32>
    %18 = arith.index_cast %extracted_0 : i32 to index
    %19 = arith.maxsi %18, %c0 : index
    %20 = arith.minsi %19, %c256 : index
    %extracted_1 = tensor.extract %input_2[%20, %17] : tensor<257x24xf32>
    linalg.yield %extracted_1 : f32
  } -> tensor<1x1x4xf32>
  return %1 : tensor<1x1x4xf32>
}

transform.sequence failures(propagate) {
^bb1(%arg1: !pdl.operation):
  %0 = transform.structured.match ops{["linalg.generic"]} in %arg1 : (!pdl.operation) -> !pdl.operation
  %1 = get_closest_isolated_parent %0 : (!pdl.operation) -> !pdl.operation
  %2 = transform.structured.vectorize %1 { vectorize_nd_extract }
}
After vectorisation:
func.func @main_dispatch_1_generic_20x1x24_f32(%arg0: tensor<1x20xi32>, %arg1: tensor<257x24xf32>, %arg2: index, %arg3: index, %arg4: index, %arg5: index) -> tensor<1x1x4xf32> {
  %cst = arith.constant dense<0> : vector<1x1x4xindex>
  %cst_0 = arith.constant dense<[0, 1, 2, 3]> : vector<4xindex>
  %cst_1 = arith.constant dense<true> : vector<1x1x4xi1>
  %cst_2 = arith.constant dense<0> : vector<1x1x4xi32>
  %c0 = arith.constant 0 : index
  %cst_3 = arith.constant dense<256> : vector<1x1x4xindex>
  %cst_4 = arith.constant dense<0.000000e+00> : vector<1x1x4xf32>
  %cst_5 = arith.constant dense<24> : vector<1x1x4xindex>
  %0 = tensor.empty() : tensor<1x1x4xf32>
  %1 = vector.broadcast %arg2 : index to vector<1x1x4xindex>
  %2 = vector.broadcast %arg4 : index to vector<1x1x4xindex>
  %3 = arith.addi %1, %2 : vector<1x1x4xindex>
  %4 = vector.broadcast %arg3 : index to vector<1x1x4xindex>
  %5 = vector.broadcast %cst_0 : vector<4xindex> to vector<1x1x4xindex>
  %6 = arith.addi %4, %5 : vector<1x1x4xindex>
  %7 = vector.broadcast %arg5 : index to vector<1x1x4xindex>
  %8 = arith.addi %6, %7 : vector<1x1x4xindex>
  %9 = vector.gather %arg0[%c0, %c0] [%3], %cst_1, %cst_2 : tensor<1x20xi32>, vector<1x1x4xindex>, vector<1x1x4xi1>, vector<1x1x4xi32> into vector<1x1x4xi32>
  %10 = arith.index_cast %9 : vector<1x1x4xi32> to vector<1x1x4xindex>
  %11 = arith.maxsi %10, %cst : vector<1x1x4xindex>
  %12 = arith.minsi %11, %cst_3 : vector<1x1x4xindex>
  %13 = arith.muli %12, %cst_5 : vector<1x1x4xindex>
  %14 = arith.addi %8, %13 : vector<1x1x4xindex>
  %15 = vector.gather %arg1[%c0, %c0] [%14], %cst_1, %cst_4 : tensor<257x24xf32>, vector<1x1x4xindex>, vector<1x1x4xi1>, vector<1x1x4xf32> into vector<1x1x4xf32>
  %16 = vector.transfer_write %15, %0[%c0, %c0, %c0] {in_bounds = [true, true, true]} : vector<1x1x4xf32>, tensor<1x1x4xf32>
  return %16 : tensor<1x1x4xf32>
}
Both `tensor.extract`s are treated as gather loads. Initially I thought that the first one was a contiguous load, but that was incorrect (the trailing index is constant as opposed to incrementing with the loop index). However, the first `tensor.extract` is just a "loop invariant" scalar load:

%1 = vector.broadcast %arg2 : index to vector<1x1x4xindex>
%2 = vector.broadcast %arg4 : index to vector<1x1x4xindex>
%3 = arith.addi %1, %2 : vector<1x1x4xindex>
%9 = vector.gather %arg0[%c0, %c0] [%3], %cst_1, %cst_2 : tensor<1x20xi32>, vector<1x1x4xindex>, vector<1x1x4xi1>, vector<1x1x4xi32> into vector<1x1x4xi32>

This should be hoisted out of the original `linalg.generic`.
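The hoisting opportunity can be illustrated with a small sketch (plain Python, illustrative only, not the MLIR semantics): a gather in which every lane uses the same loop-invariant index is equivalent to one scalar load broadcast across the lanes.

```python
def gather(base, index_vec):
    # per-lane indexed read, as a gather would do
    return [base[i] for i in index_vec]

def hoisted_scalar_load(base, idx, lanes):
    # one scalar read, broadcast across all lanes
    return [base[idx]] * lanes

buf = [10, 20, 30, 40]
# all lanes share the loop-invariant index 2, so the gather is redundant
assert gather(buf, [2, 2, 2, 2]) == hoisted_scalar_load(buf, 2, 4)
```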
> just get better performance post integrating your fix :))

Sadly, in this case it's just gather loads :( But hopefully you will see a perf boost elsewhere 🤞🏻.
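As a rough model of what those `vector.gather` ops compute (a Python sketch of the per-lane index/mask/pass-through semantics, not the actual MLIR implementation):

```python
def vector_gather(base, index_vec, mask, pass_thru):
    # lanes with mask set read base[index]; masked-off lanes
    # take the corresponding pass-through value instead
    return [base[i] if m else p
            for i, m, p in zip(index_vec, mask, pass_thru)]

# lanes 0 and 1 read from the buffer; lane 2 is masked off
# and takes the pass-through 0.0
result = vector_gather([1.0, 2.0, 3.0], [2, 0, 1],
                       [True, True, False], [0.0, 0.0, 0.0])
```

In the IR above the mask is all-true (`%cst_1`), so every lane performs an indexed read.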
Diego made me realize that I was wrong and that the actual issue is with the affine maps that are "automatically" generated when creating a `vector.transfer_read` op.

Specifically, the 2nd `tensor.extract` in the original example is a contiguous load, because these ops are actually loop invariant (I missed that previously):

%extracted = tensor.extract %3[%c0, %14] : tensor<1x20xi32>
%19 = arith.maxsi %18, %c0 : index
%20 = arith.minsi %19, %c256 : index
This will make the vectorizer (correctly) generate a `vector.transfer_read`, but because `rank(src) < rank(dst)`, the default affine maps are incorrect, hence the crash.
Fresh fix here: https://reviews.llvm.org/D148537. Third time lucky 🤞🏻
mlir/lib/IR/AffineMap.cpp:111: static mlir::AffineMap mlir::AffineMap::getMinorIdentityMap(unsigned int, unsigned int, mlir::MLIRContext *): Assertion `dims >= results && "Dimension mismatch"' failed.
...
mlir::linalg::vectorize(mlir::RewriterBase&, mlir::linalg::LinalgOp, llvm::ArrayRef, bool) mlir/lib/Dialect/Linalg/Transforms/Vectorization.cpp:1055
(working on creating an external reproducer)
Originally posted by @jpienaar in https://github.com/openxla/iree/issues/12288#issuecomment-1504574958
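The assertion in the trace corresponds to the rank invariant of `getMinorIdentityMap` (`dims >= results`); a rough Python model of that check (illustrative only, not the C++ implementation):

```python
def minor_identity_map(dims, results):
    # models the invariant asserted in mlir::AffineMap::getMinorIdentityMap:
    # the map sends the last `results` of the `dims` input dimensions to the
    # results, so it can only exist when dims >= results
    assert dims >= results, "Dimension mismatch"
    return list(range(dims - results, dims))
```

With `rank(src) < rank(dst)` the vectorizer asks for a map with more results than dimensions, which is exactly the case the assertion rejects.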