iree-org / iree

A retargetable MLIR-based machine learning compiler and runtime toolkit.
http://iree.dev/
Apache License 2.0

Crash in LinAlg gather vectorization #13036

Closed by jpienaar 1 year ago

jpienaar commented 1 year ago
I'm getting a crash that bisects to here:

mlir/lib/IR/AffineMap.cpp:111: static mlir::AffineMap mlir::AffineMap::getMinorIdentityMap(unsigned int, unsigned int, mlir::MLIRContext *): Assertion `dims >= results && "Dimension mismatch"' failed.
...
mlir::linalg::vectorize(mlir::RewriterBase&, mlir::linalg::LinalgOp, llvm::ArrayRef, bool) mlir/lib/Dialect/Linalg/Transforms/Vectorization.cpp:1055

(working on creating an external reproducer)

Originally posted by @jpienaar in https://github.com/openxla/iree/issues/12288#issuecomment-1504574958

jpienaar commented 1 year ago

Setting vectorizeGatherAccesses to false avoids the crash.

banach-space commented 1 year ago

Sorry that you are hitting this. This is most likely due to https://reviews.llvm.org/D143429 (that I authored, so mea culpa), which was required to unblock the vectorisation of tosa.resize.

I can take a look at this, just need a repro first :)

Setting vectorizeGatherAccesses to false avoids the crash.

Yeah, that will basically disable the vectorisation of tosa.resize altogether.

jpienaar commented 1 year ago

Great, let me try to reduce the file a bit (it's currently an MHLO input file, though I could run the pipeline until just before the crashing pass, after which the input dialect probably doesn't matter).

hanhanW commented 1 year ago

We can probably add a flag to dump the executable before codegen lowering. That would give us a small repro starting from codegen, which we could easily reproduce using iree-opt. (I should have done this earlier, since it automates some of the triaging process. :p)

The dump before buildTranslationPassPipeline can be an option: https://github.com/openxla/iree/blob/2f6808e53500e5ed4c914b8b234b1e3fe298e653/compiler/src/iree/compiler/Dialect/HAL/Transforms/TranslateExecutables.cpp#L67

jpienaar commented 1 year ago

Smallish reproducer:

func.func @main_dispatch_1_generic_20x1x24_f32(%workgroup_id_x: index, %workgroup_count_x: index, %workgroup_id_y: index, %workgroup_count_y: index, %9: tensor<20x1x24xf32>, %3: tensor<1x20xi32>, %4: tensor<257x24xf32>) {
  %c1 = arith.constant 1 : index
  %c4 = arith.constant 4 : index
  %c24 = arith.constant 24 : index
  %c20 = arith.constant 20 : index
  %c0 = arith.constant 0 : index
  %c256 = arith.constant 256 : index
  %5 = affine.apply affine_map<()[s0] -> (s0 * 20)>()[%workgroup_id_y]
  %6 = affine.apply affine_map<()[s0] -> (s0 * 20)>()[%workgroup_count_y]
  scf.for %arg0 = %5 to %c20 step %6 {
    %7 = affine.apply affine_map<()[s0] -> (s0 * 24)>()[%workgroup_id_x]
    %8 = affine.apply affine_map<()[s0] -> (s0 * 24)>()[%workgroup_count_x]
    scf.for %arg1 = %7 to %c24 step %8 {
      %10 = scf.for %arg2 = %c0 to %c20 step %c1 iter_args(%arg3 = %9) -> (tensor<20x1x24xf32>) {
        %11 = scf.for %arg4 = %c0 to %c24 step %c4 iter_args(%arg5 = %arg3) -> (tensor<20x1x24xf32>) {
          %extracted_slice = tensor.extract_slice %arg5[%arg2, 0, %arg4] [1, 1, 4] [1, 1, 1] : tensor<20x1x24xf32> to tensor<1x1x4xf32>
          %12 = linalg.generic {indexing_maps = [affine_map<(d0, d1, d2) -> (d0, d1, d2)>], iterator_types = ["parallel", "parallel", "parallel"]} outs(%extracted_slice : tensor<1x1x4xf32>) attrs =  {lowering_config = #iree_codegen.lowering_config<tile_sizes = [[20, 0, 24], [1, 1, 4], [0, 0, 0]]>} {
          ^bb0(%out: f32):
            %13 = linalg.index 0 : index
            %14 = affine.apply affine_map<(d0, d1, d2) -> (d0 + d1 + d2)>(%arg0, %13, %arg2)
            %15 = linalg.index 2 : index
            %16 = linalg.index 1 : index
            %17 = affine.apply affine_map<(d0, d1, d2, d3) -> (d0 + d1 * 24 + d2 + d3)>(%arg1, %16, %15, %arg4)
            %extracted = tensor.extract %3[%c0, %14] : tensor<1x20xi32>
            %18 = arith.index_cast %extracted : i32 to index
            %19 = arith.maxsi %18, %c0 : index
            %20 = arith.minsi %19, %c256 : index
            %extracted_0 = tensor.extract %4[%20, %17] : tensor<257x24xf32>
            linalg.yield %extracted_0 : f32
          } -> tensor<1x1x4xf32>
          %inserted_slice = tensor.insert_slice %12 into %arg5[%arg2, 0, %arg4] [1, 1, 4] [1, 1, 1] : tensor<1x1x4xf32> into tensor<20x1x24xf32>
          scf.yield %inserted_slice : tensor<20x1x24xf32>
        }
        scf.yield %11 : tensor<20x1x24xf32>
      }
    }
  }
  return
}

Reproduces with

iree-opt --pass-pipeline="builtin.module(func.func(iree-llvmcpu-vectorization{enable-vector-masking=true vectorize-gather-accesses=true vectorize-padding=false}))" input.mlir

jpienaar commented 1 year ago

@banach-space let me know if the cause isn't the one you expect (I kept the IR here to upstream ops only; the pass is an IREE one, but this behavior is probably reproducible via the upstream pass too).

banach-space commented 1 year ago

That's a great repro @jpienaar, thanks! And it does look like 100% mea culpa, sorry :(

It's not quite the change I was originally referring to - this one is unrelated to the affine maps. Instead, it's a "contiguous vs gather" loads bug.

The vectoriser correctly identifies this tensor.extract as a contiguous load:

            %extracted = tensor.extract %3[%c0, %14] : tensor<1x20xi32>

However, the 2nd tensor.extract should be categorized as "gather" instead of "contiguous":

            %extracted_0 = tensor.extract %4[%20, %17] : tensor<257x24xf32>

That's due to these two operations:

            %19 = arith.maxsi %18, %c0 : index
            %20 = arith.minsi %19, %c256 : index

(annoyingly, it will be contiguous in many cases). This should fix it:

diff --git a/mlir/lib/Dialect/Linalg/Transforms/Vectorization.cpp b/mlir/lib/Dialect/Linalg/Transforms/Vectorization.cpp
index b54eb0fa9a4f..62f21abd98d2 100644
--- a/mlir/lib/Dialect/Linalg/Transforms/Vectorization.cpp
+++ b/mlir/lib/Dialect/Linalg/Transforms/Vectorization.cpp
@@ -773,6 +773,9 @@ static bool isLoopInvariantIdx(LinalgOp &linalgOp, Value &val) {
   if (isa<arith::ConstantOp>(ancestor))
     return true;

+  if (isa<arith::MaxSIOp>(ancestor))
+    return false;
+
   bool result = true;
   for (auto op : ancestor->getOperands())
     result &= isLoopInvariantIdx(linalgOp, op);
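
The patched check can be modeled schematically. Below is a Python stand-in for the use-def walk (illustrative only, not the actual MLIR C++ API; the dict-based op representation is made up):

```python
# Schematic model of isLoopInvariantIdx after the patch above: an index is
# loop-invariant only if its whole use-def chain bottoms out in constants,
# with arith.maxsi now forcing a conservative "not invariant" answer.

def is_loop_invariant_idx(op):
    if op is None:                    # block argument: assume variant
        return False
    if op["name"] == "arith.constant":
        return True
    if op["name"] == "arith.maxsi":   # the early bail-out added by the diff
        return False
    # otherwise, invariant only if every operand is invariant
    return all(is_loop_invariant_idx(o) for o in op["operands"])

c0 = {"name": "arith.constant", "operands": []}
clamped = {"name": "arith.maxsi", "operands": [c0, c0]}
idx = {"name": "arith.addi", "operands": [c0, clamped]}

print(is_loop_invariant_idx(c0))   # True
print(is_loop_invariant_idx(idx))  # False: maxsi appears in the chain
```

With the clamp in the use-def chain, the second tensor.extract's row index is no longer treated as loop-invariant, so the load takes the gather path instead of the crashing contiguous path.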

Are you able to verify? I'll send a PR later today - is that OK? Sorry for the bother :(

-Andrzej

jpienaar commented 1 year ago

That fixes this and yes later today SGTM (I have disabled it for now, so we shouldn't hit segfaults and just get better performance post integrating your fix :))

banach-space commented 1 year ago

Right, so I was slightly wrong. It's the first tensor.extract:

            %extracted = tensor.extract %3[%c0, %14] : tensor<1x20xi32>

that makes the 2nd tensor.extract a gather load (it happens even before arith.maxsi and arith.minsi). I've just uploaded https://reviews.llvm.org/D148265.

Now, this case is even more interesting than I originally thought. Here's my reduced version:

func.func @main_dispatch_1_generic_20x1x24_f32(%input_1: tensor<1x20xi32>, %input_2: tensor<257x24xf32>, %arg0 : index, %arg1 : index, %arg2 : index, %arg3 : index) -> tensor<1x1x4xf32> {
  %c0 = arith.constant 0 : index
  %c256 = arith.constant 256 : index
  %output = tensor.empty() : tensor<1x1x4xf32>
  %1 = linalg.generic {indexing_maps = [affine_map<(d0, d1, d2) -> (d0, d1, d2)>], iterator_types = ["parallel", "parallel", "parallel"]} outs(%output : tensor<1x1x4xf32>) {
  ^bb0(%out: f32):
    %13 = linalg.index 0 : index
    %14 = affine.apply affine_map<(d0, d1, d2) -> (d0 + d1 + d2)>(%arg0, %13, %arg2)
    %15 = linalg.index 2 : index
    %16 = linalg.index 1 : index
    %17 = affine.apply affine_map<(d0, d1, d2, d3) -> (d0 + d1 * 24 + d2 + d3)>(%arg1, %16, %15, %arg3)
    %extracted_0 = tensor.extract %input_1[%c0, %14] : tensor<1x20xi32>
    %18 = arith.index_cast %extracted_0 : i32 to index
    %19 = arith.maxsi %18, %c0 : index
    %20 = arith.minsi %19, %c256 : index
    %extracted_1 = tensor.extract %input_2[%20, %17] : tensor<257x24xf32>
    linalg.yield %extracted_1 : f32
  } -> tensor<1x1x4xf32>
  return %1 : tensor<1x1x4xf32>
}

transform.sequence failures(propagate) {
^bb1(%arg1: !pdl.operation):
  %0 = transform.structured.match ops{["linalg.generic"]} in %arg1 : (!pdl.operation) -> !pdl.operation
  %1 = get_closest_isolated_parent %0 : (!pdl.operation) -> !pdl.operation
  %2 = transform.structured.vectorize %1  { vectorize_nd_extract }
}

After vectorisation:

  func.func @main_dispatch_1_generic_20x1x24_f32(%arg0: tensor<1x20xi32>, %arg1: tensor<257x24xf32>, %arg2: index, %arg3: index, %arg4: index, %arg5: index) -> tensor<1x1x4xf32> {
    %cst = arith.constant dense<0> : vector<1x1x4xindex>
    %cst_0 = arith.constant dense<[0, 1, 2, 3]> : vector<4xindex>
    %cst_1 = arith.constant dense<true> : vector<1x1x4xi1>
    %cst_2 = arith.constant dense<0> : vector<1x1x4xi32>
    %c0 = arith.constant 0 : index
    %cst_3 = arith.constant dense<256> : vector<1x1x4xindex>
    %cst_4 = arith.constant dense<0.000000e+00> : vector<1x1x4xf32>
    %cst_5 = arith.constant dense<24> : vector<1x1x4xindex>
    %0 = tensor.empty() : tensor<1x1x4xf32>
    %1 = vector.broadcast %arg2 : index to vector<1x1x4xindex>
    %2 = vector.broadcast %arg4 : index to vector<1x1x4xindex>
    %3 = arith.addi %1, %2 : vector<1x1x4xindex>
    %4 = vector.broadcast %arg3 : index to vector<1x1x4xindex>
    %5 = vector.broadcast %cst_0 : vector<4xindex> to vector<1x1x4xindex>
    %6 = arith.addi %4, %5 : vector<1x1x4xindex>
    %7 = vector.broadcast %arg5 : index to vector<1x1x4xindex>
    %8 = arith.addi %6, %7 : vector<1x1x4xindex>
    %9 = vector.gather %arg0[%c0, %c0] [%3], %cst_1, %cst_2 : tensor<1x20xi32>, vector<1x1x4xindex>, vector<1x1x4xi1>, vector<1x1x4xi32> into vector<1x1x4xi32>
    %10 = arith.index_cast %9 : vector<1x1x4xi32> to vector<1x1x4xindex>
    %11 = arith.maxsi %10, %cst : vector<1x1x4xindex>
    %12 = arith.minsi %11, %cst_3 : vector<1x1x4xindex>
    %13 = arith.muli %12, %cst_5 : vector<1x1x4xindex>
    %14 = arith.addi %8, %13 : vector<1x1x4xindex>
    %15 = vector.gather %arg1[%c0, %c0] [%14], %cst_1, %cst_4 : tensor<257x24xf32>, vector<1x1x4xindex>, vector<1x1x4xi1>, vector<1x1x4xf32> into vector<1x1x4xf32>
    %16 = vector.transfer_write %15, %0[%c0, %c0, %c0] {in_bounds = [true, true, true]} : vector<1x1x4xf32>, tensor<1x1x4xf32>
    return %16 : tensor<1x1x4xf32>
  }

Both tensor.extracts are treated as gather loads. Initially I thought that the first one was a contiguous load, but that was incorrect (the trailing index is constant as opposed to incrementing with the loop index). However, the first tensor.extract is just a "loop invariant" scalar load:

    %1 = vector.broadcast %arg2 : index to vector<1x1x4xindex>
    %2 = vector.broadcast %arg4 : index to vector<1x1x4xindex>
    %3 = arith.addi %1, %2 : vector<1x1x4xindex>
    %9 = vector.gather %arg0[%c0, %c0] [%3], %cst_1, %cst_2 : tensor<1x20xi32>, vector<1x1x4xindex>, vector<1x1x4xi1>, vector<1x1x4xi32> into vector<1x1x4xi32>

This should be hoisted out of the original linalg.generic.
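
The hoisting suggested here is standard loop-invariant code motion. A minimal Python sketch of the idea (statement and variable names are hypothetical, and side effects are ignored):

```python
def hoist_invariants(stmts, loop_vars):
    """stmts: ordered list of (target, deps); loop_vars: names that change
    per iteration. A statement is invariant if every dependence is either
    defined outside the loop (and not loop-carried) or by an already-hoisted
    statement. Returns (hoisted, kept), each preserving program order."""
    defined = {t for t, _ in stmts}
    invariant = set()
    hoisted, kept = [], []
    for target, deps in stmts:
        if all((d in invariant) if d in defined else (d not in loop_vars)
               for d in deps):
            invariant.add(target)
            hoisted.append((target, deps))
        else:
            kept.append((target, deps))
    return hoisted, kept

# Mirrors the snippet above: %3 depends only on function arguments, so the
# gather feeding the first tensor.extract (%9) can move out of the loop.
stmts = [
    ("%3", ["%arg2", "%arg4"]),   # loop invariant
    ("%9", ["%3"]),               # first gather: also invariant
    ("%14", ["%9", "%i"]),        # mixes in the loop index, stays inside
]
hoisted, kept = hoist_invariants(stmts, loop_vars={"%i"})
print([t for t, _ in hoisted])   # ['%3', '%9']
```

This is only the dependence analysis half; a real pass would also verify the hoisted ops are speculatable before moving them.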

just get better performance post integrating your fix :))

Sadly, in this case it's just gather loads :( But hopefully you will see perf boost elsewhere 🤞🏻 .

banach-space commented 1 year ago

Diego made me realize that I was wrong and that the actual issue is with the affine maps that are "automatically" generated when creating a vector.transfer_read Op.

Specifically, the 2nd tensor.extract in the original example is a contiguous load because these Ops are actually loop invariant (I missed that previously):

            %extracted = tensor.extract %3[%c0, %14] : tensor<1x20xi32>
            %19 = arith.maxsi %18, %c0 : index
            %20 = arith.minsi %19, %c256 : index

This will make the vectorizer (correctly) generate a vector.transfer_read, but because rank(src) < rank(dst), the default affine maps are incorrect, hence the crash.
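
For context, the assertion quoted at the top of the thread guards exactly this precondition: a minor identity map needs at least as many dims as results. A tiny Python model of that check (illustrative only, not the MLIR API):

```python
def get_minor_identity_map(dims, results):
    """Model of mlir::AffineMap::getMinorIdentityMap: map the last
    `results` dims of a `dims`-dimensional space to themselves,
    e.g. dims=3, results=2 gives (d0, d1, d2) -> (d1, d2)."""
    assert dims >= results, "Dimension mismatch"  # the failing assertion
    return list(range(dims - results, dims))

print(get_minor_identity_map(3, 2))  # [1, 2]

# A rank-2 source read into a rank-3 vector asks for results > dims,
# i.e. the rank(src) < rank(dst) case, and trips the assertion:
# get_minor_identity_map(2, 3)  ->  AssertionError: Dimension mismatch
```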

Fresh fix here: https://reviews.llvm.org/D148537. Third time lucky 🤞🏻