nicolasvasilache opened 1 year ago
With the latest integrate, this is hiding the error I used to have, which was:
// IREE fails lowering of tensor.pack/unpack ops with:
// <stdin>:23295:20: error: 'tensor.extract_slice' op unhandled operation when
// converting to destination passing style
// %unpack_3675 = tensor.unpack %7830 inner_dims_pos = [0, 1]
// inner_tiles = [384, 16] into %7826
// : tensor<1x1x384x16xf32> -> tensor<384x2xf32>
I expect that fixing the crash will then surface the error I am expecting. I'll file another bug once confirmed.
I'll wait for #12822 to land before looking further.
This is a simpler repro
Looking at the CPU side, I think there is a missing folder here that would help.
%1 = linalg.fill ins(%cst : f32) outs(%0 : tensor<999x3999xf32>) -> tensor<999x3999xf32>
%pack_4 = tensor.pack %1 padding_value(%cst_3 : f32) inner_dims_pos = [0, 1] inner_tiles = [128, 256] into %4 : tensor<999x3999xf32> -> tensor<8x16x128x256xf32>
This could just be folded into
%fill_folded = linalg.fill ins(%cst : f32) outs(%4 : tensor<8x16x128x256xf32>) -> tensor<8x16x128x256xf32>
Then a lot of the backend code generation works out better. Basically (at least on the CPU side; I haven't looked at the GPU side yet, but it is probably similar), the `fill` -> `pack` code generation using tile + fuse with bufferization results in a stack allocation. We can try to address that, but a folder at the graph level makes all of these problems go away. I tried your example with those changes applied manually, and most of the code generation works as expected (see the dump-ir-after-all here: https://gist.github.com/MaheshRavishankar/115fcb30c47135c53a16355ea27387d8).
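Since this fold comes up repeatedly below, here is a quick numpy model (not IREE code; `pack_2d` is a hypothetical helper modeling `tensor.pack` semantics) checking that the fold is value-correct for the shapes in the repro. Note the fold is only valid when the pack's padding value equals the fill value; otherwise the padded region would differ.

```python
import numpy as np

def pack_2d(src, tile0, tile1, pad_value):
    """Reference semantics of tensor.pack on a 2-D tensor with
    inner_dims_pos = [0, 1]: pad up to tile multiples, then reshape and
    transpose so each (i, j) index selects a tile0 x tile1 tile."""
    d0, d1 = src.shape
    n0, n1 = -(-d0 // tile0), -(-d1 // tile1)  # ceil-div tile counts
    padded = np.full((n0 * tile0, n1 * tile1), pad_value, dtype=src.dtype)
    padded[:d0, :d1] = src
    return padded.reshape(n0, tile0, n1, tile1).transpose(0, 2, 1, 3)

cst = 1.5  # any fill value; must match the pack's padding_value
fill = np.full((999, 3999), cst, dtype=np.float32)    # linalg.fill result
packed_fill = pack_2d(fill, 128, 256, pad_value=cst)  # tensor.pack of it
folded = np.full((8, 16, 128, 256), cst, dtype=np.float32)  # folded fill
assert packed_fill.shape == folded.shape
assert np.array_equal(packed_fill, folded)
```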
Some points of interest. After dispatch region formation, this is the IR:
// -----// IR Dump After FormDispatchWorkgroups (iree-flow-form-dispatch-workgroups) //----- //
func.func @fill_matmul_static(%arg0: !hal.buffer_view, %arg1: !hal.buffer_view, %arg2: !hal.buffer_view) -> !hal.buffer_view attributes {iree.abi.stub} {
%c8 = arith.constant 8 : index
%c63 = arith.constant 63 : index
%c16 = arith.constant 16 : index
%c1 = arith.constant 1 : index
%c128 = arith.constant 128 : index
%c256 = arith.constant 256 : index
%c3999 = arith.constant 3999 : index
%c999 = arith.constant 999 : index
%0 = hal.tensor.import %arg0 "input 0" : !hal.buffer_view -> tensor<999x1999xf32>
%1 = hal.tensor.import %arg1 "input 1" : !hal.buffer_view -> tensor<1999x3999xf32>
%2 = hal.tensor.import %arg2 "input 2" : !hal.buffer_view -> tensor<999x3999xf32>
%3 = flow.dispatch.workgroups[%c8, %c63](%0) : (tensor<999x1999xf32>) -> tensor<8x63x128x32xf32> =
(%arg3: !flow.dispatch.tensor<readonly:tensor<999x1999xf32>>, %arg4: !flow.dispatch.tensor<writeonly:tensor<8x63x128x32xf32>>) {
%cst = arith.constant 0.000000e+00 : f32
%8 = flow.dispatch.tensor.load %arg3, offsets = [0, 0], sizes = [999, 1999], strides = [1, 1] : !flow.dispatch.tensor<readonly:tensor<999x1999xf32>> -> tensor<999x1999xf32>
%9 = tensor.empty() : tensor<8x63x128x32xf32>
%pack = tensor.pack %8 padding_value(%cst : f32) inner_dims_pos = [0, 1] inner_tiles = [128, 32] into %9 : tensor<999x1999xf32> -> tensor<8x63x128x32xf32>
flow.dispatch.tensor.store %pack, %arg4, offsets = [0, 0, 0, 0], sizes = [8, 63, 128, 32], strides = [1, 1, 1, 1] : tensor<8x63x128x32xf32> -> !flow.dispatch.tensor<writeonly:tensor<8x63x128x32xf32>>
flow.return
} count(%arg3: index, %arg4: index) -> (index, index, index) {
%x, %y, %z = flow.dispatch.workgroup_count_from_dag_root %arg3, %arg4
flow.return %x, %y, %z : index, index, index
}
%4 = flow.dispatch.workgroups[%c63, %c16](%1) : (tensor<1999x3999xf32>) -> tensor<63x16x256x32xf32> =
(%arg3: !flow.dispatch.tensor<readonly:tensor<1999x3999xf32>>, %arg4: !flow.dispatch.tensor<writeonly:tensor<63x16x256x32xf32>>) {
%cst = arith.constant 0.000000e+00 : f32
%8 = flow.dispatch.tensor.load %arg3, offsets = [0, 0], sizes = [1999, 3999], strides = [1, 1] : !flow.dispatch.tensor<readonly:tensor<1999x3999xf32>> -> tensor<1999x3999xf32>
%9 = tensor.empty() : tensor<63x16x256x32xf32>
%pack = tensor.pack %8 padding_value(%cst : f32) inner_dims_pos = [1, 0] inner_tiles = [256, 32] into %9 : tensor<1999x3999xf32> -> tensor<63x16x256x32xf32>
flow.dispatch.tensor.store %pack, %arg4, offsets = [0, 0, 0, 0], sizes = [63, 16, 256, 32], strides = [1, 1, 1, 1] : tensor<63x16x256x32xf32> -> !flow.dispatch.tensor<writeonly:tensor<63x16x256x32xf32>>
flow.return
} count(%arg3: index, %arg4: index) -> (index, index, index) {
%x, %y, %z = flow.dispatch.workgroup_count_from_dag_root %arg3, %arg4
flow.return %x, %y, %z : index, index, index
}
%5 = flow.dispatch.workgroups[%c8, %c16, %c128, %c256, %c1, %c1](%3, %4) : (tensor<8x63x128x32xf32>, tensor<63x16x256x32xf32>) -> tensor<8x16x128x256xf32> =
(%arg3: !flow.dispatch.tensor<readonly:tensor<8x63x128x32xf32>>, %arg4: !flow.dispatch.tensor<readonly:tensor<63x16x256x32xf32>>, %arg5: !flow.dispatch.tensor<writeonly:tensor<8x16x128x256xf32>>) {
%cst = arith.constant 0.000000e+00 : f32
%8 = flow.dispatch.tensor.load %arg3, offsets = [0, 0, 0, 0], sizes = [8, 63, 128, 32], strides = [1, 1, 1, 1] : !flow.dispatch.tensor<readonly:tensor<8x63x128x32xf32>> -> tensor<8x63x128x32xf32>
%9 = flow.dispatch.tensor.load %arg4, offsets = [0, 0, 0, 0], sizes = [63, 16, 256, 32], strides = [1, 1, 1, 1] : !flow.dispatch.tensor<readonly:tensor<63x16x256x32xf32>> -> tensor<63x16x256x32xf32>
%10 = tensor.empty() : tensor<8x16x128x256xf32>
%11 = linalg.fill ins(%cst : f32) outs(%10 : tensor<8x16x128x256xf32>) -> tensor<8x16x128x256xf32>
%12 = linalg.generic {indexing_maps = [affine_map<(d0, d1, d2, d3, d4, d5) -> (d0, d4, d2, d5)>, affine_map<(d0, d1, d2, d3, d4, d5) -> (d4, d1, d3, d5)>, affine_map<(d0, d1, d2, d3, d4, d5) -> (d0, d1, d2, d3)>], iterator_types = ["parallel", "parallel", "parallel", "parallel", "reduction", "reduction"]} ins(%8, %9 : tensor<8x63x128x32xf32>, tensor<63x16x256x32xf32>) outs(%11 : tensor<8x16x128x256xf32>) {
^bb0(%in: f32, %in_0: f32, %out: f32):
%13 = arith.mulf %in, %in_0 : f32
%14 = arith.addf %out, %13 : f32
linalg.yield %14 : f32
} -> tensor<8x16x128x256xf32>
flow.dispatch.tensor.store %12, %arg5, offsets = [0, 0, 0, 0], sizes = [8, 16, 128, 256], strides = [1, 1, 1, 1] : tensor<8x16x128x256xf32> -> !flow.dispatch.tensor<writeonly:tensor<8x16x128x256xf32>>
flow.return
} count(%arg3: index, %arg4: index, %arg5: index, %arg6: index, %arg7: index, %arg8: index) -> (index, index, index) {
%x, %y, %z = flow.dispatch.workgroup_count_from_dag_root %arg3, %arg4, %arg5, %arg6, %arg7, %arg8
flow.return %x, %y, %z : index, index, index
}
%6 = flow.dispatch.workgroups[%c999, %c3999](%5, %2) : (tensor<8x16x128x256xf32>, tensor<999x3999xf32>) -> tensor<999x3999xf32> =
(%arg3: !flow.dispatch.tensor<readonly:tensor<8x16x128x256xf32>>, %arg4: !flow.dispatch.tensor<readonly:tensor<999x3999xf32>>, %arg5: !flow.dispatch.tensor<writeonly:tensor<999x3999xf32>>) {
%8 = flow.dispatch.tensor.load %arg3, offsets = [0, 0, 0, 0], sizes = [8, 16, 128, 256], strides = [1, 1, 1, 1] : !flow.dispatch.tensor<readonly:tensor<8x16x128x256xf32>> -> tensor<8x16x128x256xf32>
%9 = flow.dispatch.tensor.load %arg4, offsets = [0, 0], sizes = [999, 3999], strides = [1, 1] : !flow.dispatch.tensor<readonly:tensor<999x3999xf32>> -> tensor<999x3999xf32>
%unpack = tensor.unpack %8 inner_dims_pos = [0, 1] inner_tiles = [128, 256] into %9 : tensor<8x16x128x256xf32> -> tensor<999x3999xf32>
flow.dispatch.tensor.store %unpack, %arg5, offsets = [0, 0], sizes = [999, 3999], strides = [1, 1] : tensor<999x3999xf32> -> !flow.dispatch.tensor<writeonly:tensor<999x3999xf32>>
flow.return
} count(%arg3: index, %arg4: index) -> (index, index, index) {
%x, %y, %z = flow.dispatch.workgroup_count_from_dag_root %arg3, %arg4
flow.return %x, %y, %z : index, index, index
}
%7 = hal.tensor.export %6 "output 0" : tensor<999x3999xf32> -> !hal.buffer_view
return %7 : !hal.buffer_view
}
That looks good.
The issue seems to be with the unpack dispatch; something very strange is happening with bufferization. This is the IR for the unpack dispatch before bufferization:
module {
func.func @fill_matmul_static_dispatch_3() {
%cst = arith.constant 0.000000e+00 : f32
%c256 = arith.constant 256 : index
%c128 = arith.constant 128 : index
%c3999 = arith.constant 3999 : index
%c999 = arith.constant 999 : index
%c41287680 = arith.constant 41287680 : index
%c0 = arith.constant 0 : index
%0 = hal.interface.binding.subspan set(0) binding(0) type(storage_buffer) alignment(64) offset(%c41287680) flags(ReadOnly) : !flow.dispatch.tensor<readonly:tensor<8x16x128x256xf32>>
%1 = hal.interface.binding.subspan set(0) binding(1) type(storage_buffer) alignment(64) offset(%c0) flags(ReadOnly) : !flow.dispatch.tensor<readonly:tensor<999x3999xf32>>
%2 = hal.interface.binding.subspan set(0) binding(2) type(storage_buffer) alignment(64) offset(%c0) : !flow.dispatch.tensor<writeonly:tensor<999x3999xf32>>
%workgroup_id_x = hal.interface.workgroup.id[0] : index
%workgroup_count_x = hal.interface.workgroup.count[0] : index
%workgroup_id_y = hal.interface.workgroup.id[1] : index
%workgroup_count_y = hal.interface.workgroup.count[1] : index
%3 = affine.apply affine_map<()[s0] -> (s0 * 128)>()[%workgroup_id_y]
%4 = affine.apply affine_map<()[s0] -> (s0 * 128)>()[%workgroup_count_y]
scf.for %arg0 = %3 to %c999 step %4 {
%5 = affine.min affine_map<(d0) -> (-d0 + 999, 128)>(%arg0)
%6 = affine.apply affine_map<()[s0] -> (s0 * 256)>()[%workgroup_id_x]
%7 = affine.apply affine_map<()[s0] -> (s0 * 256)>()[%workgroup_count_x]
scf.for %arg1 = %6 to %c3999 step %7 {
%8 = affine.min affine_map<(d0) -> (-d0 + 3999, 256)>(%arg1)
%9 = affine.apply affine_map<(d0) -> (d0 floordiv 128)>(%arg0)
%10 = affine.apply affine_map<(d0) -> (d0 floordiv 256)>(%arg1)
%11 = flow.dispatch.tensor.load %0, offsets = [%9, %10, 0, 0], sizes = [1, 1, 128, 256], strides = [1, 1, 1, 1] : !flow.dispatch.tensor<readonly:tensor<8x16x128x256xf32>> -> tensor<1x1x128x256xf32>
%12 = flow.dispatch.tensor.load %1, offsets = [%arg0, %arg1], sizes = [%5, %8], strides = [1, 1] : !flow.dispatch.tensor<readonly:tensor<999x3999xf32>> -> tensor<?x?xf32>
%13 = scf.for %arg2 = %c0 to %5 step %c128 iter_args(%arg3 = %12) -> (tensor<?x?xf32>) {
%14 = affine.min affine_map<(d0)[s0] -> (-d0 + s0, 128)>(%arg2)[%5]
%15 = scf.for %arg4 = %c0 to %8 step %c256 iter_args(%arg5 = %arg3) -> (tensor<?x?xf32>) {
%16 = affine.min affine_map<(d0)[s0] -> (-d0 + s0, 256)>(%arg4)[%8]
%17 = affine.apply affine_map<(d0) -> (d0 floordiv 128)>(%arg2)
%18 = affine.apply affine_map<(d0) -> (d0 floordiv 256)>(%arg4)
%extracted_slice = tensor.extract_slice %arg5[%arg2, %arg4] [%14, %16] [1, 1] : tensor<?x?xf32> to tensor<?x?xf32>
%19 = vector.transfer_read %11[%17, %18, %c0, %c0], %cst {in_bounds = [true, true]} : tensor<1x1x128x256xf32>, vector<128x256xf32>
%extracted_slice_0 = tensor.extract_slice %extracted_slice[0, 0] [%14, %16] [1, 1] : tensor<?x?xf32> to tensor<?x?xf32>
%20 = vector.transfer_write %19, %extracted_slice_0[%c0, %c0] : vector<128x256xf32>, tensor<?x?xf32>
%inserted_slice = tensor.insert_slice %20 into %extracted_slice[0, 0] [%14, %16] [1, 1] : tensor<?x?xf32> into tensor<?x?xf32>
%inserted_slice_1 = tensor.insert_slice %inserted_slice into %arg5[%arg2, %arg4] [%14, %16] [1, 1] : tensor<?x?xf32> into tensor<?x?xf32>
scf.yield %inserted_slice_1 : tensor<?x?xf32>
}
scf.yield %15 : tensor<?x?xf32>
}
flow.dispatch.tensor.store %13, %2, offsets = [%arg0, %arg1], sizes = [%5, %8], strides = [1, 1] : tensor<?x?xf32> -> !flow.dispatch.tensor<writeonly:tensor<999x3999xf32>>
}
}
return
}
}
and this is the IR after:
// -----// IR Dump Before CleanupBufferAllocView (iree-codegen-cleanup-buffer-alloc-view) //----- //
func.func @fill_matmul_static_dispatch_3() {
%cst = arith.constant 0.000000e+00 : f32
%c256 = arith.constant 256 : index
%c128 = arith.constant 128 : index
%c3999 = arith.constant 3999 : index
%c999 = arith.constant 999 : index
%c41287680 = arith.constant 41287680 : index
%c0 = arith.constant 0 : index
%0 = hal.interface.binding.subspan set(0) binding(0) type(storage_buffer) alignment(64) offset(%c41287680) flags(ReadOnly) : memref<8x16x128x256xf32, #hal.descriptor_type<storage_buffer>>
memref.assume_alignment %0, 64 : memref<8x16x128x256xf32, #hal.descriptor_type<storage_buffer>>
%1 = hal.interface.binding.subspan set(0) binding(1) type(storage_buffer) alignment(64) offset(%c0) flags(ReadOnly) : memref<999x3999xf32, #hal.descriptor_type<storage_buffer>>
memref.assume_alignment %1, 64 : memref<999x3999xf32, #hal.descriptor_type<storage_buffer>>
%2 = hal.interface.binding.subspan set(0) binding(2) type(storage_buffer) alignment(64) offset(%c0) : memref<999x3999xf32, #hal.descriptor_type<storage_buffer>>
memref.assume_alignment %2, 64 : memref<999x3999xf32, #hal.descriptor_type<storage_buffer>>
%workgroup_id_x = hal.interface.workgroup.id[0] : index
%workgroup_count_x = hal.interface.workgroup.count[0] : index
%workgroup_id_y = hal.interface.workgroup.id[1] : index
%workgroup_count_y = hal.interface.workgroup.count[1] : index
%3 = affine.apply affine_map<()[s0] -> (s0 * 128)>()[%workgroup_id_y]
%4 = affine.apply affine_map<()[s0] -> (s0 * 128)>()[%workgroup_count_y]
scf.for %arg0 = %3 to %c999 step %4 {
%5 = affine.min affine_map<(d0) -> (-d0 + 999, 128)>(%arg0)
%6 = affine.apply affine_map<()[s0] -> (s0 * 256)>()[%workgroup_id_x]
%7 = affine.apply affine_map<()[s0] -> (s0 * 256)>()[%workgroup_count_x]
scf.for %arg1 = %6 to %c3999 step %7 {
%8 = affine.min affine_map<(d0) -> (-d0 + 3999, 256)>(%arg1)
%9 = affine.apply affine_map<(d0) -> (d0 floordiv 128)>(%arg0)
%10 = affine.apply affine_map<(d0) -> (d0 floordiv 256)>(%arg1)
%subview = memref.subview %0[%9, %10, 0, 0] [1, 1, 128, 256] [1, 1, 1, 1] : memref<8x16x128x256xf32, #hal.descriptor_type<storage_buffer>> to memref<1x1x128x256xf32, strided<[524288, 32768, 256, 1], offset: ?>, #hal.descriptor_type<storage_buffer>>
%subview_0 = memref.subview %1[%arg0, %arg1] [%5, %8] [1, 1] : memref<999x3999xf32, #hal.descriptor_type<storage_buffer>> to memref<?x?xf32, strided<[3999, 1], offset: ?>, #hal.descriptor_type<storage_buffer>>
%alloca = memref.alloca(%5, %8) {alignment = 64 : i64} : memref<?x?xf32, #hal.descriptor_type<storage_buffer>>
linalg.generic {indexing_maps = [affine_map<(d0, d1) -> (d0, d1)>, affine_map<(d0, d1) -> (d0, d1)>], iterator_types = ["parallel", "parallel"]} ins(%subview_0 : memref<?x?xf32, strided<[3999, 1], offset: ?>, #hal.descriptor_type<storage_buffer>>) outs(%alloca : memref<?x?xf32, #hal.descriptor_type<storage_buffer>>) {
^bb0(%in: f32, %out: f32):
linalg.yield %in : f32
}
scf.for %arg2 = %c0 to %5 step %c128 {
%11 = affine.min affine_map<(d0)[s0] -> (-d0 + s0, 128)>(%arg2)[%5]
scf.for %arg3 = %c0 to %8 step %c256 {
%12 = affine.min affine_map<(d0)[s0] -> (-d0 + s0, 256)>(%arg3)[%8]
%13 = affine.apply affine_map<(d0) -> (d0 floordiv 128)>(%arg2)
%14 = affine.apply affine_map<(d0) -> (d0 floordiv 256)>(%arg3)
%subview_2 = memref.subview %alloca[%arg2, %arg3] [%11, %12] [1, 1] : memref<?x?xf32, #hal.descriptor_type<storage_buffer>> to memref<?x?xf32, strided<[?, 1], offset: ?>, #hal.descriptor_type<storage_buffer>>
%15 = vector.transfer_read %subview[%13, %14, %c0, %c0], %cst {in_bounds = [true, true]} : memref<1x1x128x256xf32, strided<[524288, 32768, 256, 1], offset: ?>, #hal.descriptor_type<storage_buffer>>, vector<128x256xf32>
vector.transfer_write %15, %subview_2[%c0, %c0] : vector<128x256xf32>, memref<?x?xf32, strided<[?, 1], offset: ?>, #hal.descriptor_type<storage_buffer>>
}
}
%subview_1 = memref.subview %2[%arg0, %arg1] [%5, %8] [1, 1] : memref<999x3999xf32, #hal.descriptor_type<storage_buffer>> to memref<?x?xf32, strided<[3999, 1], offset: ?>, #hal.descriptor_type<storage_buffer>>
linalg.generic {indexing_maps = [affine_map<(d0, d1) -> (d0, d1)>, affine_map<(d0, d1) -> (d0, d1)>], iterator_types = ["parallel", "parallel"]} ins(%alloca : memref<?x?xf32, #hal.descriptor_type<storage_buffer>>) outs(%subview_1 : memref<?x?xf32, strided<[3999, 1], offset: ?>, #hal.descriptor_type<storage_buffer>>) {
^bb0(%in: f32, %out: f32):
linalg.yield %in : f32
}
}
}
return
}
So, something went off the rails in bufferization.
Oh, and `--iree-llvmcpu-fail-on-out-of-bounds-stack-allocation=false` avoids the compilation errors. It compiles, but takes a really long time because the vectors being used are huge (16x128x256, if I am reading this right). The compilation errors are there to flag things that might be going off the rails, so removing those checks puts you on a bad compilation path.
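For a rough sense of scale, the `memref.alloca` and `vector.transfer` ops in the bufferized IR above work on 128x256 f32 tiles; a back-of-the-envelope sketch of the per-tile footprint:

```python
# Back-of-the-envelope for the out-of-bounds stack allocation flagged
# above: the bufferized IR moves vector<128x256xf32> values and allocas
# a tile of the same footprint, far beyond a typical per-thread stack
# budget.
tile_elems = 128 * 256          # elements per tile in the IR above
tile_bytes = tile_elems * 4     # f32 is 4 bytes
print(tile_bytes)               # 131072, i.e. 128 KiB per tile
assert tile_bytes == 128 * 1024
```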
The `scf.for` op should not return tensor types. We usually run the `hoistRedundantVectorTransfers` and `hoistRedundantVectorTransfersOnTensor` methods to get rid of the tensor types. I tried that, but it doesn't work; the util can't handle this case at the moment. The dests of the `vector.transfer_write`, `tensor.extract_slice`, and `tensor.insert_slice` ops all revolve around the iter_arg `%arg5`. Can we enhance the hoisting utils for this case?
%9 = affine.apply affine_map<(d0) -> (d0 floordiv 128)>(%arg0)
%10 = affine.apply affine_map<(d0) -> (d0 floordiv 256)>(%arg1)
%11 = flow.dispatch.tensor.load %0, offsets = [%9, %10, 0, 0], sizes = [1, 1, 128, 256], strides = [1, 1, 1, 1] : !flow.dispatch.tensor<readonly:tensor<8x16x128x256xf32>> -> tensor<1x1x128x256xf32>
%12 = flow.dispatch.tensor.load %1, offsets = [%arg0, %arg1], sizes = [%5, %8], strides = [1, 1] : !flow.dispatch.tensor<readonly:tensor<999x3999xf32>> -> tensor<?x?xf32>
%13 = scf.for %arg2 = %c0 to %5 step %c128 iter_args(%arg3 = %12) -> (tensor<?x?xf32>) {
%14 = affine.min affine_map<(d0)[s0] -> (-d0 + s0, 128)>(%arg2)[%5]
%15 = scf.for %arg4 = %c0 to %8 step %c256 iter_args(%arg5 = %arg3) -> (tensor<?x?xf32>) {
%16 = affine.min affine_map<(d0)[s0] -> (-d0 + s0, 256)>(%arg4)[%8]
%17 = affine.apply affine_map<(d0) -> (d0 floordiv 128)>(%arg2)
%18 = affine.apply affine_map<(d0) -> (d0 floordiv 256)>(%arg4)
%extracted_slice = tensor.extract_slice %arg5[%arg2, %arg4] [%14, %16] [1, 1] : tensor<?x?xf32> to tensor<?x?xf32>
%19 = vector.transfer_read %11[%17, %18, %c0, %c0], %cst {in_bounds = [true, true]} : tensor<1x1x128x256xf32>, vector<128x256xf32>
%extracted_slice_0 = tensor.extract_slice %extracted_slice[0, 0] [%14, %16] [1, 1] : tensor<?x?xf32> to tensor<?x?xf32>
%20 = vector.transfer_write %19, %extracted_slice_0[%c0, %c0] : vector<128x256xf32>, tensor<?x?xf32>
%inserted_slice = tensor.insert_slice %20 into %extracted_slice[0, 0] [%14, %16] [1, 1] : tensor<?x?xf32> into tensor<?x?xf32>
%inserted_slice_1 = tensor.insert_slice %inserted_slice into %arg5[%arg2, %arg4] [%14, %16] [1, 1] : tensor<?x?xf32> into tensor<?x?xf32>
scf.yield %inserted_slice_1 : tensor<?x?xf32>
}
scf.yield %15 : tensor<?x?xf32>
}
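As a scalar analogy (hypothetical Python, not the MLIR utils), the enhancement being asked for is the classic read-once/write-once hoisting of a loop-carried slice:

```python
# Hypothetical Python analogy (not the MLIR utils) of the hoisting being
# requested: when every iteration of a loop reads and writes the same
# slice of its iter_arg, the read can be hoisted before the loop and the
# write sunk after it, leaving the loop body to work on a local value.

def unhoisted(dest, tiles):
    # Models the per-iteration extract_slice / transfer_write /
    # insert_slice chain around the iter_arg.
    for t in tiles:
        slab = dest[0:2]                # extract_slice
        slab = [x + t for x in slab]    # compute on the slice
        dest = slab + dest[2:]          # insert_slice back
    return dest

def hoisted(dest, tiles):
    # Read once, iterate on the local value, write once.
    slab = dest[0:2]
    for t in tiles:
        slab = [x + t for x in slab]
    return slab + dest[2:]

assert unhoisted([0, 0, 0, 0], [1, 2]) == hoisted([0, 0, 0, 0], [1, 2])
```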
Ping on the P0 bug?
Trying to narrow down the scope of this bug and boil it down to some concrete action items. There are broadly two options for converting a matmul to a shape that is better for codegen: the pack/unpack path and the pad/slice path.
With both of these options, the end state we should be looking for is to have four dispatches.
To achieve this for the pack/unpack path on the small repro that was added here https://github.com/openxla/iree/issues/12824#issuecomment-1489944071 we need to add a folding pattern.
Task 1: Add a folding pattern for
%1 = linalg.fill ins(%cst : f32) outs(%0 : tensor<999x3999xf32>) -> tensor<999x3999xf32>
%pack_4 = tensor.pack %1 padding_value(%cst_3 : f32) inner_dims_pos = [0, 1] inner_tiles = [128, 256] into %4 : tensor<999x3999xf32> -> tensor<8x16x128x256xf32>
to
%fill_folded = linalg.fill ins(%cst : f32) outs(%4 : tensor<8x16x128x256xf32>) -> tensor<8x16x128x256xf32>
Even without this we will get four dispatches, but you will have a fused `fill` -> `pack` dispatch, which we could fold after the fact as well; without the folding, though, the code generation gets unnecessarily complicated.
To achieve four dispatches on the pad/slice path, we need to make sure that the `tensor.pad` operation is not generalized into `linalg.fill` -> `tensor.insert_slice`. This is WIP since not all backends handle `tensor.pad` properly, but there is a flag, `--iree-flow-enable-pad-handling`, that prevents this generalization at the Flow level for now, making this a codegen issue.
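For reference, the generalization being suppressed here can be modeled in numpy: `tensor.pad` is equivalent to filling the padded shape and then inserting the original data (illustrative shapes, not from the repro):

```python
import numpy as np

src = np.arange(6, dtype=np.float32).reshape(2, 3)
pad_value = 0.0

# tensor.pad form: pad src with high padding [2, 2] to a 4x5 result.
padded = np.pad(src, ((0, 2), (0, 2)), constant_values=pad_value)

# Generalized form: linalg.fill of the full 4x5 shape, then
# tensor.insert_slice of the original data at offset [0, 0].
filled = np.full((4, 5), pad_value, dtype=np.float32)
filled[0:2, 0:3] = src

assert np.array_equal(padded, filled)
```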
The next steps then are to look at having "good" code generation for `pack` or `pad`.
Task 2: Interaction between `pack` and conversion to `pad`.
For the `pack` operation, if only some dimensions are packed and others are just padded, handling this requires some care. The `pack` operations are meant to get the data laid out such that each "tile" is the working set of the core compute operation. If only the `M` and `N` dimensions are packed and `K` is just padded, then before tiling of the `K` dimension the `pack` operation needs to be converted into a `pad`-like representation. Tiling the inner dimensions of a `pack` operation goes against the design of the `pack` operation AFAICS.
Thanks @MaheshRavishankar, let's break this down further, as I seem to see some conflation here.

> There are broadly two options for converting a matmul to a shape that is better for the codegen.

Sure, but this is not the problem that this issue refers to. The exercise here is not a perf breakdown but solving a bug; how this bug was surfaced is orthogonal IMO.

The issue is: IREE receives the IR above and crashes in `IREE::LinalgExt::TilingInterfaceBaseTilingPattern`. My understanding from offline discussion is that the issue happens in the GPU backend but that the same IR goes through the CPU backend fine.

So it seems the bug reduces to:
- How does the GPU backend use `TilingInterfaceBaseTilingPattern` on `pack`/`unpack` ops differently than the CPU backend?
- Why does `IREE::LinalgExt::TilingInterfaceBaseTilingPattern` crash, and can we make it return an actionable error message instead of a stack trace?

Now for the specific points looking ahead:
> Add a folding pattern for ...
Yes, this is a good followup once the crash is fixed, this will remove one dispatch once we can run e2e and measure.
> The next steps then are to look at having "good" code-generation for pack or pad.
>
> If only the M and N dimensions are packed, and K is just padded, then before tiling of the K dimension, the pack operation needs to be converted into a pad like representation.
Yes, a reasonable codegen strategy for pack and pad seems to be lacking on the GPU side; there are many low-hanging-fruit opportunities. One of the avenues we have indeed been looking at is folding trivial packs into pads, which captures one of the use cases we're looking at for GPU.
> Tiling the inner dimensions of a pack operation goes against the design of the pack operation AFAICS.
This deserves another full thread.
> So it seems the bug reduces to:
> - how does the GPU backend use `TilingInterfaceBaseTilingPattern` on `pack`/`unpack` ops differently than on the CPU backend?
> - why does `IREE::LinalgExt::TilingInterfaceBaseTilingPattern` crash, and can we make it return an actionable error message instead of a stack trace?
This part is confusing to me. I don't see any uses of `IREE::LinalgExt::TilingInterfaceBaseTilingPattern` on the CPU backend. That is really legacy and should not be used. AFAIK, the CPU side doesn't use it, and that might be the difference.
From GPU sync: Nicolas out for 2 weeks, discuss priority offline.
@MaheshRavishankar @mattwalsh We should discuss priority and resourcing for this issue during GPU sync.
From meeting today: holding this open for Nicolas when he returns to work, not a blocking item.
What happened?
I have been iterating on graph-level transformations and added this repro to iree-samples. This is dependent on #12822.

Running `iree-compile --iree-hal-target-backends=cuda transform_dialect/graph/cuda/bugs/1-ir.mlir` crashes with: