nicolasvasilache opened 1 year ago
With the latest integrate, this is hiding the error I used to have, which was:
// IREE fails lowering of tensor.pack/unpack ops with:
// <stdin>:23295:20: error: 'tensor.extract_slice' op unhandled operation when
// converting to destination passing style
// %unpack_3675 = tensor.unpack %7830 inner_dims_pos = [0, 1]
// inner_tiles = [384, 16] into %7826
// : tensor<1x1x384x16xf32> -> tensor<384x2xf32>
I expect that fixing the crash will then surface the error I am expecting. I'll file another bug once confirmed.
I'll wait for #12822 to land before looking further.
This is a simpler repro
Looking at the CPU side, I think there is a missing folder here that would help.
%1 = linalg.fill ins(%cst : f32) outs(%0 : tensor<999x3999xf32>) -> tensor<999x3999xf32>
%pack_4 = tensor.pack %1 padding_value(%cst_3 : f32) inner_dims_pos = [0, 1] inner_tiles = [128, 256] into %4 : tensor<999x3999xf32> -> tensor<8x16x128x256xf32>
This could just be folded into
%fill_folded = linalg.fill ins(%cst : f32) outs(%4 : tensor<8x16x128x256xf32>) -> tensor<8x16x128x256xf32>
Then a lot of the backend code generation works out better. Basically (at least on the CPU side; I haven't looked at the GPU side yet, but it is probably similar), the `fill` -> `pack` code generation using tile + fuse with bufferization results in a stack allocation. We can try to address that, but a folder at the graph level makes all of these problems go away. I tried your example with those changes applied manually, and most of the code generation works as expected (see the dump-ir-after-all here: https://gist.github.com/MaheshRavishankar/115fcb30c47135c53a16355ea27387d8).
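Since this fold comes up repeatedly below, here is a quick numpy model (not IREE code; `pack_2d` is a hypothetical helper modeling `tensor.pack` semantics) checking that the fold is value-correct for the shapes in the repro. Note the fold is only valid when the pack's padding value equals the fill value; otherwise the padded region would differ.

```python
import numpy as np

def pack_2d(src, tile0, tile1, pad_value):
    """Reference semantics of tensor.pack on a 2-D tensor with
    inner_dims_pos = [0, 1]: pad up to tile multiples, then reshape and
    transpose so each (i, j) index selects a tile0 x tile1 tile."""
    d0, d1 = src.shape
    n0, n1 = -(-d0 // tile0), -(-d1 // tile1)  # ceil-div tile counts
    padded = np.full((n0 * tile0, n1 * tile1), pad_value, dtype=src.dtype)
    padded[:d0, :d1] = src
    return padded.reshape(n0, tile0, n1, tile1).transpose(0, 2, 1, 3)

cst = 1.5  # any fill value; must match the pack's padding_value
fill = np.full((999, 3999), cst, dtype=np.float32)    # linalg.fill result
packed_fill = pack_2d(fill, 128, 256, pad_value=cst)  # tensor.pack of it
folded = np.full((8, 16, 128, 256), cst, dtype=np.float32)  # folded fill
assert packed_fill.shape == folded.shape
assert np.array_equal(packed_fill, folded)
```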
Some points of interest. After dispatch region formation, this is the IR:
// -----// IR Dump After FormDispatchWorkgroups (iree-flow-form-dispatch-workgroups) //----- //
func.func @fill_matmul_static(%arg0: !hal.buffer_view, %arg1: !hal.buffer_view, %arg2: !hal.buffer_view) -> !hal.buffer_view attributes {iree.abi.stub} {
%c8 = arith.constant 8 : index
%c63 = arith.constant 63 : index
%c16 = arith.constant 16 : index
%c1 = arith.constant 1 : index
%c128 = arith.constant 128 : index
%c256 = arith.constant 256 : index
%c3999 = arith.constant 3999 : index
%c999 = arith.constant 999 : index
%0 = hal.tensor.import %arg0 "input 0" : !hal.buffer_view -> tensor<999x1999xf32>
%1 = hal.tensor.import %arg1 "input 1" : !hal.buffer_view -> tensor<1999x3999xf32>
%2 = hal.tensor.import %arg2 "input 2" : !hal.buffer_view -> tensor<999x3999xf32>
%3 = flow.dispatch.workgroups[%c8, %c63](%0) : (tensor<999x1999xf32>) -> tensor<8x63x128x32xf32> =
(%arg3: !flow.dispatch.tensor<readonly:tensor<999x1999xf32>>, %arg4: !flow.dispatch.tensor<writeonly:tensor<8x63x128x32xf32>>) {
%cst = arith.constant 0.000000e+00 : f32
%8 = flow.dispatch.tensor.load %arg3, offsets = [0, 0], sizes = [999, 1999], strides = [1, 1] : !flow.dispatch.tensor<readonly:tensor<999x1999xf32>> -> tensor<999x1999xf32>
%9 = tensor.empty() : tensor<8x63x128x32xf32>
%pack = tensor.pack %8 padding_value(%cst : f32) inner_dims_pos = [0, 1] inner_tiles = [128, 32] into %9 : tensor<999x1999xf32> -> tensor<8x63x128x32xf32>
flow.dispatch.tensor.store %pack, %arg4, offsets = [0, 0, 0, 0], sizes = [8, 63, 128, 32], strides = [1, 1, 1, 1] : tensor<8x63x128x32xf32> -> !flow.dispatch.tensor<writeonly:tensor<8x63x128x32xf32>>
flow.return
} count(%arg3: index, %arg4: index) -> (index, index, index) {
%x, %y, %z = flow.dispatch.workgroup_count_from_dag_root %arg3, %arg4
flow.return %x, %y, %z : index, index, index
}
%4 = flow.dispatch.workgroups[%c63, %c16](%1) : (tensor<1999x3999xf32>) -> tensor<63x16x256x32xf32> =
(%arg3: !flow.dispatch.tensor<readonly:tensor<1999x3999xf32>>, %arg4: !flow.dispatch.tensor<writeonly:tensor<63x16x256x32xf32>>) {
%cst = arith.constant 0.000000e+00 : f32
%8 = flow.dispatch.tensor.load %arg3, offsets = [0, 0], sizes = [1999, 3999], strides = [1, 1] : !flow.dispatch.tensor<readonly:tensor<1999x3999xf32>> -> tensor<1999x3999xf32>
%9 = tensor.empty() : tensor<63x16x256x32xf32>
%pack = tensor.pack %8 padding_value(%cst : f32) inner_dims_pos = [1, 0] inner_tiles = [256, 32] into %9 : tensor<1999x3999xf32> -> tensor<63x16x256x32xf32>
flow.dispatch.tensor.store %pack, %arg4, offsets = [0, 0, 0, 0], sizes = [63, 16, 256, 32], strides = [1, 1, 1, 1] : tensor<63x16x256x32xf32> -> !flow.dispatch.tensor<writeonly:tensor<63x16x256x32xf32>>
flow.return
} count(%arg3: index, %arg4: index) -> (index, index, index) {
%x, %y, %z = flow.dispatch.workgroup_count_from_dag_root %arg3, %arg4
flow.return %x, %y, %z : index, index, index
}
%5 = flow.dispatch.workgroups[%c8, %c16, %c128, %c256, %c1, %c1](%3, %4) : (tensor<8x63x128x32xf32>, tensor<63x16x256x32xf32>) -> tensor<8x16x128x256xf32> =
(%arg3: !flow.dispatch.tensor<readonly:tensor<8x63x128x32xf32>>, %arg4: !flow.dispatch.tensor<readonly:tensor<63x16x256x32xf32>>, %arg5: !flow.dispatch.tensor<writeonly:tensor<8x16x128x256xf32>>) {
%cst = arith.constant 0.000000e+00 : f32
%8 = flow.dispatch.tensor.load %arg3, offsets = [0, 0, 0, 0], sizes = [8, 63, 128, 32], strides = [1, 1, 1, 1] : !flow.dispatch.tensor<readonly:tensor<8x63x128x32xf32>> -> tensor<8x63x128x32xf32>
%9 = flow.dispatch.tensor.load %arg4, offsets = [0, 0, 0, 0], sizes = [63, 16, 256, 32], strides = [1, 1, 1, 1] : !flow.dispatch.tensor<readonly:tensor<63x16x256x32xf32>> -> tensor<63x16x256x32xf32>
%10 = tensor.empty() : tensor<8x16x128x256xf32>
%11 = linalg.fill ins(%cst : f32) outs(%10 : tensor<8x16x128x256xf32>) -> tensor<8x16x128x256xf32>
%12 = linalg.generic {indexing_maps = [affine_map<(d0, d1, d2, d3, d4, d5) -> (d0, d4, d2, d5)>, affine_map<(d0, d1, d2, d3, d4, d5) -> (d4, d1, d3, d5)>, affine_map<(d0, d1, d2, d3, d4, d5) -> (d0, d1, d2, d3)>], iterator_types = ["parallel", "parallel", "parallel", "parallel", "reduction", "reduction"]} ins(%8, %9 : tensor<8x63x128x32xf32>, tensor<63x16x256x32xf32>) outs(%11 : tensor<8x16x128x256xf32>) {
^bb0(%in: f32, %in_0: f32, %out: f32):
%13 = arith.mulf %in, %in_0 : f32
%14 = arith.addf %out, %13 : f32
linalg.yield %14 : f32
} -> tensor<8x16x128x256xf32>
flow.dispatch.tensor.store %12, %arg5, offsets = [0, 0, 0, 0], sizes = [8, 16, 128, 256], strides = [1, 1, 1, 1] : tensor<8x16x128x256xf32> -> !flow.dispatch.tensor<writeonly:tensor<8x16x128x256xf32>>
flow.return
} count(%arg3: index, %arg4: index, %arg5: index, %arg6: index, %arg7: index, %arg8: index) -> (index, index, index) {
%x, %y, %z = flow.dispatch.workgroup_count_from_dag_root %arg3, %arg4, %arg5, %arg6, %arg7, %arg8
flow.return %x, %y, %z : index, index, index
}
%6 = flow.dispatch.workgroups[%c999, %c3999](%5, %2) : (tensor<8x16x128x256xf32>, tensor<999x3999xf32>) -> tensor<999x3999xf32> =
(%arg3: !flow.dispatch.tensor<readonly:tensor<8x16x128x256xf32>>, %arg4: !flow.dispatch.tensor<readonly:tensor<999x3999xf32>>, %arg5: !flow.dispatch.tensor<writeonly:tensor<999x3999xf32>>) {
%8 = flow.dispatch.tensor.load %arg3, offsets = [0, 0, 0, 0], sizes = [8, 16, 128, 256], strides = [1, 1, 1, 1] : !flow.dispatch.tensor<readonly:tensor<8x16x128x256xf32>> -> tensor<8x16x128x256xf32>
%9 = flow.dispatch.tensor.load %arg4, offsets = [0, 0], sizes = [999, 3999], strides = [1, 1] : !flow.dispatch.tensor<readonly:tensor<999x3999xf32>> -> tensor<999x3999xf32>
%unpack = tensor.unpack %8 inner_dims_pos = [0, 1] inner_tiles = [128, 256] into %9 : tensor<8x16x128x256xf32> -> tensor<999x3999xf32>
flow.dispatch.tensor.store %unpack, %arg5, offsets = [0, 0], sizes = [999, 3999], strides = [1, 1] : tensor<999x3999xf32> -> !flow.dispatch.tensor<writeonly:tensor<999x3999xf32>>
flow.return
} count(%arg3: index, %arg4: index) -> (index, index, index) {
%x, %y, %z = flow.dispatch.workgroup_count_from_dag_root %arg3, %arg4
flow.return %x, %y, %z : index, index, index
}
%7 = hal.tensor.export %6 "output 0" : tensor<999x3999xf32> -> !hal.buffer_view
return %7 : !hal.buffer_view
}
That looks good.
The issue seems to be with the unpack dispatch; something very strange is happening with bufferization. This is the IR for the unpack dispatch before bufferization:
module {
func.func @fill_matmul_static_dispatch_3() {
%cst = arith.constant 0.000000e+00 : f32
%c256 = arith.constant 256 : index
%c128 = arith.constant 128 : index
%c3999 = arith.constant 3999 : index
%c999 = arith.constant 999 : index
%c41287680 = arith.constant 41287680 : index
%c0 = arith.constant 0 : index
%0 = hal.interface.binding.subspan set(0) binding(0) type(storage_buffer) alignment(64) offset(%c41287680) flags(ReadOnly) : !flow.dispatch.tensor<readonly:tensor<8x16x128x256xf32>>
%1 = hal.interface.binding.subspan set(0) binding(1) type(storage_buffer) alignment(64) offset(%c0) flags(ReadOnly) : !flow.dispatch.tensor<readonly:tensor<999x3999xf32>>
%2 = hal.interface.binding.subspan set(0) binding(2) type(storage_buffer) alignment(64) offset(%c0) : !flow.dispatch.tensor<writeonly:tensor<999x3999xf32>>
%workgroup_id_x = hal.interface.workgroup.id[0] : index
%workgroup_count_x = hal.interface.workgroup.count[0] : index
%workgroup_id_y = hal.interface.workgroup.id[1] : index
%workgroup_count_y = hal.interface.workgroup.count[1] : index
%3 = affine.apply affine_map<()[s0] -> (s0 * 128)>()[%workgroup_id_y]
%4 = affine.apply affine_map<()[s0] -> (s0 * 128)>()[%workgroup_count_y]
scf.for %arg0 = %3 to %c999 step %4 {
%5 = affine.min affine_map<(d0) -> (-d0 + 999, 128)>(%arg0)
%6 = affine.apply affine_map<()[s0] -> (s0 * 256)>()[%workgroup_id_x]
%7 = affine.apply affine_map<()[s0] -> (s0 * 256)>()[%workgroup_count_x]
scf.for %arg1 = %6 to %c3999 step %7 {
%8 = affine.min affine_map<(d0) -> (-d0 + 3999, 256)>(%arg1)
%9 = affine.apply affine_map<(d0) -> (d0 floordiv 128)>(%arg0)
%10 = affine.apply affine_map<(d0) -> (d0 floordiv 256)>(%arg1)
%11 = flow.dispatch.tensor.load %0, offsets = [%9, %10, 0, 0], sizes = [1, 1, 128, 256], strides = [1, 1, 1, 1] : !flow.dispatch.tensor<readonly:tensor<8x16x128x256xf32>> -> tensor<1x1x128x256xf32>
%12 = flow.dispatch.tensor.load %1, offsets = [%arg0, %arg1], sizes = [%5, %8], strides = [1, 1] : !flow.dispatch.tensor<readonly:tensor<999x3999xf32>> -> tensor<?x?xf32>
%13 = scf.for %arg2 = %c0 to %5 step %c128 iter_args(%arg3 = %12) -> (tensor<?x?xf32>) {
%14 = affine.min affine_map<(d0)[s0] -> (-d0 + s0, 128)>(%arg2)[%5]
%15 = scf.for %arg4 = %c0 to %8 step %c256 iter_args(%arg5 = %arg3) -> (tensor<?x?xf32>) {
%16 = affine.min affine_map<(d0)[s0] -> (-d0 + s0, 256)>(%arg4)[%8]
%17 = affine.apply affine_map<(d0) -> (d0 floordiv 128)>(%arg2)
%18 = affine.apply affine_map<(d0) -> (d0 floordiv 256)>(%arg4)
%extracted_slice = tensor.extract_slice %arg5[%arg2, %arg4] [%14, %16] [1, 1] : tensor<?x?xf32> to tensor<?x?xf32>
%19 = vector.transfer_read %11[%17, %18, %c0, %c0], %cst {in_bounds = [true, true]} : tensor<1x1x128x256xf32>, vector<128x256xf32>
%extracted_slice_0 = tensor.extract_slice %extracted_slice[0, 0] [%14, %16] [1, 1] : tensor<?x?xf32> to tensor<?x?xf32>
%20 = vector.transfer_write %19, %extracted_slice_0[%c0, %c0] : vector<128x256xf32>, tensor<?x?xf32>
%inserted_slice = tensor.insert_slice %20 into %extracted_slice[0, 0] [%14, %16] [1, 1] : tensor<?x?xf32> into tensor<?x?xf32>
%inserted_slice_1 = tensor.insert_slice %inserted_slice into %arg5[%arg2, %arg4] [%14, %16] [1, 1] : tensor<?x?xf32> into tensor<?x?xf32>
scf.yield %inserted_slice_1 : tensor<?x?xf32>
}
scf.yield %15 : tensor<?x?xf32>
}
flow.dispatch.tensor.store %13, %2, offsets = [%arg0, %arg1], sizes = [%5, %8], strides = [1, 1] : tensor<?x?xf32> -> !flow.dispatch.tensor<writeonly:tensor<999x3999xf32>>
}
}
return
}
}
and this is the IR after:
// -----// IR Dump Before CleanupBufferAllocView (iree-codegen-cleanup-buffer-alloc-view) //----- //
func.func @fill_matmul_static_dispatch_3() {
%cst = arith.constant 0.000000e+00 : f32
%c256 = arith.constant 256 : index
%c128 = arith.constant 128 : index
%c3999 = arith.constant 3999 : index
%c999 = arith.constant 999 : index
%c41287680 = arith.constant 41287680 : index
%c0 = arith.constant 0 : index
%0 = hal.interface.binding.subspan set(0) binding(0) type(storage_buffer) alignment(64) offset(%c41287680) flags(ReadOnly) : memref<8x16x128x256xf32, #hal.descriptor_type<storage_buffer>>
memref.assume_alignment %0, 64 : memref<8x16x128x256xf32, #hal.descriptor_type<storage_buffer>>
%1 = hal.interface.binding.subspan set(0) binding(1) type(storage_buffer) alignment(64) offset(%c0) flags(ReadOnly) : memref<999x3999xf32, #hal.descriptor_type<storage_buffer>>
memref.assume_alignment %1, 64 : memref<999x3999xf32, #hal.descriptor_type<storage_buffer>>
%2 = hal.interface.binding.subspan set(0) binding(2) type(storage_buffer) alignment(64) offset(%c0) : memref<999x3999xf32, #hal.descriptor_type<storage_buffer>>
memref.assume_alignment %2, 64 : memref<999x3999xf32, #hal.descriptor_type<storage_buffer>>
%workgroup_id_x = hal.interface.workgroup.id[0] : index
%workgroup_count_x = hal.interface.workgroup.count[0] : index
%workgroup_id_y = hal.interface.workgroup.id[1] : index
%workgroup_count_y = hal.interface.workgroup.count[1] : index
%3 = affine.apply affine_map<()[s0] -> (s0 * 128)>()[%workgroup_id_y]
%4 = affine.apply affine_map<()[s0] -> (s0 * 128)>()[%workgroup_count_y]
scf.for %arg0 = %3 to %c999 step %4 {
%5 = affine.min affine_map<(d0) -> (-d0 + 999, 128)>(%arg0)
%6 = affine.apply affine_map<()[s0] -> (s0 * 256)>()[%workgroup_id_x]
%7 = affine.apply affine_map<()[s0] -> (s0 * 256)>()[%workgroup_count_x]
scf.for %arg1 = %6 to %c3999 step %7 {
%8 = affine.min affine_map<(d0) -> (-d0 + 3999, 256)>(%arg1)
%9 = affine.apply affine_map<(d0) -> (d0 floordiv 128)>(%arg0)
%10 = affine.apply affine_map<(d0) -> (d0 floordiv 256)>(%arg1)
%subview = memref.subview %0[%9, %10, 0, 0] [1, 1, 128, 256] [1, 1, 1, 1] : memref<8x16x128x256xf32, #hal.descriptor_type<storage_buffer>> to memref<1x1x128x256xf32, strided<[524288, 32768, 256, 1], offset: ?>, #hal.descriptor_type<storage_buffer>>
%subview_0 = memref.subview %1[%arg0, %arg1] [%5, %8] [1, 1] : memref<999x3999xf32, #hal.descriptor_type<storage_buffer>> to memref<?x?xf32, strided<[3999, 1], offset: ?>, #hal.descriptor_type<storage_buffer>>
%alloca = memref.alloca(%5, %8) {alignment = 64 : i64} : memref<?x?xf32, #hal.descriptor_type<storage_buffer>>
linalg.generic {indexing_maps = [affine_map<(d0, d1) -> (d0, d1)>, affine_map<(d0, d1) -> (d0, d1)>], iterator_types = ["parallel", "parallel"]} ins(%subview_0 : memref<?x?xf32, strided<[3999, 1], offset: ?>, #hal.descriptor_type<storage_buffer>>) outs(%alloca : memref<?x?xf32, #hal.descriptor_type<storage_buffer>>) {
^bb0(%in: f32, %out: f32):
linalg.yield %in : f32
}
scf.for %arg2 = %c0 to %5 step %c128 {
%11 = affine.min affine_map<(d0)[s0] -> (-d0 + s0, 128)>(%arg2)[%5]
scf.for %arg3 = %c0 to %8 step %c256 {
%12 = affine.min affine_map<(d0)[s0] -> (-d0 + s0, 256)>(%arg3)[%8]
%13 = affine.apply affine_map<(d0) -> (d0 floordiv 128)>(%arg2)
%14 = affine.apply affine_map<(d0) -> (d0 floordiv 256)>(%arg3)
%subview_2 = memref.subview %alloca[%arg2, %arg3] [%11, %12] [1, 1] : memref<?x?xf32, #hal.descriptor_type<storage_buffer>> to memref<?x?xf32, strided<[?, 1], offset: ?>, #hal.descriptor_type<storage_buffer>>
%15 = vector.transfer_read %subview[%13, %14, %c0, %c0], %cst {in_bounds = [true, true]} : memref<1x1x128x256xf32, strided<[524288, 32768, 256, 1], offset: ?>, #hal.descriptor_type<storage_buffer>>, vector<128x256xf32>
vector.transfer_write %15, %subview_2[%c0, %c0] : vector<128x256xf32>, memref<?x?xf32, strided<[?, 1], offset: ?>, #hal.descriptor_type<storage_buffer>>
}
}
%subview_1 = memref.subview %2[%arg0, %arg1] [%5, %8] [1, 1] : memref<999x3999xf32, #hal.descriptor_type<storage_buffer>> to memref<?x?xf32, strided<[3999, 1], offset: ?>, #hal.descriptor_type<storage_buffer>>
linalg.generic {indexing_maps = [affine_map<(d0, d1) -> (d0, d1)>, affine_map<(d0, d1) -> (d0, d1)>], iterator_types = ["parallel", "parallel"]} ins(%alloca : memref<?x?xf32, #hal.descriptor_type<storage_buffer>>) outs(%subview_1 : memref<?x?xf32, strided<[3999, 1], offset: ?>, #hal.descriptor_type<storage_buffer>>) {
^bb0(%in: f32, %out: f32):
linalg.yield %in : f32
}
}
}
return
}
So, something went off the rails in bufferization.
Oh, and `--iree-llvmcpu-fail-on-out-of-bounds-stack-allocation=false` avoids the compilation errors. It compiles, but takes a really long time because the vectors being used are huge (16x128x256, if I am reading this right). The compilation errors are there to flag things that might be going off the rails, so removing those checks puts you on a bad compilation path.
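For a rough sense of scale, the `memref.alloca` and `vector.transfer` ops in the bufferized IR above work on 128x256 f32 tiles; a back-of-the-envelope sketch of the per-tile footprint:

```python
# Back-of-the-envelope for the out-of-bounds stack allocation flagged
# above: the bufferized IR moves vector<128x256xf32> values and allocas
# a tile of the same footprint, far beyond a typical per-thread stack
# budget.
tile_elems = 128 * 256          # elements per tile in the IR above
tile_bytes = tile_elems * 4     # f32 is 4 bytes
print(tile_bytes)               # 131072, i.e. 128 KiB per tile
assert tile_bytes == 128 * 1024
```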
The `scf.for` op should not return tensor types. We usually run the `hoistRedundantVectorTransfers` and `hoistRedundantVectorTransfersOnTensor` methods to get rid of the tensor types. I tried that, but it doesn't work; the util can't handle this case at the moment. The dests of the `vector.transfer_write`, `tensor.extract_slice`, and `tensor.insert_slice` ops all revolve around the iter_arg `%arg5`. Can we enhance the hoisting utils for this case?
%9 = affine.apply affine_map<(d0) -> (d0 floordiv 128)>(%arg0)
%10 = affine.apply affine_map<(d0) -> (d0 floordiv 256)>(%arg1)
%11 = flow.dispatch.tensor.load %0, offsets = [%9, %10, 0, 0], sizes = [1, 1, 128, 256], strides = [1, 1, 1, 1] : !flow.dispatch.tensor<readonly:tensor<8x16x128x256xf32>> -> tensor<1x1x128x256xf32>
%12 = flow.dispatch.tensor.load %1, offsets = [%arg0, %arg1], sizes = [%5, %8], strides = [1, 1] : !flow.dispatch.tensor<readonly:tensor<999x3999xf32>> -> tensor<?x?xf32>
%13 = scf.for %arg2 = %c0 to %5 step %c128 iter_args(%arg3 = %12) -> (tensor<?x?xf32>) {
%14 = affine.min affine_map<(d0)[s0] -> (-d0 + s0, 128)>(%arg2)[%5]
%15 = scf.for %arg4 = %c0 to %8 step %c256 iter_args(%arg5 = %arg3) -> (tensor<?x?xf32>) {
%16 = affine.min affine_map<(d0)[s0] -> (-d0 + s0, 256)>(%arg4)[%8]
%17 = affine.apply affine_map<(d0) -> (d0 floordiv 128)>(%arg2)
%18 = affine.apply affine_map<(d0) -> (d0 floordiv 256)>(%arg4)
%extracted_slice = tensor.extract_slice %arg5[%arg2, %arg4] [%14, %16] [1, 1] : tensor<?x?xf32> to tensor<?x?xf32>
%19 = vector.transfer_read %11[%17, %18, %c0, %c0], %cst {in_bounds = [true, true]} : tensor<1x1x128x256xf32>, vector<128x256xf32>
%extracted_slice_0 = tensor.extract_slice %extracted_slice[0, 0] [%14, %16] [1, 1] : tensor<?x?xf32> to tensor<?x?xf32>
%20 = vector.transfer_write %19, %extracted_slice_0[%c0, %c0] : vector<128x256xf32>, tensor<?x?xf32>
%inserted_slice = tensor.insert_slice %20 into %extracted_slice[0, 0] [%14, %16] [1, 1] : tensor<?x?xf32> into tensor<?x?xf32>
%inserted_slice_1 = tensor.insert_slice %inserted_slice into %arg5[%arg2, %arg4] [%14, %16] [1, 1] : tensor<?x?xf32> into tensor<?x?xf32>
scf.yield %inserted_slice_1 : tensor<?x?xf32>
}
scf.yield %15 : tensor<?x?xf32>
}
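As a scalar analogy (hypothetical Python, not the MLIR utils), the enhancement being asked for is the classic read-once/write-once hoisting of a loop-carried slice:

```python
# Hypothetical Python analogy (not the MLIR utils) of the hoisting being
# requested: when every iteration of a loop reads and writes the same
# slice of its iter_arg, the read can be hoisted before the loop and the
# write sunk after it, leaving the loop body to work on a local value.

def unhoisted(dest, tiles):
    # Models the per-iteration extract_slice / transfer_write /
    # insert_slice chain around the iter_arg.
    for t in tiles:
        slab = dest[0:2]                # extract_slice
        slab = [x + t for x in slab]    # compute on the slice
        dest = slab + dest[2:]          # insert_slice back
    return dest

def hoisted(dest, tiles):
    # Read once, iterate on the local value, write once.
    slab = dest[0:2]
    for t in tiles:
        slab = [x + t for x in slab]
    return slab + dest[2:]

assert unhoisted([0, 0, 0, 0], [1, 2]) == hoisted([0, 0, 0, 0], [1, 2])
```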
Ping on the P0 bug?
Trying to narrow down the scope of this bug and boil it down to some concrete action items. There are broadly two options for converting a matmul to a shape that is better for codegen: the pack/unpack path and the pad/slice path.
With both of these options, the end state we should be looking for is to have four dispatches.
To achieve this for the pack/unpack path on the small repro that was added here https://github.com/openxla/iree/issues/12824#issuecomment-1489944071 we need to add a folding pattern.
Task 1: Add a folding pattern for
%1 = linalg.fill ins(%cst : f32) outs(%0 : tensor<999x3999xf32>) -> tensor<999x3999xf32>
%pack_4 = tensor.pack %1 padding_value(%cst_3 : f32) inner_dims_pos = [0, 1] inner_tiles = [128, 256] into %4 : tensor<999x3999xf32> -> tensor<8x16x128x256xf32>
to
%fill_folded = linalg.fill ins(%cst : f32) outs(%4 : tensor<8x16x128x256xf32>) -> tensor<8x16x128x256xf32>
Even without this we will get four dispatches, but you will have a fused `fill` -> `pack` dispatch, which we could fold after the fact as well; without the folding, though, the code generation gets unnecessarily complicated.
To achieve four dispatches on the pad/slice path, we need to make sure that the `tensor.pad` operation is not generalized into `linalg.fill` -> `tensor.insert_slice`. This is WIP since not all backends handle `tensor.pad` properly, but there is a flag, `--iree-flow-enable-pad-handling`, that prevents this generalization at the Flow level for now, making this a codegen issue.
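For reference, the generalization being suppressed here can be modeled in numpy: `tensor.pad` is equivalent to filling the padded shape and then inserting the original data (illustrative shapes, not from the repro):

```python
import numpy as np

src = np.arange(6, dtype=np.float32).reshape(2, 3)
pad_value = 0.0

# tensor.pad form: pad src with high padding [2, 2] to a 4x5 result.
padded = np.pad(src, ((0, 2), (0, 2)), constant_values=pad_value)

# Generalized form: linalg.fill of the full 4x5 shape, then
# tensor.insert_slice of the original data at offset [0, 0].
filled = np.full((4, 5), pad_value, dtype=np.float32)
filled[0:2, 0:3] = src

assert np.array_equal(padded, filled)
```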
The next steps then are to look at having "good" code generation for `pack` or `pad`.
Task 2: Interaction between `pack` and conversion to `pad`.
For the `pack` operation, if only some dimensions are packed and others are just padded, handling this requires some care. The `pack` operations are meant to get the data laid out such that each "tile" is the working set of the core compute operation. If only the `M` and `N` dimensions are packed and `K` is just padded, then before tiling of the `K` dimension the `pack` operation needs to be converted into a `pad`-like representation. Tiling the inner dimensions of a `pack` operation goes against the design of the `pack` operation AFAICS.
Thanks @MaheshRavishankar, let's break this down further, as I seem to see some conflation here.

> There are broadly two options for converting a matmul to a shape that is better for the codegen.

Sure, but this is not the problem that this issue refers to. The exercise here is not a perf breakdown but solving a bug; how this bug was surfaced is orthogonal IMO.

The issue is: IREE receives the IR above and crashes in `IREE::LinalgExt::TilingInterfaceBaseTilingPattern`. My understanding from offline discussion is that the issue happens in the GPU backend but that the same IR goes through the CPU backend fine.

So it seems the bug reduces to:
- How does the GPU backend use `TilingInterfaceBaseTilingPattern` on `pack`/`unpack` ops differently than the CPU backend?
- Why does `IREE::LinalgExt::TilingInterfaceBaseTilingPattern` crash, and can we make it return an actionable error message instead of a stack trace?

Now for the specific points looking ahead:
> Add a folding pattern for ...
Yes, this is a good followup once the crash is fixed, this will remove one dispatch once we can run e2e and measure.
> The next steps then are to look at having "good" code-generation for pack or pad.
>
> If only the M and N dimensions are packed, and K is just padded, then before tiling of the K dimension, the pack operation needs to be converted into a pad like representation.
Yes, a reasonable codegen strategy for pack and pad seems to be lacking on the GPU side; there are many low-hanging-fruit opportunities. One of the avenues we have indeed been looking at is folding trivial packs into pads, which captures one of the use cases we're looking at for GPU.
> Tiling the inner dimensions of a pack operation goes against the design of the pack operation AFAICS.
This deserves another full thread.
> So it seems the bug reduces to:
> - how does the GPU backend use `TilingInterfaceBaseTilingPattern` on `pack`/`unpack` ops differently than on the CPU backend?
> - why does `IREE::LinalgExt::TilingInterfaceBaseTilingPattern` crash, and can we make it return an actionable error message instead of a stack trace?
This part is confusing to me. I don't see any uses of `IREE::LinalgExt::TilingInterfaceBaseTilingPattern` on the CPU backend. That is really legacy and should not be used. AFAIK, the CPU side doesn't use it, and that might be the difference.
From GPU sync: Nicolas out for 2 weeks, discuss priority offline.
@MaheshRavishankar @mattwalsh We should discuss priority and resourcing for this issue during GPU sync.
From meeting today: holding this open for Nicolas when he returns to work, not a blocking item.
What happened?
I have been iterating on graph-level transformations and added this repro to iree-samples. This is dependent on #12822.

Running `iree-compile --iree-hal-target-backends=cuda transform_dialect/graph/cuda/bugs/1-ir.mlir` crashes with: