This is an issue in an unpack + elementwise dispatch: the elementwise op gets tiled but the unpack does not. Here is the IR dump for the dispatch. @hanhanW any idea what needs to be done here?
There is a tensor.extract_slice created after distribution, which blocks further TileAndFuse, so it generates large vectors. Someone needs to check why the extract_slice is created and fix it.
If it is not fixable, we can try adding the FoldUnpackWithExtractSliceOp pattern. You can populate it from https://github.com/llvm/llvm-project/blob/428ae0f12e29eff1ddcaf59bdcce904ec056963e/mlir/lib/Dialect/Tensor/Transforms/PackAndUnpackPatterns.cpp#L484-L491
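In case it is useful, here is a minimal C++ sketch of wiring that pattern in, assuming the upstream helper tensor::populateFoldIntoPackAndUnpackPatterns (which registers FoldUnpackWithExtractSliceOp) is available; the function name runFoldUnpackExtractSlice and where it would sit in the pipeline are hypothetical, not IREE's actual pass plumbing:
#include "mlir/Dialect/Func/IR/FuncOps.h"
#include "mlir/Dialect/Tensor/Transforms/Transforms.h"
#include "mlir/IR/PatternMatch.h"
#include "mlir/Transforms/GreedyPatternRewriteDriver.h"

// Greedily folds tensor.unpack + tensor.extract_slice pairs (and the
// pack-side counterparts) inside a function.
static void runFoldUnpackExtractSlice(mlir::func::FuncOp funcOp) {
  mlir::RewritePatternSet patterns(funcOp.getContext());
  // Registers FoldUnpackWithExtractSliceOp from PackAndUnpackPatterns.cpp.
  mlir::tensor::populateFoldIntoPackAndUnpackPatterns(patterns);
  if (mlir::failed(
          mlir::applyPatternsAndFoldGreedily(funcOp, std::move(patterns))))
    funcOp.emitError("unpack/extract_slice folding did not converge");
}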
// -----// IR Dump After LowerExecutableUsingTransformDialectPass (iree-codegen-lower-executable-using-transform-dialect) //----- //
module {
func.func @torch_jit$async_dispatch_24_unpack_elementwise_1x1000_f32_dispatch_0_unpack_elementwise_1x1000_f32() attributes {translation_info = #iree_codegen.translation_info<CPUDoubleTilingExpert>} {
%cst = arith.constant 0.000000e+00 : f32
%c0 = arith.constant 0 : index
%0 = hal.interface.binding.subspan layout(<bindings = [#hal.pipeline.binding<storage_buffer, "ReadOnly|Indirect">, #hal.pipeline.binding<storage_buffer, Indirect>], flags = Indirect>) binding(0) alignment(64) offset(%c0) flags("ReadOnly|Indirect") : !flow.dispatch.tensor<readonly:tensor<1x250x8x4xf32>>
%1 = hal.interface.binding.subspan layout(<bindings = [#hal.pipeline.binding<storage_buffer, "ReadOnly|Indirect">, #hal.pipeline.binding<storage_buffer, Indirect>], flags = Indirect>) binding(1) alignment(64) offset(%c0) flags(Indirect) : !flow.dispatch.tensor<writeonly:tensor<1x1000xf32>>
%2 = flow.dispatch.tensor.load %0, offsets = [0, 0, 0, 0], sizes = [1, 250, 8, 4], strides = [1, 1, 1, 1] : !flow.dispatch.tensor<readonly:tensor<1x250x8x4xf32>> -> tensor<1x250x8x4xf32>
%3 = tensor.empty() : tensor<1x1000xf32>
%unpack = tensor.unpack %2 outer_dims_perm = [0, 1] inner_dims_pos = [0, 1] inner_tiles = [8, 4] into %3 {lowering_config = #iree_codegen.lowering_config<tile_sizes = [[0, 1000], [8, 4], [0, 0], [0, 0]]>} : tensor<1x250x8x4xf32> -> tensor<1x1000xf32>
%4 = linalg.generic {indexing_maps = [affine_map<(d0, d1) -> (d0, d1)>, affine_map<(d0, d1) -> (d0, d1)>], iterator_types = ["parallel", "parallel"]} ins(%unpack : tensor<1x1000xf32>) outs(%3 : tensor<1x1000xf32>) attrs = {lowering_config = #iree_codegen.lowering_config<tile_sizes = [[0, 1000], [8, 4], [0, 0], [0, 0]]>} {
^bb0(%in: f32, %out: f32):
%5 = arith.addf %in, %cst : f32
linalg.yield %5 : f32
} -> tensor<1x1000xf32>
flow.dispatch.tensor.store %4, %1, offsets = [0, 0], sizes = [1, 1000], strides = [1, 1] : tensor<1x1000xf32> -> !flow.dispatch.tensor<writeonly:tensor<1x1000xf32>>
return
}
}
// -----// IR Dump After TileAndDistributeToWorkgroupsPass (iree-codegen-tile-and-distribute-to-workgroups) //----- //
func.func @torch_jit$async_dispatch_24_unpack_elementwise_1x1000_f32_dispatch_0_unpack_elementwise_1x1000_f32() attributes {translation_info = #iree_codegen.translation_info<CPUDoubleTilingExpert>} {
%c250 = arith.constant 250 : index
%c1000 = arith.constant 1000 : index
%cst = arith.constant 0.000000e+00 : f32
%c0 = arith.constant 0 : index
%0 = hal.interface.binding.subspan layout(<bindings = [#hal.pipeline.binding<storage_buffer, "ReadOnly|Indirect">, #hal.pipeline.binding<storage_buffer, Indirect>], flags = Indirect>) binding(0) alignment(64) offset(%c0) flags("ReadOnly|Indirect") : !flow.dispatch.tensor<readonly:tensor<1x250x8x4xf32>>
%1 = hal.interface.binding.subspan layout(<bindings = [#hal.pipeline.binding<storage_buffer, "ReadOnly|Indirect">, #hal.pipeline.binding<storage_buffer, Indirect>], flags = Indirect>) binding(1) alignment(64) offset(%c0) flags(Indirect) : !flow.dispatch.tensor<writeonly:tensor<1x1000xf32>>
%2 = flow.dispatch.tensor.load %0, offsets = [0, %c0, 0, 0], sizes = [1, %c250, 8, 4], strides = [1, 1, 1, 1] : !flow.dispatch.tensor<readonly:tensor<1x250x8x4xf32>> -> tensor<1x?x8x4xf32>
%3 = tensor.empty() : tensor<8x1000xf32>
%unpack = tensor.unpack %2 outer_dims_perm = [0, 1] inner_dims_pos = [0, 1] inner_tiles = [8, 4] into %3 {lowering_config = #iree_codegen.lowering_config<tile_sizes = [[0, 1000], [8, 4], [0, 0], [0, 0]]>} : tensor<1x?x8x4xf32> -> tensor<8x1000xf32>
%4 = tensor.empty() : tensor<1x1000xf32>
%extracted_slice = tensor.extract_slice %unpack[0, 0] [1, 1000] [1, 1] : tensor<8x1000xf32> to tensor<1x1000xf32>
%5 = linalg.generic {indexing_maps = [affine_map<(d0, d1) -> (d0, d1)>, affine_map<(d0, d1) -> (d0, d1)>], iterator_types = ["parallel", "parallel"]} ins(%extracted_slice : tensor<1x1000xf32>) outs(%4 : tensor<1x1000xf32>) attrs = {lowering_config = #iree_codegen.lowering_config<tile_sizes = [[0, 1000], [8, 4], [0, 0], [0, 0]]>} {
^bb0(%in: f32, %out: f32):
%6 = arith.addf %in, %cst : f32
linalg.yield %6 : f32
} -> tensor<1x1000xf32>
%cast = tensor.cast %5 : tensor<1x1000xf32> to tensor<1x?xf32>
flow.dispatch.tensor.store %cast, %1, offsets = [0, %c0], sizes = [1, %c1000], strides = [1, 1] : tensor<1x?xf32> -> !flow.dispatch.tensor<writeonly:tensor<1x1000xf32>>
return
}
Good point, which made me notice: isn't this unpack wrong?
%unpack = tensor.unpack %0
outer_dims_perm = [0, 1]
inner_dims_pos = [0, 1]
inner_tiles = [8, 4] into %1 : tensor<1x250x8x4xf32> -> tensor<1x1000xf32>
it should be
%unpack = tensor.unpack %0
outer_dims_perm = [0, 1]
inner_dims_pos = [0, 1]
inner_tiles = [8, 4] into %1 : tensor<1x250x8x4xf32> -> tensor<8x1000xf32>
This issue seems to be a friend of #18603. We have:
%115 = linalg.generic
{indexing_maps = [#map1, #map1, #map1, #map1], iterator_types = []}
ins(%112, %113, %114 : tensor<i1>, tensor<i64>, tensor<i64>) outs(%27 : tensor<i64>) {
^bb0(%in: i1, %in_76: i64, %in_77: i64, %out: i64):
%216 = arith.select %in, %in_76, %in_77 : i64
linalg.yield %216 : i64
} -> tensor<i64>
...
%extracted_61 = tensor.extract %115[] : tensor<i64>
...
%131 = arith.index_cast %extracted_61 : i64 to index
%212 = iree_encoding.unset_encoding
%211 : tensor<?x1000xf32, #iree_encoding.encoding<operand_index = 2 : index,
op_type = matmul, element_types = [f32, f32, f32],
user_indexing_maps = [#map23, #map24, #map25],
round_dims_to = array<i64: 32, 32, 32>>> -> tensor<?x1000xf32>
%extracted_slice_75 = tensor.extract_slice %212[0, 0] [%131, 1000] [1, 1] : tensor<?x1000xf32> to tensor<?x1000xf32>
%213 = linalg.generic
{indexing_maps = [#map18, #map21, #map18],
iterator_types = ["parallel", "parallel"]}
ins(%extracted_slice_75, %cst_16 : tensor<?x1000xf32>, tensor<1000xf32>)
outs(%206 : tensor<?x1000xf32>) {
^bb0(%in: f32, %in_76: f32, %out: f32):
%216 = arith.addf %in, %in_76 : f32
linalg.yield %216 : f32
} -> tensor<?x1000xf32>
cc @zjgarvey
This unpack is valid because unpack ops have extract_slice semantics. You can think of it as the inverse of the pack op: the pack op has padding semantics, and the unpack op has extract_slice semantics. It is valid to fold unpack -> extract_slice into a single unpack op. One of the ideas of having a destination tensor for the unpack op is that it describes the shape:
%unpack = tensor.unpack %0
outer_dims_perm = [0, 1]
inner_dims_pos = [0, 1]
inner_tiles = [8, 4] into %1 : tensor<1x250x8x4xf32> -> tensor<1x1000xf32>
Thanks, that makes sense. However, I am not sure whether we intended to reach this unpack with the extract_slice semantics, or whether it is a bug caused by having shape computation encoded in the tensor math.
@nirvedhmeshram I'll focus on getting the where.self op to return scalar arithmetic when possible.
Sounds good, I will check if we want to support unpack with extract slice fusion as well and either add that support or disable this fusion in such cases based on where that discussion goes.
Yes, it is intended. It is not a bug. I guess the behavior is triggered by the tiling implementation: https://github.com/llvm/llvm-project/blob/fc4b1a303b296d02f6243a083510c4ee7f290ab0/mlir/lib/Dialect/Tensor/IR/TensorTilingInterfaceImpl.cpp#L561-L588
Looking at the implementation, I think the issue is that it is not treated as a perfect tiling case. We can try to enhance the logic. My guess is that the value of tileSize[0] is 1 (which is the output shape). However, it is not a multiple of the inner tile size (which is 8 in this case), so it goes down the non-perfect-tiling path.
One way to enhance the logic is passing the size of the destination tensor to the getUnpackTileDimInfo function. If the sizes match, it returns the perfect tiling config.
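To make the suggestion concrete, here is a standalone C++ sketch of the extra check; it does not match the real getUnpackTileDimInfo signature, and the helper name and parameters are hypothetical:
// Decides whether a requested tile along one unpacked dimension can take
// the perfect-tiling path. All names here are illustrative only.
static bool isPerfectUnpackTile(int64_t tileSize, int64_t innerTileSize,
                                int64_t destDimSize) {
  // Existing behavior: tiles that are a multiple of the inner tile size
  // are already treated as perfect.
  if (tileSize % innerTileSize == 0)
    return true;
  // Proposed enhancement: a tile that spans the whole destination
  // dimension (e.g. tileSize == 1 for the 1x1000 output with inner tile 8)
  // only drops rows the destination shape already excludes, so no extra
  // tensor.extract_slice is needed.
  return tileSize == destDimSize;
}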
This is merely based on observation of this IR. Since we are adding zeros, can't we DCE the linalg.generic?
Good question. I honestly don't know where it should happen. It is not easy to identify these cases (e.g., transpose, etc.) at the Linalg level, so we typically rely on ConstEval. It's easier if we can do it at a higher level (like arith, or the input dialects).
Actually, this came up in a voice meeting we had last week, and we still want to support this dispatch: the constant is often zero due to fake weights, but it might not be zero in real use cases.
This is fixed with the latest pull. PTAL.
This issue is not seen anymore with the latest build, so closing this.
What happened?
For the attached IR, seeing the following error:
command:
and this linalg.mlir was generated with the following command:
linalg.mlir.txt model.torch_onnx.mlir.txt
Steps to reproduce your issue
Mentioned above
What component(s) does this issue relate to?
Compiler
Version information
No response
Additional context
No response