Can you try to pass it through iree-reduce?
So further investigation shows that this is the offending IR/dispatch (link). We can reproduce the issue with this IR directly by doing:
wget https://gist.githubusercontent.com/raikonenfnu/c0efc1f81f717914ce91b29e10514efe/raw/f80f1c6c7c215e29c6275e585207cf61aa5a07d9/compilable_stencil_broken_dispatch.mlir
/path/to/tools/iree-compile compilable_stencil_broken_dispatch.mlir --iree-input-type=tm_tensor --iree-vm-bytecode-module-output-format=flatbuffer-binary --iree-hal-target-backends=vulkan -iree-vulkan-target-triple=rdna3-unknown-linux --iree-rocm-target-chip=gfx1100 --iree-rocm-link-bc=true --mlir-print-op-on-diagnostic=false --iree-llvmcpu-target-cpu-features=host --iree-stream-resource-max-allocation-size=4294967295 --iree-vm-bytecode-module-strip-source-map=true --iree-util-zero-fill-elided-attrs --iree-opt-strip-assertions=true --verify=true --iree-vm-target-truncate-unsupported-floats --iree-codegen-check-ir-before-llvm-conversion=false --iree-vm-bytecode-module-output-format=flatbuffer-binary -o dispatch.vmfb
It seems that to fix this, we'd need to add support for multi-dim distribution for scf.ForOp (link), by adding the logic from this patch (link).
The current missing piece for using the logic from that patch is that the existing pattern expects distributedType to be pre-determined, i.e. taken from the result type of the warpExecuteOnLane0Op, as seen in these examples (link). However, for our current IR this is not the case:
vector.warp_execute_on_lane_0(%0)[256] {
  %6 = scf.for %arg0 = %c0 to %c40 step %c8 iter_args(%arg1 = %cst_0) -> (vector<8x128xf32>) {
    %15 = scf.for %arg2 = %c0 to %c384 step %c128 iter_args(%arg3 = %arg1) -> (vector<8x128xf32>) {
      %16 = vector.transfer_read %1[%workgroup_id_y, %workgroup_id_x, %arg0, %arg2], %cst_3 {in_bounds = [true, true]} : memref<2x32x40x384xf32, #hal.descriptor_type<storage_buffer>>, vector<8x128xf32>
      %17 = arith.addf %16, %arg3 : vector<8x128xf32>
      scf.yield %17 : vector<8x128xf32>
    }
    scf.yield %15 : vector<8x128xf32>
  }
  %7 = vector.shape_cast %6 : vector<8x128xf32> to vector<1024xf32>
  %8 = vector.reduction <add>, %7, %cst_3 : vector<1024xf32> into f32
  %9 = vector.broadcast %8 : f32 to vector<8x128xf32>
  %10 = arith.divf %9, %cst : vector<8x128xf32>
  %11 = arith.truncf %10 : vector<8x128xf32> to vector<8x128xf16>
  %12 = vector.broadcast %5 : f16 to vector<8x128xf16>
  %13 = arith.addf %11, %12 : vector<8x128xf16>
  %14 = math.rsqrt %13 : vector<8x128xf16>
  scf.for %arg0 = %c0 to %c40 step %c8 {
    %15 = vector.transfer_read %3[%workgroup_id_y, %workgroup_id_x], %cst_3 {in_bounds = [true, true], permutation_map = affine_map<(d0, d1) -> (0, 0)>} : memref<2x32xf32, #hal.descriptor_type<storage_buffer>>, vector<8x128xf32>
    %16 = arith.truncf %15 : vector<8x128xf32> to vector<8x128xf16>
    scf.for %arg1 = %c0 to %c384 step %c128 {
      %17 = vector.transfer_read %2[%workgroup_id_y, %workgroup_id_x, %arg0, %arg1], %cst_2 {in_bounds = [true, true]} : memref<2x32x40x384xf16, #hal.descriptor_type<storage_buffer>>, vector<8x128xf16>
      %18 = arith.subf %17, %16 : vector<8x128xf16>
      %19 = arith.mulf %18, %14 : vector<8x128xf16>
      vector.transfer_write %19, %4[%workgroup_id_y, %workgroup_id_x, %arg0, %arg1] {in_bounds = [true, true]} : vector<8x128xf16>, memref<2x32x40x384xf16, #hal.descriptor_type<storage_buffer>>
    }
  }
}
Hence we'd also need to determine the distributed type ourselves. A potential method is to flatten out the dimensions and divide by the group size to determine the distributed vector size, treating the other dimensions as tiled by 1.
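Worked out for the IR above (illustrative numbers only): vector<8x128xf32> flattens to 8 * 128 = 1024 elements, and 1024 / 256 lanes = 4 elements per thread, so the distributed type would be vector<1x4xf32>, with the leading dimension tiled by 1.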
This is probably related to https://github.com/openxla/iree/issues/15088 as well.
Yes, with #14656 reverted we won't see such a dispatch anymore. So this buys us some time to fix the underlying issue.
The warp_execute_on_lane_0 op region works as the boundary between the SIMD model (where one execution thread handles all data elements) and the SIMT model (where multiple threads each handle a subset of the data elements)--inside it's SIMD and outside it's SIMT. The conversion flow works by initially placing all ops inside the warp_execute_on_lane_0 op and then gradually moving them outside of the region by distributing them to threads, starting from the last op in the region.
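As a minimal illustration of that flow (a toy example, not this issue's IR): inside the region a warp of 32 lanes is written as if a single thread produced the whole vector<32xf32>; after distribution, each lane holds a vector<1xf32> slice outside the region.

// SIMD view: the region body is written as if one thread owned the
// full vector<32xf32>.
%r = vector.warp_execute_on_lane_0(%laneid)[32] -> (vector<1xf32>) {
  %v = "some_def"() : () -> (vector<32xf32>)
  vector.yield %v : vector<32xf32>
}
// SIMT view: outside the region, each of the 32 lanes holds its own
// distributed slice %r : vector<1xf32>.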
In the specific case here, we see an scf.for op as the last op in the initially formed warp_execute_on_lane_0 region. There are patterns that can handle scf.for ops yielding values; I think they should be extended to also handle the case here, which is actually simpler--the transformation logic stays the same. But we do need to make the rest of the ops work with n-D distribution. Specifically, some key pointers:
We need to support an n-D map here. Then, in upstream MLIR, the scf.for lowering pattern should be extended to support n-D distribution in getDistributedType, in a similar fashion to what we do in delinearizeLaneId, delinearizing and distributing the warp size to n-D; a rough sketch follows.
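As a rough sketch of that idea (a hypothetical helper assuming innermost-first lane assignment, in the spirit of delinearizeLaneId; not the actual upstream code):

#include <algorithm>
#include "llvm/ADT/SmallVector.h"
#include "mlir/IR/BuiltinTypes.h"

using namespace mlir;

// Delinearize the warp size over the vector shape, innermost dimension
// first, and divide each dimension by the number of lanes assigned to it.
static VectorType getNDDistributedType(VectorType type, int64_t warpSize) {
  llvm::SmallVector<int64_t> shape(type.getShape().begin(),
                                   type.getShape().end());
  for (int i = static_cast<int>(shape.size()) - 1; i >= 0 && warpSize > 1;
       --i) {
    int64_t lanes = std::min(warpSize, shape[i]);
    if (shape[i] % lanes != 0 || warpSize % lanes != 0)
      return {}; // Shape is not evenly distributable over the warp.
    shape[i] /= lanes;
    warpSize /= lanes;
  }
  if (warpSize != 1)
    return {}; // Leftover lanes that no dimension can absorb.
  return VectorType::get(shape, type.getElementType());
}
// E.g. vector<8x128xf32> with warp size 256: 128 lanes on the inner
// dimension and 2 on the outer one give a per-lane vector<4x1xf32>.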
https://github.com/llvm/llvm-project/pull/71193
This patch adds the required support to getDistributedType. After that, the n-D map support is simple:
diff --git a/compiler/src/iree/compiler/Codegen/Common/GPU/VectorReductionToGPU.cpp b/compiler/src/iree/compiler/Codegen/Common/GPU/VectorReductionToGPU.cpp
index 9f8c7e54b..0c75059ae 100644
--- a/compiler/src/iree/compiler/Codegen/Common/GPU/VectorReductionToGPU.cpp
+++ b/compiler/src/iree/compiler/Codegen/Common/GPU/VectorReductionToGPU.cpp
@@ -247,7 +247,9 @@ public:
// complex cases.
int64_t vecRank = vecType.getRank();
OpBuilder builder(val.getContext());
- map = AffineMap::get(vecRank, 0, builder.getAffineDimExpr(vecRank - 1));
+
+ map = AffineMap::getMinorIdentityMap(vecRank, vecRank,
+ builder.getContext());
return map;
};
RewritePatternSet patterns(ctx);
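For a rank-2 vector, this changes the distribution map from (d0, d1) -> (d1), which distributes only the innermost dimension, to the identity map (d0, d1) -> (d0, d1) produced by getMinorIdentityMap, so every vector dimension participates in the distribution.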
This should be handled with the new vector distribution pipeline instead.
What happened?
I'm trying to compile a tm_tensor model through iree-compile, and this is the output:
Steps to reproduce your issue
iree-compile - --iree-input-type=tm_tensor --iree-vm-bytecode-module-output-format=flatbuffer-binary --iree-hal-target-backends=vulkan --mlir-print-debuginfo --mlir-print-op-on-diagnostic=false --mlir-pass-pipeline-crash-reproducer=ireetmp/core-reproducer.mlir --iree-llvmcpu-target-cpu-features=host --iree-vulkan-target-env=#vk.target_env<v1.3, r(120), [VK_KHR_16bit_storage, VK_KHR_8bit_storage, VK_KHR_shader_float16_int8, VK_KHR_spirv_1_4, VK_KHR_storage_buffer_storage_class, VK_KHR_variable_pointers, VK_EXT_subgroup_size_control], AMD:DiscreteGPU, #vk.caps< maxComputeSharedMemorySize = 65536, maxComputeWorkGroupInvocations = 1024, maxComputeWorkGroupSize = dense<[1024, 1024, 1024]>: vector<3xi32>, subgroupSize = 64, subgroupFeatures = 255: i32, minSubgroupSize = 32, maxSubgroupSize = 64, shaderFloat16 = unit, shaderFloat64 = unit, shaderInt8 = unit, shaderInt16 = unit, shaderInt64 = unit, storageBuffer16BitAccess = unit, storagePushConstant16 = unit, uniformAndStorageBuffer16BitAccess = unit, storageBuffer8BitAccess = unit, storagePushConstant8 = unit, uniformAndStorageBuffer8BitAccess = unit, variablePointers = unit, variablePointersStorageBuffer = unit, shaderIntegerDotProduct = unit >> --iree-stream-resource-max-allocation-size=4294967295 --iree-vm-bytecode-module-strip-source-map=true --iree-util-zero-fill-elided-attrs --iree-opt-strip-assertions=true --verify=false -iree-vulkan-target-triple=rdna2-unknown-linux --iree-preprocessing-pass-pipeline=builtin.module(func.func(iree-flow-detach-elementwise-from-named-ops,iree-flow-convert-1x1-filter-conv2d-to-matmul,iree-preprocessing-convert-conv2d-to-img2col,iree-preprocessing-pad-linalg-ops{pad-size=32})) model.mlir
What component(s) does this issue relate to?
Compiler
Version information
iree-compile version is 20230930.661 @ 0af63addc8c2ea0bf903391a877bd60f52a10f73
Additional context
Here is the model.mlir file.