Can you try to pass it through iree-reduce?
So further investigation shows that this is the offending IR/dispatch (link). We can reproduce the issue with this IR directly by doing:
wget https://gist.githubusercontent.com/raikonenfnu/c0efc1f81f717914ce91b29e10514efe/raw/f80f1c6c7c215e29c6275e585207cf61aa5a07d9/compilable_stencil_broken_dispatch.mlir
/path/to/tools/iree-compile compilable_stencil_broken_dispatch.mlir --iree-input-type=tm_tensor --iree-vm-bytecode-module-output-format=flatbuffer-binary --iree-hal-target-backends=vulkan -iree-vulkan-target-triple=rdna3-unknown-linux --iree-rocm-target-chip=gfx1100 --iree-rocm-link-bc=true --mlir-print-op-on-diagnostic=false --iree-llvmcpu-target-cpu-features=host --iree-stream-resource-max-allocation-size=4294967295 --iree-vm-bytecode-module-strip-source-map=true --iree-util-zero-fill-elided-attrs --iree-opt-strip-assertions=true --verify=true --iree-vm-target-truncate-unsupported-floats --iree-codegen-check-ir-before-llvm-conversion=false --iree-vm-bytecode-module-output-format=flatbuffer-binary -o dispatch.vmfb
It seems that to fix this, we'd need to add support for multi-dim distribution for scf.ForOp (link), by adding the logic from this patch (link).
The current missing piece for using the logic from that patch is that the existing pattern expects distributedType to be pre-determined, i.e. taken from the result type of the warpExecuteOnLane0Op, as seen in these examples (link). However, for our current IR this is not the case:
vector.warp_execute_on_lane_0(%0)[256] {
  %6 = scf.for %arg0 = %c0 to %c40 step %c8 iter_args(%arg1 = %cst_0) -> (vector<8x128xf32>) {
    %15 = scf.for %arg2 = %c0 to %c384 step %c128 iter_args(%arg3 = %arg1) -> (vector<8x128xf32>) {
      %16 = vector.transfer_read %1[%workgroup_id_y, %workgroup_id_x, %arg0, %arg2], %cst_3 {in_bounds = [true, true]} : memref<2x32x40x384xf32, #hal.descriptor_type<storage_buffer>>, vector<8x128xf32>
      %17 = arith.addf %16, %arg3 : vector<8x128xf32>
      scf.yield %17 : vector<8x128xf32>
    }
    scf.yield %15 : vector<8x128xf32>
  }
  %7 = vector.shape_cast %6 : vector<8x128xf32> to vector<1024xf32>
  %8 = vector.reduction <add>, %7, %cst_3 : vector<1024xf32> into f32
  %9 = vector.broadcast %8 : f32 to vector<8x128xf32>
  %10 = arith.divf %9, %cst : vector<8x128xf32>
  %11 = arith.truncf %10 : vector<8x128xf32> to vector<8x128xf16>
  %12 = vector.broadcast %5 : f16 to vector<8x128xf16>
  %13 = arith.addf %11, %12 : vector<8x128xf16>
  %14 = math.rsqrt %13 : vector<8x128xf16>
  scf.for %arg0 = %c0 to %c40 step %c8 {
    %15 = vector.transfer_read %3[%workgroup_id_y, %workgroup_id_x], %cst_3 {in_bounds = [true, true], permutation_map = affine_map<(d0, d1) -> (0, 0)>} : memref<2x32xf32, #hal.descriptor_type<storage_buffer>>, vector<8x128xf32>
    %16 = arith.truncf %15 : vector<8x128xf32> to vector<8x128xf16>
    scf.for %arg1 = %c0 to %c384 step %c128 {
      %17 = vector.transfer_read %2[%workgroup_id_y, %workgroup_id_x, %arg0, %arg1], %cst_2 {in_bounds = [true, true]} : memref<2x32x40x384xf16, #hal.descriptor_type<storage_buffer>>, vector<8x128xf16>
      %18 = arith.subf %17, %16 : vector<8x128xf16>
      %19 = arith.mulf %18, %14 : vector<8x128xf16>
      vector.transfer_write %19, %4[%workgroup_id_y, %workgroup_id_x, %arg0, %arg1] {in_bounds = [true, true]} : vector<8x128xf16>, memref<2x32x40x384xf16, #hal.descriptor_type<storage_buffer>>
    }
  }
}
Hence we'd also need to determine the distributed type ourselves. A potential method is to flatten out the dimensions and divide by the group size to determine the distributed vector size, treating the other dimensions as tiled by 1.
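Worked out for the IR above (illustrative numbers only): vector<8x128xf32> flattens to 8 * 128 = 1024 elements, and 1024 / 256 lanes = 4 elements per thread, so the distributed type would be vector<1x4xf32>, with the leading dimension tiled by 1.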
This is probably related to https://github.com/openxla/iree/issues/15088 as well.
Yes, with #14656 reverted we won't see such a dispatch anymore. So this buys us some time to fix the underlying issue.
The warp_execute_on_lane_0 op region works as the boundary between the SIMD model (where one execution thread handles all data elements) and the SIMT model (where multiple threads each handle a subset of the data elements)--inside it's SIMD and outside it's SIMT. The conversion flow works by initially placing all ops inside the warp_execute_on_lane_0 op and then gradually moving them outside of the region by distributing them to threads, starting from the last op in the region.
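As a minimal illustration of that flow (a toy example, not this issue's IR): inside the region a warp of 32 lanes is written as if a single thread produced the whole vector<32xf32>; after distribution, each lane holds a vector<1xf32> slice outside the region.

// SIMD view: the region body is written as if one thread owned the
// full vector<32xf32>.
%r = vector.warp_execute_on_lane_0(%laneid)[32] -> (vector<1xf32>) {
  %v = "some_def"() : () -> (vector<32xf32>)
  vector.yield %v : vector<32xf32>
}
// SIMT view: outside the region, each of the 32 lanes holds its own
// distributed slice %r : vector<1xf32>.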
In the specific case here, we see an scf.for op as the last op in the initially formed warp_execute_on_lane_0 region. There are patterns that can handle scf.for ops yielding values; I think they should be extended to also handle the case here, which is actually simpler--the transformation logic stays the same. But we do need to make the rest of the ops work with n-D distribution. Specifically, some key pointers:
We need to support an n-D map here. Then, in upstream MLIR, the scf.for lowering pattern should be extended to support n-D distribution in getDistributedType, in a similar fashion to what we do in delinearizeLaneId, delinearizing and distributing the warp size to n-D; a rough sketch follows.
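As a rough sketch of that idea (a hypothetical helper assuming innermost-first lane assignment, in the spirit of delinearizeLaneId; not the actual upstream code):

#include <algorithm>
#include "llvm/ADT/SmallVector.h"
#include "mlir/IR/BuiltinTypes.h"

using namespace mlir;

// Delinearize the warp size over the vector shape, innermost dimension
// first, and divide each dimension by the number of lanes assigned to it.
static VectorType getNDDistributedType(VectorType type, int64_t warpSize) {
  llvm::SmallVector<int64_t> shape(type.getShape().begin(),
                                   type.getShape().end());
  for (int i = static_cast<int>(shape.size()) - 1; i >= 0 && warpSize > 1;
       --i) {
    int64_t lanes = std::min(warpSize, shape[i]);
    if (shape[i] % lanes != 0 || warpSize % lanes != 0)
      return {}; // Shape is not evenly distributable over the warp.
    shape[i] /= lanes;
    warpSize /= lanes;
  }
  if (warpSize != 1)
    return {}; // Leftover lanes that no dimension can absorb.
  return VectorType::get(shape, type.getElementType());
}
// E.g. vector<8x128xf32> with warp size 256: 128 lanes on the inner
// dimension and 2 on the outer one give a per-lane vector<4x1xf32>.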
https://github.com/llvm/llvm-project/pull/71193
This patch adds the required support to getDistributedType. After that, the n-D map support is simple:
diff --git a/compiler/src/iree/compiler/Codegen/Common/GPU/VectorReductionToGPU.cpp b/compiler/src/iree/compiler/Codegen/Common/GPU/VectorReductionToGPU.cpp
index 9f8c7e54b..0c75059ae 100644
--- a/compiler/src/iree/compiler/Codegen/Common/GPU/VectorReductionToGPU.cpp
+++ b/compiler/src/iree/compiler/Codegen/Common/GPU/VectorReductionToGPU.cpp
@@ -247,7 +247,9 @@ public:
// complex cases.
int64_t vecRank = vecType.getRank();
OpBuilder builder(val.getContext());
- map = AffineMap::get(vecRank, 0, builder.getAffineDimExpr(vecRank - 1));
+
+ map = AffineMap::getMinorIdentityMap(vecRank, vecRank,
+ builder.getContext());
return map;
};
RewritePatternSet patterns(ctx);
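For a rank-2 vector, this changes the distribution map from (d0, d1) -> (d1), which distributes only the innermost dimension, to the identity map (d0, d1) -> (d0, d1) produced by getMinorIdentityMap, so every vector dimension participates in the distribution.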
This should be handled with the new vector distribution pipeline instead.
What happened?
I'm trying to compile a tm_tensor model through iree-compile, and this is the output:
Steps to reproduce your issue
iree-compile - --iree-input-type=tm_tensor --iree-vm-bytecode-module-output-format=flatbuffer-binary --iree-hal-target-backends=vulkan --mlir-print-debuginfo --mlir-print-op-on-diagnostic=false --mlir-pass-pipeline-crash-reproducer=ireetmp/core-reproducer.mlir --iree-llvmcpu-target-cpu-features=host --iree-vulkan-target-env=#vk.target_env<v1.3, r(120), [VK_KHR_16bit_storage, VK_KHR_8bit_storage, VK_KHR_shader_float16_int8, VK_KHR_spirv_1_4, VK_KHR_storage_buffer_storage_class, VK_KHR_variable_pointers, VK_EXT_subgroup_size_control], AMD:DiscreteGPU, #vk.caps< maxComputeSharedMemorySize = 65536, maxComputeWorkGroupInvocations = 1024, maxComputeWorkGroupSize = dense<[1024, 1024, 1024]>: vector<3xi32>, subgroupSize = 64, subgroupFeatures = 255: i32, minSubgroupSize = 32, maxSubgroupSize = 64, shaderFloat16 = unit, shaderFloat64 = unit, shaderInt8 = unit, shaderInt16 = unit, shaderInt64 = unit, storageBuffer16BitAccess = unit, storagePushConstant16 = unit, uniformAndStorageBuffer16BitAccess = unit, storageBuffer8BitAccess = unit, storagePushConstant8 = unit, uniformAndStorageBuffer8BitAccess = unit, variablePointers = unit, variablePointersStorageBuffer = unit, shaderIntegerDotProduct = unit >> --iree-stream-resource-max-allocation-size=4294967295 --iree-vm-bytecode-module-strip-source-map=true --iree-util-zero-fill-elided-attrs --iree-opt-strip-assertions=true --verify=false -iree-vulkan-target-triple=rdna2-unknown-linux --iree-preprocessing-pass-pipeline=builtin.module(func.func(iree-flow-detach-elementwise-from-named-ops,iree-flow-convert-1x1-filter-conv2d-to-matmul,iree-preprocessing-convert-conv2d-to-img2col,iree-preprocessing-pad-linalg-ops{pad-size=32})) model.mlir
What component(s) does this issue relate to?
Compiler
Version information
iree-compile version is 20230930.661 @ 0af63addc8c2ea0bf903391a877bd60f52a10f73
Additional context
Here is the model.mlir file.