iree-org / iree

A retargetable MLIR-based machine learning compiler and runtime toolkit.
http://iree.dev/
Apache License 2.0

Unknown error on Vulkan backend #17060

Open gpetters-amd opened 4 months ago

gpetters-amd commented 4 months ago

What happened?

The runtime causes a device failure on an AMD 780M. It looks like some kind of memory issue, but unusually it's a deallocation that fails rather than an allocation.

EXEC @main
D:\a\iree\iree\c\runtime\src\iree\hal\drivers\vulkan\direct_command_queue.cc:114: UNKNOWN; VkResult=4294967283; while invoking native function hal.device.queue.dealloca; while calling import;
[ 2]   native hal.device.queue.dealloca:0 -
[ 1] bytecode compiled_vae.main$async:27102 tmp.txt:251:3
[ 0] bytecode compiled_vae.main:62 tmp.txt:251:3; invoking function 'main'

The reproducer is 170MB, so I can't upload it. Ask me and I'll send it to anyone trying to reproduce it.

Steps to reproduce your issue

  1. iree-compile tmp.txt --iree-vulkan-target-triple=rdna2-unknown-windows --iree-stream-resource-index-bits=64 --iree-hal-target-backends=vulkan-spirv -o tmp.vmfb
  2. iree-run-module --device=vulkan --function=main --input='1x4x64x64xf16' --module=tmp.vmfb

What component(s) does this issue relate to?

Runtime

Version information

b4273a4bfc66ba6dd8f62f6483d74d42a7b936f1

Additional context

No response

gpetters-amd commented 4 months ago

Here's the reproducer.

ScottTodd commented 3 months ago

FWIW, I tried to reproduce this on my machine (NVIDIA 2080TI GPU) both without --iree-vulkan-target-triple and with --iree-vulkan-target-triple=turing-unknown-windows. Both of those failed to compile, making this tricky to help with as long as the pipeline is this brittle.

With turing-unknown-windows:

λ D:\dev\projects\iree-build\tools\iree-compile.exe D:\dev\projects\iree-tmp\issue_17060.mlir --iree-vulkan-target-triple=turing-unknown-windows --iree-stream-resource-index-bits=64 --iree-hal-target-backends=vulkan-spirv --iree-hal-executable-debug-level=3 -o D:\dev\projects\iree-tmp\issue_17060.vmfb
failed to translate executables
failed to translate executables
failed to translate executables
<unknown>:0: error: operands must be in the order AOp, BOp, COp
<unknown>:0: note: see current operation: %78 = "gpu.subgroup_mma_compute"(%54, %70, %arg5) : (!gpu.mma_matrix<16x16xf32, "COp">, !gpu.mma_matrix<16x16xf32, "COp">, !gpu.mma_matrix<16x16xf32, "COp">) -> !gpu.mma_matrix<16x16xf32, "COp">
D:\dev\projects\iree-tmp\issue_17060.mlir:578:8: error: failed to run translation of source executable to target executable for backend #hal.executable.target<"vulkan-spirv"

With no target triple (conservative default):

λ D:\dev\projects\iree-build\tools\iree-compile.exe D:\dev\projects\iree-tmp\issue_17060.mlir --iree-stream-resource-index-bits=64 --iree-hal-target-backends=vulkan-spirv --iree-hal-executable-debug-level=3 -o D:\dev\projects\iree-tmp\issue_17060.vmfb
failed to translate executables
failed to translate executables
failed to translate executables
failed to translate executables
failed to translate executables
failed to translate executables
D:\dev\projects\iree-tmp\issue_17060.mlir:1509:8: error: failed to legalize operation 'arith.fptosi' that was explicitly marked illegal
%577 = torch.prims.convert_element_type %576, %int4 : !torch.vtensor<[128],f32>, !torch.int -> !torch.vtensor<[128],si64>
       ^

... (other similar errors) ...

D:\dev\projects\iree-tmp\issue_17060.mlir:310:22: error: 'func.func' op uses 8388736 bytes of shared memory; exceeded the limit of 16384 bytes
%result0, %result1 = torch.aten.var_mean.correction %17, %18, %int0_18, %true : !torch.vtensor<[1,32,16,4096],f32>, !torch.list<int>, !torch.int, !torch.bool -> !torch.vtensor<[1,32,1,1],f32>, !torch.vtensor<[1,32,1,1],f32>
                     ^
D:\dev\projects\iree-tmp\issue_17060.mlir:253:6: note: called from
%1 = call @decode_inp(%0) : (!torch.vtensor<[1,4,64,64],f16>) -> !torch.vtensor<[1,128,512,512],f32>
     ^
powderluv commented 3 months ago

Does it need to be tuned for 780M shared memory sizes?

gpetters-amd commented 3 months ago

Does it need to be tuned for 780M shared memory sizes?

It's happening on 7900s now too, so I don't think it's a hardware issue. Maybe it's a driver thing; I'm not sure how we could effectively test that, though.

antiagainst commented 3 months ago

780M is RDNA3. I need to set up a dev environment on my machine and have various meetings, so I won't get to this until later today or tomorrow. In the meantime, can you try to compile with rdna3-unknown-unknown and run?

gpetters-amd commented 3 months ago

780M is RDNA3. I need to set up a dev environment on my machine and have various meetings, so I won't get to this until later today or tomorrow. In the meantime, can you try to compile with rdna3-unknown-unknown and run?

Yep, I'm getting the same compile error that @ScottTodd hit when compiling for turing-unknown-unknown.

antiagainst commented 3 months ago

Okay, I can finally repro the originally reported runtime issue. It is a driver timeout, likely because the iGPU is weak and we are also not generating code with WMMA ops due to the rdna2 triple. The corporate machine I have right now with the 780M is a pain to develop on: lots of restrictions, and I still cannot get a functioning toolchain (both MSVC and Clang broke for various reasons), so I need to build on another Windows machine and copy over. I'll need to figure out a better story for working with it. A few things to try out:

  1. @gpetters-amd can you try increasing the timeout threshold on Windows following https://learn.microsoft.com/en-us/windows-hardware/drivers/display/tdr-registry-keys (a sketch of the registry change follows this list)? I cannot do it on my side because I cannot modify the registry. I just want to confirm the issue.
  2. If the above works (or even if it doesn't), try to capture a Tracy profile to see which kernel is particularly slow.
  3. We need to fix the compilation issue for rdna3 to generate faster code for the igpu.
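
For item 1, a minimal sketch of the registry change, assuming an elevated command prompt: the key path and value names come from the linked TDR page, the 60-second delay is only an illustrative value, and a reboot is needed for it to take effect.

:: Raise the GPU Timeout Detection and Recovery (TDR) limits so long-running dispatches are not killed by the driver.
reg add "HKLM\SYSTEM\CurrentControlSet\Control\GraphicsDrivers" /v TdrDelay /t REG_DWORD /d 60 /f
reg add "HKLM\SYSTEM\CurrentControlSet\Control\GraphicsDrivers" /v TdrDdiDelay /t REG_DWORD /d 60 /f
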
antiagainst commented 3 months ago

Pasting the problematic dispatch here. Repro with tools/iree-compile --compile-from=executable-configurations:

hal.executable public @main$async_dispatch_20 {
  hal.executable.variant public @vulkan_spirv_fb target(<"vulkan-spirv", "vulkan-spirv-fb", {spirv.target_env = #spirv.target_env<#spirv.vce<v1.6, [Shader, Float64, Float16, Int64, Int16, Int8, StorageBuffer16BitAccess, StorageUniform16, StoragePushConstant16, StorageBuffer8BitAccess, UniformAndStorageBuffer8BitAccess, StoragePushConstant8, GroupNonUniform, GroupNonUniformVote, GroupNonUniformArithmetic, GroupNonUniformBallot, GroupNonUniformShuffle, GroupNonUniformShuffleRelative, GroupNonUniformClustered, GroupNonUniformQuad, PhysicalStorageBufferAddresses, VariablePointers, VariablePointersStorageBuffer, DotProduct, DotProductInputAll, DotProductInput4x8BitPacked, DotProductInput4x8Bit, CooperativeMatrixKHR], [SPV_KHR_16bit_storage, SPV_KHR_8bit_storage, SPV_KHR_integer_dot_product, SPV_KHR_storage_buffer_storage_class, SPV_KHR_physical_storage_buffer, SPV_KHR_variable_pointers, SPV_KHR_cooperative_matrix]>, api=Vulkan, AMD:DiscreteGPU, #spirv.resource_limits<max_compute_shared_memory_size = 65536, max_compute_workgroup_invocations = 1024, max_compute_workgroup_size = [1024, 1024, 1024], subgroup_size = 64, min_subgroup_size = 32, max_subgroup_size = 64, cooperative_matrix_properties_khr = [#spirv.coop_matrix_props_khr<m_size = 16, n_size = 16, k_size = 16, a_type = i8, b_type = i8, c_type = i32, result_type = i32, acc_sat = false, scope = <Subgroup>>, #spirv.coop_matrix_props_khr<m_size = 16, n_size = 16, k_size = 16, a_type = f16, b_type = f16, c_type = f16, result_type = f16, acc_sat = false, scope = <Subgroup>>, #spirv.coop_matrix_props_khr<m_size = 16, n_size = 16, k_size = 16, a_type = f16, b_type = f16, c_type = f32, result_type = f32, acc_sat = false, scope = <Subgroup>>]>>}>) {
    hal.executable.export public @main$async_dispatch_20_matmul_transpose_b_4096x512x512_f16xf16xf32 ordinal(0) layout(#hal.pipeline.layout<push_constants = 0, sets = [<0, bindings = [<0, storage_buffer, ReadOnly>, <1, storage_buffer, ReadOnly>, <2, storage_buffer>]>]>) attributes {hal.interface.bindings = [#hal.interface.binding<0, 0>, #hal.interface.binding<0, 1>, #hal.interface.binding<0, 2>]} {
    ^bb0(%arg0: !hal.device):
      %x, %y, %z = flow.dispatch.workgroup_count_from_slice 
      hal.return %x, %y, %z : index, index, index
    }
    builtin.module {
      func.func @main$async_dispatch_20_matmul_transpose_b_4096x512x512_f16xf16xf32() attributes {translation_info = #iree_codegen.translation_info<SPIRVCooperativeMatrixVectorize workgroup_size = [64, 2, 1] subgroup_size = 32, {pipeline_depth = 1 : i64, store_stage = 0 : i64}>} {
        %cst = arith.constant 0.000000e+00 : f32
        %c128 = arith.constant 128 : index
        %c86398720 = arith.constant 86398720 : index
        %c86397696 = arith.constant 86397696 : index
        %c16877632 = arith.constant 16877632 : index
        %0 = hal.interface.binding.subspan set(0) binding(0) type(storage_buffer) alignment(64) offset(%c128) flags(ReadOnly) : !flow.dispatch.tensor<readonly:tensor<4096x512xf16>>
        %1 = hal.interface.binding.subspan set(0) binding(1) type(storage_buffer) alignment(64) offset(%c86398720) flags(ReadOnly) : !flow.dispatch.tensor<readonly:tensor<512x512xf16>>
        %2 = hal.interface.binding.subspan set(0) binding(1) type(storage_buffer) alignment(64) offset(%c86397696) flags(ReadOnly) : !flow.dispatch.tensor<readonly:tensor<512xf16>>
        %3 = hal.interface.binding.subspan set(0) binding(2) type(storage_buffer) alignment(64) offset(%c16877632) : !flow.dispatch.tensor<writeonly:tensor<4096x512xf16>>
        %4 = flow.dispatch.tensor.load %0, offsets = [0, 0], sizes = [4096, 512], strides = [1, 1] : !flow.dispatch.tensor<readonly:tensor<4096x512xf16>> -> tensor<4096x512xf16>
        %5 = flow.dispatch.tensor.load %1, offsets = [0, 0], sizes = [512, 512], strides = [1, 1] : !flow.dispatch.tensor<readonly:tensor<512x512xf16>> -> tensor<512x512xf16>
        %6 = flow.dispatch.tensor.load %2, offsets = [0], sizes = [512], strides = [1] : !flow.dispatch.tensor<readonly:tensor<512xf16>> -> tensor<512xf16>
        %7 = tensor.empty() : tensor<4096x512xf16>
        %8 = tensor.empty() : tensor<4096x512xf32>
        %9 = linalg.fill ins(%cst : f32) outs(%8 : tensor<4096x512xf32>) -> tensor<4096x512xf32>
        %10 = linalg.generic {indexing_maps = [affine_map<(d0, d1, d2) -> (d0, d2)>, affine_map<(d0, d1, d2) -> (d1, d2)>, affine_map<(d0, d1, d2) -> (d0, d1)>], iterator_types = ["parallel", "parallel", "reduction"]} ins(%4, %5 : tensor<4096x512xf16>, tensor<512x512xf16>) outs(%9 : tensor<4096x512xf32>) attrs =  {lowering_config = #iree_codegen.lowering_config<tile_sizes = [[64, 128], [32, 64], [0, 0, 32], [16, 16, 16]]>} {
        ^bb0(%in: f16, %in_0: f16, %out: f32):
          %12 = arith.extf %in : f16 to f32
          %13 = arith.extf %in_0 : f16 to f32
          %14 = arith.mulf %12, %13 : f32
          %15 = arith.addf %out, %14 : f32
          linalg.yield %15 : f32
        } -> tensor<4096x512xf32>
        %11 = linalg.generic {indexing_maps = [affine_map<(d0, d1) -> (d0, d1)>, affine_map<(d0, d1) -> (d1)>, affine_map<(d0, d1) -> (d0, d1)>], iterator_types = ["parallel", "parallel"]} ins(%10, %6 : tensor<4096x512xf32>, tensor<512xf16>) outs(%7 : tensor<4096x512xf16>) {
        ^bb0(%in: f32, %in_0: f16, %out: f16):
          %12 = arith.truncf %in : f32 to f16
          %13 = arith.addf %12, %in_0 : f16
          linalg.yield %13 : f16
        } -> tensor<4096x512xf16>
        flow.dispatch.tensor.store %11, %3, offsets = [0, 0], sizes = [4096, 512], strides = [1, 1] : tensor<4096x512xf16> -> !flow.dispatch.tensor<writeonly:tensor<4096x512xf16>>
        return
      }
    }
  }
}
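
Concretely, that repro would look roughly like the following, assuming the dispatch above is saved to its own .mlir file (the file names here are only placeholders):

iree-compile --compile-from=executable-configurations main_dispatch_20.mlir -o main_dispatch_20.vmfb
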
antiagainst commented 3 months ago

It seems the issue is that inferFragType does not see through arith.extf ops, so the matmul operands all end up inferred as "COp" fragments (hence the operand-order error above).

antiagainst commented 3 months ago

https://github.com/llvm/llvm-project/pull/91988 fixes the compilation issue so this compiles for rdna3.

antiagainst commented 3 months ago

https://github.com/llvm/llvm-project/pull/91988 has landed. We just need an LLVM integration to pull it in: https://github.com/iree-org/iree/pull/17380