iree-org / iree

A retargetable MLIR-based machine learning compiler and runtime toolkit.
http://iree.dev/
Apache License 2.0

Unknown error on Vulkan backend #17060

Open gpetters-amd opened 4 months ago

gpetters-amd commented 4 months ago

What happened?

The runtime causes a device failure on an AMD 780M. It looks like some kind of memory issue, but unusually it's a deallocation that fails rather than an allocation.

EXEC @main
D:\a\iree\iree\c\runtime\src\iree\hal\drivers\vulkan\direct_command_queue.cc:114: UNKNOWN; VkResult=4294967283; while invoking native function hal.device.queue.dealloca; while calling import;
[ 2]   native hal.device.queue.dealloca:0 -
[ 1] bytecode compiled_vae.main$async:27102 tmp.txt:251:3
[ 0] bytecode compiled_vae.main:62 tmp.txt:251:3; invoking function 'main'

The reproducer is 170MB, so I can't upload it. Ask me and I'll send it to anyone trying to reproduce it.

Steps to reproduce your issue

  1. iree-compile tmp.txt --iree-vulkan-target-triple=rdna2-unknown-windows --iree-stream-resource-index-bits=64 --iree-hal-target-backends=vulkan-spirv -o tmp.vmfb
  2. iree-run-module --device=vulkan --function=main --input='1x4x64x64xf16' --module=tmp.vmfb

What component(s) does this issue relate to?

Runtime

Version information

b4273a4bfc66ba6dd8f62f6483d74d42a7b936f1

Additional context

No response

gpetters-amd commented 4 months ago

Here's the reproducer.

ScottTodd commented 3 months ago

FWIW, I tried to reproduce this on my machine (NVIDIA 2080TI GPU) both without --iree-vulkan-target-triple and with --iree-vulkan-target-triple=turing-unknown-windows. Both of those failed to compile, making this tricky to help with as long as the pipeline is this brittle.

With turing-unknown-windows:

λ D:\dev\projects\iree-build\tools\iree-compile.exe D:\dev\projects\iree-tmp\issue_17060.mlir --iree-vulkan-target-triple=turing-unknown-windows --iree-stream-resource-index-bits=64 --iree-hal-target-backends=vulkan-spirv --iree-hal-executable-debug-level=3 -o D:\dev\projects\iree-tmp\issue_17060.vmfb
failed to translate executables
failed to translate executables
failed to translate executables
<unknown>:0: error: operands must be in the order AOp, BOp, COp
<unknown>:0: note: see current operation: %78 = "gpu.subgroup_mma_compute"(%54, %70, %arg5) : (!gpu.mma_matrix<16x16xf32, "COp">, !gpu.mma_matrix<16x16xf32, "COp">, !gpu.mma_matrix<16x16xf32, "COp">) -> !gpu.mma_matrix<16x16xf32, "COp">
D:\dev\projects\iree-tmp\issue_17060.mlir:578:8: error: failed to run translation of source executable to target executable for backend #hal.executable.target<"vulkan-spirv"

With no target triple (conservative default):

λ D:\dev\projects\iree-build\tools\iree-compile.exe D:\dev\projects\iree-tmp\issue_17060.mlir --iree-stream-resource-index-bits=64 --iree-hal-target-backends=vulkan-spirv --iree-hal-executable-debug-level=3 -o D:\dev\projects\iree-tmp\issue_17060.vmfb
failed to translate executables
failed to translate executables
failed to translate executables
failed to translate executables
failed to translate executables
failed to translate executables
D:\dev\projects\iree-tmp\issue_17060.mlir:1509:8: error: failed to legalize operation 'arith.fptosi' that was explicitly marked illegal
%577 = torch.prims.convert_element_type %576, %int4 : !torch.vtensor<[128],f32>, !torch.int -> !torch.vtensor<[128],si64>
       ^

... (other similar errors) ...

D:\dev\projects\iree-tmp\issue_17060.mlir:310:22: error: 'func.func' op uses 8388736 bytes of shared memory; exceeded the limit of 16384 bytes
%result0, %result1 = torch.aten.var_mean.correction %17, %18, %int0_18, %true : !torch.vtensor<[1,32,16,4096],f32>, !torch.list<int>, !torch.int, !torch.bool -> !torch.vtensor<[1,32,1,1],f32>, !torch.vtensor<[1,32,1,1],f32>
                     ^
D:\dev\projects\iree-tmp\issue_17060.mlir:253:6: note: called from
%1 = call @decode_inp(%0) : (!torch.vtensor<[1,4,64,64],f16>) -> !torch.vtensor<[1,128,512,512],f32>
     ^
powderluv commented 3 months ago

Does it need to be tuned for 780M shared memory sizes?

gpetters-amd commented 3 months ago

Does it need to be tuned for 780M shared memory sizes?

It's happening on 7900s now too, so I don't think it's a hardware issue. Maybe it's a driver thing; I'm not sure how we could effectively test that, though.

antiagainst commented 3 months ago

780M is RDNA3. I need to set up a dev environment on my machine and have various meetings, so I won't get to this until later today or tomorrow. In the meantime, can you try to compile with rdna3-unknown-unknown and run?

gpetters-amd commented 3 months ago

780M is RDNA3. I need to set up a dev environment on my machine and have various meetings, so I won't get to this until later today or tomorrow. In the meantime, can you try to compile with rdna3-unknown-unknown and run?

Yep, I'm getting the same compile error that @ScottTodd hit when compiling for turing-unknown-unknown.

antiagainst commented 3 months ago

Okay, I can finally repro the originally reported runtime issue. It is a driver timeout, likely because the iGPU is weak and we are also not generating code with WMMA ops due to the rdna2 triple. The corporate machine I have right now with the 780M is a pain to develop on: lots of restrictions, and I still cannot get a functioning toolchain (both MSVC and Clang broke for various reasons), so I need to build on another Windows machine and copy over. I'll need to figure out a better story for working with it. A few things to try out:

  1. @gpetters-amd can you try increasing the timeout threshold on Windows following https://learn.microsoft.com/en-us/windows-hardware/drivers/display/tdr-registry-keys (a sketch of the registry change follows this list)? I cannot do it on my side because I cannot modify the registry. I just want to confirm the issue.
  2. If the above works (or even if it doesn't), try to capture a Tracy profile to see which kernel is particularly slow.
  3. We need to fix the compilation issue for rdna3 to generate faster code for the igpu.
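
For item 1, a minimal sketch of the registry change, assuming an elevated command prompt: the key path and value names come from the linked TDR page, the 60-second delay is only an illustrative value, and a reboot is needed for it to take effect.

:: Raise the GPU Timeout Detection and Recovery (TDR) limits so long-running dispatches are not killed by the driver.
reg add "HKLM\SYSTEM\CurrentControlSet\Control\GraphicsDrivers" /v TdrDelay /t REG_DWORD /d 60 /f
reg add "HKLM\SYSTEM\CurrentControlSet\Control\GraphicsDrivers" /v TdrDdiDelay /t REG_DWORD /d 60 /f
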
antiagainst commented 3 months ago

Pasting the problematic dispatch here. Repro with tools/iree-compile --compile-from=executable-configurations:

hal.executable public @main$async_dispatch_20 {
  hal.executable.variant public @vulkan_spirv_fb target(<"vulkan-spirv", "vulkan-spirv-fb", {spirv.target_env = #spirv.target_env<#spirv.vce<v1.6, [Shader, Float64, Float16, Int64, Int16, Int8, StorageBuffer16BitAccess, StorageUniform16, StoragePushConstant16, StorageBuffer8BitAccess, UniformAndStorageBuffer8BitAccess, StoragePushConstant8, GroupNonUniform, GroupNonUniformVote, GroupNonUniformArithmetic, GroupNonUniformBallot, GroupNonUniformShuffle, GroupNonUniformShuffleRelative, GroupNonUniformClustered, GroupNonUniformQuad, PhysicalStorageBufferAddresses, VariablePointers, VariablePointersStorageBuffer, DotProduct, DotProductInputAll, DotProductInput4x8BitPacked, DotProductInput4x8Bit, CooperativeMatrixKHR], [SPV_KHR_16bit_storage, SPV_KHR_8bit_storage, SPV_KHR_integer_dot_product, SPV_KHR_storage_buffer_storage_class, SPV_KHR_physical_storage_buffer, SPV_KHR_variable_pointers, SPV_KHR_cooperative_matrix]>, api=Vulkan, AMD:DiscreteGPU, #spirv.resource_limits<max_compute_shared_memory_size = 65536, max_compute_workgroup_invocations = 1024, max_compute_workgroup_size = [1024, 1024, 1024], subgroup_size = 64, min_subgroup_size = 32, max_subgroup_size = 64, cooperative_matrix_properties_khr = [#spirv.coop_matrix_props_khr<m_size = 16, n_size = 16, k_size = 16, a_type = i8, b_type = i8, c_type = i32, result_type = i32, acc_sat = false, scope = <Subgroup>>, #spirv.coop_matrix_props_khr<m_size = 16, n_size = 16, k_size = 16, a_type = f16, b_type = f16, c_type = f16, result_type = f16, acc_sat = false, scope = <Subgroup>>, #spirv.coop_matrix_props_khr<m_size = 16, n_size = 16, k_size = 16, a_type = f16, b_type = f16, c_type = f32, result_type = f32, acc_sat = false, scope = <Subgroup>>]>>}>) {
    hal.executable.export public @main$async_dispatch_20_matmul_transpose_b_4096x512x512_f16xf16xf32 ordinal(0) layout(#hal.pipeline.layout<push_constants = 0, sets = [<0, bindings = [<0, storage_buffer, ReadOnly>, <1, storage_buffer, ReadOnly>, <2, storage_buffer>]>]>) attributes {hal.interface.bindings = [#hal.interface.binding<0, 0>, #hal.interface.binding<0, 1>, #hal.interface.binding<0, 2>]} {
    ^bb0(%arg0: !hal.device):
      %x, %y, %z = flow.dispatch.workgroup_count_from_slice 
      hal.return %x, %y, %z : index, index, index
    }
    builtin.module {
      func.func @main$async_dispatch_20_matmul_transpose_b_4096x512x512_f16xf16xf32() attributes {translation_info = #iree_codegen.translation_info<SPIRVCooperativeMatrixVectorize workgroup_size = [64, 2, 1] subgroup_size = 32, {pipeline_depth = 1 : i64, store_stage = 0 : i64}>} {
        %cst = arith.constant 0.000000e+00 : f32
        %c128 = arith.constant 128 : index
        %c86398720 = arith.constant 86398720 : index
        %c86397696 = arith.constant 86397696 : index
        %c16877632 = arith.constant 16877632 : index
        %0 = hal.interface.binding.subspan set(0) binding(0) type(storage_buffer) alignment(64) offset(%c128) flags(ReadOnly) : !flow.dispatch.tensor<readonly:tensor<4096x512xf16>>
        %1 = hal.interface.binding.subspan set(0) binding(1) type(storage_buffer) alignment(64) offset(%c86398720) flags(ReadOnly) : !flow.dispatch.tensor<readonly:tensor<512x512xf16>>
        %2 = hal.interface.binding.subspan set(0) binding(1) type(storage_buffer) alignment(64) offset(%c86397696) flags(ReadOnly) : !flow.dispatch.tensor<readonly:tensor<512xf16>>
        %3 = hal.interface.binding.subspan set(0) binding(2) type(storage_buffer) alignment(64) offset(%c16877632) : !flow.dispatch.tensor<writeonly:tensor<4096x512xf16>>
        %4 = flow.dispatch.tensor.load %0, offsets = [0, 0], sizes = [4096, 512], strides = [1, 1] : !flow.dispatch.tensor<readonly:tensor<4096x512xf16>> -> tensor<4096x512xf16>
        %5 = flow.dispatch.tensor.load %1, offsets = [0, 0], sizes = [512, 512], strides = [1, 1] : !flow.dispatch.tensor<readonly:tensor<512x512xf16>> -> tensor<512x512xf16>
        %6 = flow.dispatch.tensor.load %2, offsets = [0], sizes = [512], strides = [1] : !flow.dispatch.tensor<readonly:tensor<512xf16>> -> tensor<512xf16>
        %7 = tensor.empty() : tensor<4096x512xf16>
        %8 = tensor.empty() : tensor<4096x512xf32>
        %9 = linalg.fill ins(%cst : f32) outs(%8 : tensor<4096x512xf32>) -> tensor<4096x512xf32>
        %10 = linalg.generic {indexing_maps = [affine_map<(d0, d1, d2) -> (d0, d2)>, affine_map<(d0, d1, d2) -> (d1, d2)>, affine_map<(d0, d1, d2) -> (d0, d1)>], iterator_types = ["parallel", "parallel", "reduction"]} ins(%4, %5 : tensor<4096x512xf16>, tensor<512x512xf16>) outs(%9 : tensor<4096x512xf32>) attrs =  {lowering_config = #iree_codegen.lowering_config<tile_sizes = [[64, 128], [32, 64], [0, 0, 32], [16, 16, 16]]>} {
        ^bb0(%in: f16, %in_0: f16, %out: f32):
          %12 = arith.extf %in : f16 to f32
          %13 = arith.extf %in_0 : f16 to f32
          %14 = arith.mulf %12, %13 : f32
          %15 = arith.addf %out, %14 : f32
          linalg.yield %15 : f32
        } -> tensor<4096x512xf32>
        %11 = linalg.generic {indexing_maps = [affine_map<(d0, d1) -> (d0, d1)>, affine_map<(d0, d1) -> (d1)>, affine_map<(d0, d1) -> (d0, d1)>], iterator_types = ["parallel", "parallel"]} ins(%10, %6 : tensor<4096x512xf32>, tensor<512xf16>) outs(%7 : tensor<4096x512xf16>) {
        ^bb0(%in: f32, %in_0: f16, %out: f16):
          %12 = arith.truncf %in : f32 to f16
          %13 = arith.addf %12, %in_0 : f16
          linalg.yield %13 : f16
        } -> tensor<4096x512xf16>
        flow.dispatch.tensor.store %11, %3, offsets = [0, 0], sizes = [4096, 512], strides = [1, 1] : tensor<4096x512xf16> -> !flow.dispatch.tensor<writeonly:tensor<4096x512xf16>>
        return
      }
    }
  }
}
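
Concretely, that repro would look roughly like the following, assuming the dispatch above is saved to its own .mlir file (the file names here are only placeholders):

iree-compile --compile-from=executable-configurations main_dispatch_20.mlir -o main_dispatch_20.vmfb
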
antiagainst commented 3 months ago

It seems the issue is that inferFragType does not see through arith.extf ops, so the matmul operands all end up inferred as "COp" fragments (hence the operand-order error above).

antiagainst commented 3 months ago

https://github.com/llvm/llvm-project/pull/91988 fixes the compilation issue so this compiles for rdna3.

antiagainst commented 3 months ago

https://github.com/llvm/llvm-project/pull/91988 has landed. We just need an LLVM integration to pull it in: https://github.com/iree-org/iree/pull/17380