Individual dispatches all seem to run OK (so it may not be a codegen issue, then?). CC @benvanik in case he has ideas on the runtime side too.
Another thing I noticed: if I remove `iree-preprocessing-convert-conv2d-to-img2col` from the preprocessing pipeline, the models run without problems.
@yzhang93 please update the issue with the full error (I can't update the first comment)
It errors with something like:

```
RuntimeError: Error invoking function: <vm>:0: OK; failed to wait on timepoint;
[ 0] bytecode module.forward:134518 [
```
@PhaneeshB
Removing "needs-triage" since this has both priority and assignee.
@benvanik: Looking into this with some printf debugging in VM, I find something suspicious:
```
[module.forward+00000000] <block>
[module.forward+00000001] %i0 = vm.const.i32 30720 // 0x00007800
// Lots of VM dispatch happening, till the end...
[module.forward+00020348] %i0 = vm.call.varadic @hal.fence.await(%i45(4294967295), %r1(!hal.fence/0x0x55b33dba66b0))
[module.forward+0002035E] vm.cond_br %i0(0), ^000203A0(), ^0002036E()
[vm] current condition: 30720
[vm] true pc: 132000
[vm] false pc: 131950
[module.forward+000203A1] vm.fail %i0(0), "failed to wait on timepoint"
```
The call to `@hal.fence.await` succeeds with `rets->i0 = 0`. But the following `vm.cond_br` is actually not seeing `%i0` as 0; rather, it's reading the previous `%i0` definition?

Here is the full vm-dispatch.log, captured with `-DCMAKE_C_FLAGS=-DIREE_VM_EXECUTION_TRACING_FORCE_ENABLE=1`.
hmm, this may be a bug, though I'm not quite sure how at first glance. spoooky.
```
[module.forward+00020348] %i0 = vm.call.varadic @hal.fence.await(%i45(4294967295), %r1(!hal.fence/0x0x561f11703920))
[module.forward+0002035E] vm.cond_br %i0(0), ^000203A0(), ^0002036E()
[module.forward+000203A1] vm.fail %i0(0), "failed to wait on timepoint"
```
The VM definitely knows that it was a zero result, as `%i0(0)` is printed in both and that is just reflecting the same value used to execute the condition op.
You could try printing out the address of `regs->i32[condition_reg]` here: https://github.com/openxla/iree/blob/main/runtime/src/iree/vm/bytecode/disassembler.c#L1563 and of `regs_i32[OP_I16(0)]` here (first line, before anything else): https://github.com/openxla/iree/blob/main/runtime/src/iree/vm/bytecode/dispatch.c#L1593
If they differ then I have a suspect, if they're the same then that's terrifying.
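A hedged sketch of what those two prints might look like (illustrative only; the surrounding code in disassembler.c / dispatch.c is elided, and the exact insertion points are the lines linked above):

```c
// In disassembler.c, where the cond_br condition register is disassembled:
fprintf(stderr, "[dis] regs->i32[condition_reg] addr=%p\n",
        (void*)&regs->i32[condition_reg]);

// In dispatch.c, first line of the cond_br case, before anything else:
fprintf(stderr, "[vm] condition (regs_i32[OP_I16(0)]) addr: %p\n",
        (void*)&regs_i32[OP_I16(0)]);
```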
Yes they are different:

```
[dis] regs->i32[condition_reg] addr=0x557eaf28ea70
[module.forward+0002035E] vm.cond_br %i0(0), ^000203A0(), ^0002036E()
[vm] condition (regs_i32[OP_I16(0)]) addr: 0x7ffc8345ff38
[vm] current condition: 30720
[vm] true pc: 132000
[vm] false pc: 131950
```
@benvanik you said you have a suspect in this case?
The 0x7ffc8345ff38 address is sketchy - that's unlike all the other addresses - I'm guessing that's a pointer into the stack or some executable page, as all other heap pointers in that dump are 0x5.... - would be worth verifying it's printing the address of the register and not of the local regs_i32 variable or something.
```
[module.forward+00020348] %i0 = vm.call.varadic @hal.fence.await(%i45(4294967295), %r1(!hal.fence/0x0x561f11703920))
[module.forward+0002035E] vm.cond_br %i0(0), ^000203A0(), ^0002036E()
[module.forward+000203A1] vm.fail %i0(0), "failed to wait on timepoint"
EXEC @forward
<vm>:0: OK; failed to wait on timepoint;
```
This indicates that the VM sees the call return 0 and stores that in `i0`, and that it knows it's zero all the way down - that `<vm>:0: OK` is because the `vm.fail` gets the 0 (`IREE_STATUS_OK`) and is able to dump that out.
So something is causing just one read of `int32_t condition = VM_DecOperandRegI32("condition");` to be non-zero, even though all subsequent reads (the disassembly, the `vm.fail`, etc.) all see 0. I can't immediately explain that without it being something like memory corruption. Running with ASAN may show something.
With the latest python package (iree-compiler 20230328.472), I'm now getting this new error on Unet.
```
ValueError: Error invoking function: main_checkout/runtime/src/iree/vm/ref.h:181: INVALID_ARGUMENT; ref is null; while invoking native function hal.buffer.assert; while calling import;
[ 1] native hal.buffer.assert:0 -
[ 0] bytecode module.forward:8994 [unknown]
Aborted (core dumped)
```
Not sure if it's related to the old error, but it shows up as a different message.
@benvanik: ASAN just shows a bunch of allocation leaks; it doesn't show what's behind this issue. With more debugging, I have more findings:
We record `regs.i32` and `regs.ref` in local variables (`regs_i32`/`regs_ref`). Then later we can actually modify `regs` when doing `call` or `call.variadic`. So `regs` can change, but the local variables `regs_i32`/`regs_ref` still point to the old values; that causes an inconsistency with `regs.i32`/`regs.ref`. (Though it looks to me like this kind of inconsistency is expected? We have explicit `regs_i32`/`regs_ref` management in `call` and `return`.)

The reason the disassembler is fine is that we always pass `regs` directly to the disassembler, so it's getting the updated value.
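To make that concrete, here's a minimal, self-contained C sketch of the stale cached-pointer pattern (not IREE's actual code; `frame_regs_t`, `grow_stack`, and the buffer names are made up for illustration):

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

// Stands in for the VM's register file; the storage it points at can move
// when a call grows the VM stack.
typedef struct {
  int32_t* i32;
} frame_regs_t;

static int32_t small_frame[4];
static int32_t large_frame[8];

// Stands in for call/call.variadic growing the stack: the register storage
// moves, so previously cached interior pointers keep referring to the old
// location. (Static buffers instead of realloc keep the sketch free of
// undefined behavior.)
static void grow_stack(frame_regs_t* regs) {
  memcpy(large_frame, small_frame, sizeof(small_frame));
  regs->i32 = large_frame;
}

int main(void) {
  frame_regs_t regs = {small_frame};
  regs.i32[0] = 30720;             // an old %i0 definition (0x7800)
  int32_t* cached_i32 = regs.i32;  // like regs_i32 in the dispatch loop

  grow_stack(&regs);               // the import call grows the stack...
  regs.i32[0] = 0;                 // ...and writes rets->i0 = 0 (OK)

  // vm.cond_br reads through the stale cached pointer and sees the old
  // value, while the disassembler (passed regs directly) sees 0:
  printf("fresh read:  regs.i32[0]   = %d\n", regs.i32[0]);    // prints 0
  printf("stale read:  cached_i32[0] = %d\n", cached_i32[0]);  // prints 30720
  return 0;
}
```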
@benvanik: Any more thoughts?
Updating `regs_i32` and `regs_ref` from `regs.i32` and `regs.ref` at the end of `call.variadic` makes the timepoint waiting error go away; but I'm seeing NaN in the result, so I'm not sure this is the proper fix.
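In terms of the hypothetical sketch above, that workaround amounts to re-deriving the cached pointers after any call that may have moved the stack:

```c
grow_stack(&regs);      // call / call.variadic may move register storage
cached_i32 = regs.i32;  // refresh the cached pointer afterwards
```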
interesttinggggggg that's probably it - prior to some relatively recent changes (9 months ago... ouch) imports could never manipulate the stack - now they can, which can invalidate those pointers. I'll send a change in a second.
Sucks that it didn't fix the actual issue, though.
Confirmed that was the case and have #12809 to fix it. Thanks so much @antiagainst for triaging that!
@dan-garvey can you please take this over to verify whether the problem is resolved after Ben's fix? Lei seemed to notice NaNs after the fix, but let's verify on ToM IREE.
I also hit NaNs after this fix on Vulkan. I'll try with CUDA tomorrow.
I am hitting the same error @yzhang93 is hitting with the latest fixes:

```
ValueError: Error invoking function: C:\g\s\runtime\src\iree/vm/ref.h:184: INVALID_ARGUMENT; ref is null; while invoking native function hal.buffer.assert; while calling import;
[ 1] native hal.buffer.assert:0 -
[ 0] bytecode module.forward:9000 [unknown]
```
@antiagainst / @benvanik please advise whether this seems related to the vmvx changes or is something else we should be chasing.
@dan-garvey: Thanks. Cross-checking CUDA/Vulkan would be useful to help identify the issue. Also, if you can identify the first dispatch producing NaN, that'd be nice.
@powderluv: Looks different. We should have separate tracking for it. (Though I'm confused that @dan-garvey reports NaN but you are seeing runtime errors? Are we talking about the same model/target/etc.?)
I just tried with the latest IREE nightly, and the errors are gone on both rdna3 and A100 systems. I didn't see NaN and the generated image looks correct.
Okay, I don't know what has changed to cause everything to pass, but still, it's great! :) I'll close this issue for now. Please feel free to open new ones.
What happened?
Unet failed on some AMD rdna2/rdna3 and Nvidia A100 devices with an error (see the full error message in the comments). However, all the dumped dispatches ran without any problem.
Version information
```
iree-compile --iree-input-type=none --iree-hal-target-backends=vulkan --iree-stream-resource-index-bits=64 --iree-vm-target-index-bits=64 --iree-util-zero-fill-elided-attrs -iree-vulkan-target-triple=rdna2-7900-linux --iree-preprocessing-pass-pipeline='builtin.module(func.func(iree-flow-detach-elementwise-from-named-ops,iree-flow-convert-1x1-filter-conv2d-to-matmul,iree-preprocessing-convert-conv2d-to-img2col,iree-preprocessing-pad-linalg-ops{pad-size=32}))' unet_1_64_512_512_fp16_stable-diffusion-2-1-base_vulkan/unet_1_64_512_512_fp16_stable-diffusion-2-1-base_vulkan_torch.mlir -o unet.vmfb
iree-benchmark-module --module=unet.vmfb --function=forward --device=vulkan --input=1x4x64x64xf16 --input=1xf16 --input=2x64x1024xf16 --input=f32=1.0
```
Additional context
Also failed on A100; the commands for the CUDA path are as follows:

```
iree-compile --iree-input-type=none --iree-vm-bytecode-module-output-format=flatbuffer-binary --iree-hal-target-backends=cuda --iree-llvmcpu-target-cpu-features=host --iree-hal-cuda-disable-loop-nounroll-wa --iree-hal-cuda-llvm-target-arch=sm_80 --iree-stream-resource-index-bits=64 --iree-vm-target-index-bits=64 --iree-util-zero-fill-elided-attrs --iree-preprocessing-pass-pipeline='builtin.module(func.func(iree-flow-detach-elementwise-from-named-ops,iree-flow-convert-1x1-filter-conv2d-to-matmul,iree-preprocessing-convert-conv2d-to-img2col,iree-preprocessing-pad-linalg-ops{pad-size=32}))' unet_1_64_512_512_fp16_stable-diffusion-2-1-base_cuda/unet_1_64_512_512_fp16_stable-diffusion-2-1-base_cuda_torch.mlir -o unet.vmfb
iree-benchmark-module --module=unet.vmfb --device=cuda --function=forward --input=1x4x64x64xf16 --input=1xf16 --input=2x64x1024xf16 --input=f32=1.0
```