iree-org / iree

A retargetable MLIR-based machine learning compiler and runtime toolkit.
http://iree.dev/
Apache License 2.0

Vulkan/CUDA runtime error: failed to wait on timepoint #12670

Closed: yzhang93 closed this issue 1 year ago

yzhang93 commented 1 year ago

What happened?

Unet failed on some AMD RDNA2/RDNA3 and NVIDIA A100 devices with this error:

<vm>:0: OK; failed to wait on timepoint; 
[ 0] bytecode module.forward:131934 [
    <eval_with_key>.9:5042:14,
    <eval_with_key>.9:5039:15,
    <eval_with_key>.9:5038:15,

However, all the dumped dispatches ran without any problem.

Steps to reproduce

  1. Download the latest Unet model.
  2. Compile command for Vulkan:
     iree-compile --iree-input-type=none --iree-hal-target-backends=vulkan --iree-stream-resource-index-bits=64 --iree-vm-target-index-bits=64 --iree-util-zero-fill-elided-attrs -iree-vulkan-target-triple=rdna2-7900-linux --iree-preprocessing-pass-pipeline='builtin.module(func.func(iree-flow-detach-elementwise-from-named-ops,iree-flow-convert-1x1-filter-conv2d-to-matmul,iree-preprocessing-convert-conv2d-to-img2col,iree-preprocessing-pad-linalg-ops{pad-size=32}))' unet_1_64_512_512_fp16_stable-diffusion-2-1-base_vulkan/unet_1_64_512_512_fp16_stable-diffusion-2-1-base_vulkan_torch.mlir -o unet.vmfb
  3. Benchmark command for Vulkan:
     iree-benchmark-module --module=unet.vmfb --function=forward --device=vulkan --input=1x4x64x64xf16 --input=1xf16 --input=2x64x1024xf16 --input=f32=1.0

Additional context

Also failed on A100; the commands for the CUDA path are as follows:

iree-compile --iree-input-type=none --iree-vm-bytecode-module-output-format=flatbuffer-binary --iree-hal-target-backends=cuda --iree-llvmcpu-target-cpu-features=host --iree-hal-cuda-disable-loop-nounroll-wa --iree-hal-cuda-llvm-target-arch=sm_80 --iree-stream-resource-index-bits=64 --iree-vm-target-index-bits=64 --iree-util-zero-fill-elided-attrs --iree-preprocessing-pass-pipeline='builtin.module(func.func(iree-flow-detach-elementwise-from-named-ops,iree-flow-convert-1x1-filter-conv2d-to-matmul,iree-preprocessing-convert-conv2d-to-img2col,iree-preprocessing-pad-linalg-ops{pad-size=32}))' unet_1_64_512_512_fp16_stable-diffusion-2-1-base_cuda/unet_1_64_512_512_fp16_stable-diffusion-2-1-base_cuda_torch.mlir -o unet.vmfb

iree-benchmark-module --module=unet.vmfb --device=cuda --function=forward --input=1x4x64x64xf16 --input=1xf16 --input=2x64x1024xf16 --input=f32=1.0

powderluv commented 1 year ago

Individual dispatches all seem to run OK (so it may not be a codegen issue?). CC @benvanik in case he has ideas on the runtime side too.

yzhang93 commented 1 year ago

Another thing I noticed: if I remove iree-preprocessing-convert-conv2d-to-img2col from the preprocessing pipeline, the models run without any problem.

powderluv commented 1 year ago

@yzhang93 please update the issue with the full error (I can't update the first comment).

It errors with something like

RuntimeError: Error invoking function: <vm>:0: OK; failed to wait on timepoint;
[ 0] bytecode module.forward:134518 [
powderluv commented 1 year ago

@PhaneeshB

aaron-schneider commented 1 year ago

Removing "needs-triage" since this has both priority and assignee.

antiagainst commented 1 year ago

@benvanik: Looking into this with some printf debugging in the VM, I found something suspicious:

[module.forward+00000000]    <block>
[module.forward+00000001]    %i0 = vm.const.i32 30720  // 0x00007800

// Lots of VM dispatch happening, till the end...

[module.forward+00020348]    %i0 = vm.call.varadic @hal.fence.await(%i45(4294967295), %r1(!hal.fence/0x0x55b33dba66b0))
[module.forward+0002035E]    vm.cond_br %i0(0), ^000203A0(), ^0002036E()
[vm]  current condition: 30720
[vm]  true pc: 132000
[vm]  false pc: 131950
[module.forward+000203A1]    vm.fail %i0(0), "failed to wait on timepoint"

The call to @hal.fence.await succeeds with rets->i0 = 0. The following vm.cond_br, however, is not actually seeing %i0 as 0; rather, it seems to be reading the previous %i0 definition?

antiagainst commented 1 year ago

Here is the full vm-dispatch.log with -DCMAKE_C_FLAGS=-DIREE_VM_EXECUTION_TRACING_FORCE_ENABLE=1.

benvanik commented 1 year ago

hmm, this may be a bug, though I'm not quite sure how at first glance. spoooky.

[module.forward+00020348]    %i0 = vm.call.varadic @hal.fence.await(%i45(4294967295), %r1(!hal.fence/0x0x561f11703920))
[module.forward+0002035E]    vm.cond_br %i0(0), ^000203A0(), ^0002036E()
[module.forward+000203A1]    vm.fail %i0(0), "failed to wait on timepoint"

The VM definitely knows it was a zero result, as %i0(0) is printed in both lines, and that just reflects the same value used to execute the condition op.

You could try printing out the address of regs->i32[condition_reg] here: https://github.com/openxla/iree/blob/main/runtime/src/iree/vm/bytecode/disassembler.c#L1563 and regs_i32[OP_I16(0)] here (first line, before anything else): https://github.com/openxla/iree/blob/main/runtime/src/iree/vm/bytecode/dispatch.c#L1593

If they differ then I have a suspect, if they're the same then that's terrifying.
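
For reference, such a check might look roughly like the lines below (a hypothetical sketch, not the actual IREE source; it assumes <stdio.h> is available and that regs, condition_reg, regs_i32, and OP_I16 are in scope at the two linked locations). If the two addresses printed for the same vm.cond_br differ, the dispatcher is reading the condition from different storage than the disassembler.

/* In disassembler.c, inside the cond_br case (hypothetical placement): */
fprintf(stderr, "[dis]  regs->i32[condition_reg] addr=%p\n",
        (void*)&regs->i32[condition_reg]);

/* In dispatch.c, as the first line of the cond_br handler (hypothetical placement): */
fprintf(stderr, "[vm]  condition (regs_i32[OP_I16(0)]) addr: %p\n",
        (void*)&regs_i32[OP_I16(0)]);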

antiagainst commented 1 year ago

Yes they are different:

[dis]  regs->i32[condition_reg] addr=0x557eaf28ea70
[module.forward+0002035E]    vm.cond_br %i0(0), ^000203A0(), ^0002036E()
[vm]  condition (regs_i32[OP_I16(0)]) addr: 0x7ffc8345ff38
[vm]  current condition: 30720
[vm]  true pc: 132000
[vm]  false pc: 131950
dan-garvey commented 1 year ago

@benvanik you said you have a suspect in this case?

benvanik commented 1 year ago

The 0x7ffc8345ff38 address is sketchy - it's unlike all the other addresses. I'm guessing that's a pointer into the stack or some executable page, as all the other heap pointers in that dump are 0x5.... It would be worth verifying it's printing the address of the register and not of the local regs_i32 variable or something.

[module.forward+00020348]    %i0 = vm.call.varadic @hal.fence.await(%i45(4294967295), %r1(!hal.fence/0x0x561f11703920))
[module.forward+0002035E]    vm.cond_br %i0(0), ^000203A0(), ^0002036E()
[module.forward+000203A1]    vm.fail %i0(0), "failed to wait on timepoint"
EXEC @forward
<vm>:0: OK; failed to wait on timepoint; 

This indicates that the VM sees the call return 0, stores that in %i0, and knows it's zero all the way down - the <vm>:0: OK is because vm.fail gets the 0 (IREE_STATUS_OK) and is able to dump that out.

So something is causing just one read of int32_t condition = VM_DecOperandRegI32("condition"); to be non-zero, even though all subsequent reads (the disassembly, the vm.fail, etc) all see 0. I can't immediately explain that without it being something like memory corruption. Running with ASAN may show something.

yzhang93 commented 1 year ago

With the latest python package (iree-compiler 20230328.472), I'm now getting this new error on Unet.

ValueError: Error invoking function: main_checkout/runtime/src/iree/vm/ref.h:181: INVALID_ARGUMENT; ref is null; while invoking native function hal.buffer.assert; while calling import; 
[ 1]   native hal.buffer.assert:0 -
[ 0] bytecode module.forward:8994 [unknown]
Aborted (core dumped)

Not sure if it's related to the old error, but it shows up as a different message.

antiagainst commented 1 year ago

@benvanik: ASan just shows a bunch of allocation leaks; it doesn't reveal what's behind this. With more debugging, I have more findings:

We record regs.i32 and regs.ref using local variables here:

https://github.com/openxla/iree/blob/0635b096ce714bdd94d829a4909db4e1cba42b6d/runtime/src/iree/vm/bytecode/dispatch.c#L679-L682

Then later we can actually modify regs when doing call or call.variadic:

https://github.com/openxla/iree/blob/0635b096ce714bdd94d829a4909db4e1cba42b6d/runtime/src/iree/vm/bytecode/dispatch.c#L1671-L1674

So regs can change, but the local variables regs_i32 / regs_ref still point to the old values, which causes an inconsistency with regs.i32/regs.ref. (Though it looks to me like this kind of inconsistency is expected, since we have explicit regs_i32/regs_ref management in call and return.)

The reason the disassembler looks correct is that we always pass regs directly to the disassembler, so it's getting the updated value:

https://github.com/openxla/iree/blob/0635b096ce714bdd94d829a4909db4e1cba42b6d/runtime/src/iree/vm/bytecode/dispatch_util.h#L109-L113
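
To make the failure mode concrete, here is a small standalone C sketch (toy types and names, not IREE code) of a dispatcher caching a raw pointer into the i32 register storage before a nested call relocates that storage; the stale pointer then yields the earlier constant 30720 seen in the trace above instead of the 0 returned by hal.fence.await:

#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Toy stand-in for the VM's register struct. */
typedef struct {
  int32_t* i32;
} registers_t;

int main(void) {
  int32_t old_storage[8] = {30720};  /* %i0 still holds the earlier vm.const.i32 */
  int32_t new_storage[8];
  registers_t regs = {old_storage};

  int32_t* regs_i32 = regs.i32;  /* cached local, as in dispatch.c */

  /* A call/call.variadic grows or moves the VM stack: registers relocate. */
  memcpy(new_storage, old_storage, sizeof(old_storage));
  regs.i32 = new_storage;

  /* The import's result (0, i.e. "wait succeeded") lands in the new storage. */
  regs.i32[0] = 0;

  printf("via regs.i32 : %d (what the disassembler prints)\n", (int)regs.i32[0]);
  printf("via regs_i32 : %d (what the cond_br actually branches on)\n", (int)regs_i32[0]);
  /* Refreshing the cache after the call (regs_i32 = regs.i32;) avoids the mismatch. */
  return 0;
}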

@benvanik: Any more thoughts?

antiagainst commented 1 year ago

Updating regs_i32 and regs_ref to regs.i32 and regs.ref at the end of call.variadic makes the timepoint-waiting error go away, but I'm seeing NaNs in the result, so I'm not sure this is the proper fix.
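
Concretely, that workaround amounts to something like the following at the end of the call.variadic handling (a sketch of the idea only, not the actual diff in dispatch.c):

/* After the import call returns, the VM stack may have grown or moved, so
   refresh the cached locals from the authoritative register storage: */
regs_i32 = regs.i32;
regs_ref = regs.ref;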

benvanik commented 1 year ago

interesttinggggggg that's probably it - prior to some relatively recent changes (9 months ago... ouch) imports could never manipulate the stack - now they can, which can invalidate those pointers. I'll send a change in a second.

Sucks that it didn't fix the actual issue, though.

benvanik commented 1 year ago

Confirmed that was the case and have #12809 to fix it. Thanks so much @antiagainst for triaging that!

powderluv commented 1 year ago

@dan-garvey can you please take this over and verify whether Ben's fix resolves the problem? Lei seemed to notice NaNs after the fix, but let's verify on ToM IREE.

dan-garvey commented 1 year ago

I also hit NaNs after this fix on Vulkan. I'll try with CUDA tomorrow.

powderluv commented 1 year ago

I am hitting the same error @yzhang93 is hitting with the latest fixes:

ValueError: Error invoking function: C:\g\s\runtime\src\iree/vm/ref.h:184: INVALID_ARGUMENT; ref is null; while invoking native function hal.buffer.assert; while calling import;
[ 1]   native hal.buffer.assert:0 -
[ 0] bytecode module.forward:9000 [unknown]

@antiagainst / @benvanik please advise whether that seems related to the vmvx changes or whether it's something else we should be chasing.

antiagainst commented 1 year ago

@dan-garvey: Thanks. Cross-checking CUDA/Vulkan would be useful to help identify the issue. Also, if you can identify the first dispatch producing NaNs, that would be nice.

@powderluv: That looks different; we should track it separately. (Though I'm confused that @dan-garvey reports NaNs while you are seeing runtime errors? Are we talking about the same model/target/etc.?)

yzhang93 commented 1 year ago

I just tried with the latest IREE nightly, and the errors are gone on both the RDNA3 and A100 systems. I didn't see NaNs, and the generated image looks correct.

antiagainst commented 1 year ago

Okay, I don't know what changed to make everything pass, but still, it's great! :) I'll close this issue for now. Please feel free to open new ones.