iree-org / iree

A retargetable MLIR-based machine learning compiler and runtime toolkit.
http://iree.dev/

SDXL vae_decode model hangs and fails to execute at runtime #17834

Open · IanNod opened this issue 3 months ago

IanNod commented 3 months ago

What happened?

The compiled vmfb fails to execute at runtime. I can see the memory being allocated, but no actual GPU execution appears to happen and the runtime just hangs indefinitely. A similar model has been run with only the batch size changed (both smaller and larger batch sizes) without issue, but for some reason batch size 16 hits this problem.

Steps to reproduce your issue

IR for model in question: https://sharkpublic.blob.core.windows.net/sharkpublic/ian/stable_diffusion_xl_base_1_0_bs16_1024x1024_fp16_vae_decode_decomp_.mlir

compiled via this command:

./build-release/tools/iree-compile \
  --iree-hal-target-backends=rocm \
  --iree-rocm-target-chip=gfx942 \
  --iree-vm-bytecode-module-output-format=flatbuffer-binary \
  --iree-rocm-waves-per-eu=2 \
  --iree-flow-enable-aggressive-fusion \
  --iree-codegen-llvmgpu-use-vector-distribution=true \
  --iree-global-opt-propagate-transposes=true \
  --iree-opt-const-eval=false \
  --iree-opt-outer-dim-concat=true \
  --iree-vm-target-truncate-unsupported-floats \
  --iree-llvmgpu-enable-prefetch=true \
  --iree-opt-data-tiling=false \
  --iree-codegen-gpu-native-math-precision=true \
  --iree-flow-inline-constants-max-byte-length=1 \
  --iree-preprocessing-pass-pipeline="builtin.module(iree-preprocessing-transpose-convolution-pipeline, iree-global-opt-raise-special-ops, util.func(iree-preprocessing-pad-to-intrinsics))" \
  --iree-codegen-transform-dialect-library=/models/stable_diffusion_fp16/turbine_sdxl/attention_and_matmul_spec_mfma.mlir \
  /models/stable_diffusion_fp16/turbine_sdxl/stable_diffusion_xl_base_1_0_bs16_1024x1024_fp16_vae_decode_decomp_.mlir \
  -o vae.vmfb

run via:

./tools/iree-benchmark-module \
  --module=/models/stable_diffusion_fp16/turbine_sdxl/stable_diffusion_xl_base_1_0_bs16_1024x1024_fp16_vae_decode_decomp_gfx942.vmfb \
  --parameters=model=/models/stable_diffusion_fp16/vae_decode_fp16.safetensors \
  --device=hip \
  --function=main \
  --input=16x4x128x128xf16

What component(s) does this issue relate to?

Runtime

Version information

9ffe4735c8ed54a622e40b9a16df37657c0417b4

Additional context

No response

ScottTodd commented 3 months ago

Misc debugging tips: https://iree.dev/developers/debugging/model-development/

You could try running with Tracy attached or with --trace_execution to see where in the program it is hanging.
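
Something like the following (untested sketch; paths reuse the repro above, and the Tracy steps assume a tracing-enabled runtime build with iree-tracy-capture available):

# Print each VM instruction as it executes:
./tools/iree-benchmark-module \
  --module=vae.vmfb \
  --parameters=model=/models/stable_diffusion_fp16/vae_decode_fp16.safetensors \
  --device=hip \
  --function=main \
  --input=16x4x128x128xf16 \
  --trace_execution=true

# Capture a Tracy profile; TRACY_NO_EXIT=1 keeps the process alive
# until the capture tool has connected and drained the trace.
TRACY_NO_EXIT=1 ./tools/iree-benchmark-module --module=vae.vmfb ... &
./build-release/tools/iree-tracy-capture -o vae.tracy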

IanNod commented 3 months ago

Tracy just gave me empty profiles with no statistics. --trace_execution with iree-benchmark-module doesn't provide any additional information. Still just hangs after printing out the CPU caches.

ScottTodd commented 3 months ago

Do other backends (e.g. llvm-cpu --> local-sync or local-task) also hang?
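
For reference, a rough sketch of the CPU repro (untested; it drops the ROCm-specific flags from the compile command above, and vae_cpu.vmfb is just a placeholder name):

./build-release/tools/iree-compile \
  --iree-hal-target-backends=llvm-cpu \
  /models/stable_diffusion_fp16/turbine_sdxl/stable_diffusion_xl_base_1_0_bs16_1024x1024_fp16_vae_decode_decomp_.mlir \
  -o vae_cpu.vmfb

# local-sync runs inline on one thread; local-task uses the multithreaded task system.
./tools/iree-benchmark-module \
  --module=vae_cpu.vmfb \
  --parameters=model=/models/stable_diffusion_fp16/vae_decode_fp16.safetensors \
  --device=local-task \
  --function=main \
  --input=16x4x128x128xf16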

IanNod commented 3 months ago

rocm immediately errors out with

iree/experimental/rocm/status_util.c:31: INTERNAL; rocm driver error 'hipErrorInvalidConfiguration' (9): invalid configuration argument; hipModuleLaunchKernel; while invoking native function hal.command_buffer.dispatch; while calling import;
[ 2]   native hal.command_buffer.dispatch:0 -
[ 1] bytecode compiled_vae.main$async:5258 <stdin>:512:12
[ 0] bytecode compiled_vae.main:62 <stdin>:142:3
Aborted (core dumped)

local-task at least looks to be executing on several threads but is very slow to complete if it isn't hanging.

benvanik commented 3 months ago

yeah, that's the problem - the compiler is generating something that isn't compatible with your device

IanNod commented 3 months ago

Any idea why the compiler would generate something incompatible at batch size 16, but work for pretty much any other batch size?

benvanik commented 3 months ago

something for the codegen folks - I'm guessing it's picking an invalid workgroup size, which is derived from the problem size, and 16 trips it over some limit

note that the reason this hangs at runtime with hip is that the HAL hip driver isn't propagating the async error back to the fences (not placing blame - it's just not a code path that anything in the CTS exercises, so it likely has bugs - I know for a fact it leaks, at least)
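
One way to check the workgroup-size guess (hedged: the dump flag names may differ across IREE versions) is to dump the compiled dispatch sources and compare the workgroup sizes the compiler picked against the launch limits the driver reports for gfx942:

# Dump per-dispatch sources; look for the workgroup_size attribute on
# each hal.executable.export op.
./build-release/tools/iree-compile ... \
  --iree-hal-dump-executable-sources-to=dispatch_dump/ \
  -o vae.vmfb

# Launch limits as reported by ROCm (e.g. "Workgroup Max Size: 1024"):
rocminfo | grep -i 'workgroup max size'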

bangtianliu commented 3 months ago

@IanNod Hi Ian, I am debugging this. Could you also provide attention_and_matmul_spec_mfma.mlir (used for compilation) and vae_decode_fp16.safetensors (used for execution), so I can reproduce the error first?

bangtianliu commented 3 months ago

Batch size 1 is also hanging, so I will first take a look at that particular size; it should be the same error. Ian and Sai have requested this, as it is higher priority.

bangtianliu commented 3 months ago

Summary of Bug Findings after dividing the model into dispatches: