IanNod opened 3 months ago
Misc debugging tips: https://iree.dev/developers/debugging/model-development/
Could try with Tracy attached or with --trace_execution
to see where in the program it is hanging.
Tracy just gave me empty profiles with no statistics, and --trace_execution
with iree-benchmark-module doesn't provide any additional information. It still just hangs after printing out the CPU caches.
Do other backends (e.g. llvm-cpu --> local-sync or local-task) also hang?
rocm immediately errors out with
iree/experimental/rocm/status_util.c:31: INTERNAL; rocm driver error 'hipErrorInvalidConfiguration' (9): invalid configuration argument; hipModuleLaunchKernel; while invoking native function hal.command_buffer.dispatch; while calling import;
[ 2] native hal.command_buffer.dispatch:0 -
[ 1] bytecode compiled_vae.main$async:5258 <stdin>:512:12
[ 0] bytecode compiled_vae.main:62 <stdin>:142:3
Aborted (core dumped)
local-task at least looks to be executing on several threads but is very slow to complete if it isn't hanging.
Yeah, that's the problem: the compiler is generating something that isn't compatible with your device.
Any idea why the compiler would generate something incompatible at batch size 16, but work for pretty much any other batch size?
Something for the codegen folks - I'm guessing it's picking an invalid workgroup size, which is derived from the problem size, and 16 trips it over some limit.
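To make the "trips it over some limit" guess concrete, here is a minimal sketch (not IREE code; the per-batch thread count is a made-up illustration) of how a workgroup size derived from the problem size can cross the HIP launch limits, which cap a block at 1024 threads total with per-dimension maxima of (1024, 1024, 64). hipModuleLaunchKernel rejects anything larger with "invalid configuration argument" (hipErrorInvalidConfiguration):

```python
# HIP per-kernel-launch block limits (same as CUDA's defaults).
HIP_MAX_THREADS_PER_BLOCK = 1024
HIP_MAX_BLOCK_DIM = (1024, 1024, 64)

def workgroup_size_is_valid(x: int, y: int, z: int) -> bool:
    """Return True if (x, y, z) is a launchable HIP block size."""
    if x * y * z > HIP_MAX_THREADS_PER_BLOCK:
        return False
    return all(d <= m for d, m in zip((x, y, z), HIP_MAX_BLOCK_DIM))

# A workgroup size that scales with the problem size can silently cross
# the limit: e.g. 128 threads per batch element is launchable at batch 8
# (128 * 8 = 1024) but not at batch 16 (128 * 16 = 2048).
for batch in (8, 16):
    x = 128 * batch
    print(batch, (x, 1, 1), workgroup_size_is_valid(x, 1, 1))
```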
Note that the reason this hangs at runtime with HIP is that the HAL HIP driver isn't propagating the async error back to the fences (not placing blame - it's just not a code path that anything in the CTS tests, so it likely has bugs; I know for a fact it leaks, at least).
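The fence-propagation point above can be illustrated with a toy model (this is not the IREE HAL API, just a sketch of the failure mode): if the device-side path hits an async error but never signals the fence the host is waiting on, the wait blocks forever, which is exactly the observed "hang" rather than an error:

```python
import threading

class Fence:
    """Toy fence: waiters block until someone signals a status."""
    def __init__(self):
        self._event = threading.Event()
        self.status = None  # None until signaled; "ok" or an error string

    def signal(self, status):
        self.status = status
        self._event.set()

    def wait(self, timeout=None):
        # Returns the signaled status, or None on timeout
        # (the real-world equivalent is an indefinite hang).
        if not self._event.wait(timeout):
            return None
        return self.status

def device_thread(fence, propagate_errors):
    # Simulate an async kernel-launch failure on the device path.
    error = "hipErrorInvalidConfiguration"
    if propagate_errors:
        fence.signal(error)
    # Buggy path: error swallowed, fence never signaled, waiters stuck.

# Buggy path: host wait times out instead of seeing the error.
f = Fence()
threading.Thread(target=device_thread, args=(f, False)).start()
print(f.wait(timeout=0.1))

# Fixed path: the error reaches the fence and the waiter observes it.
f2 = Fence()
threading.Thread(target=device_thread, args=(f2, True)).start()
print(f2.wait(timeout=1.0))
```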
@IanNod Hi Ian, I am debugging this. Could you also provide attention_and_matmul_spec_mfma.mlir
(for compilation) and vae_decode_fp16.safetensors
(for execution), so I can replicate the error first?
Batch size 1 is also hanging, so I will first take a look at that particular size; it should be the same error. Ian and Sai have requested this as it is higher priority.
Summary of Bug Findings after dividing the model into dispatches:
Hanging Bug: One dispatch hangs on the HIP backend. When switching to the ROCm backend, it encounters the ROCm driver error "hipErrorInvalidConfiguration."
No Kernel Image Available for Execution on Device: This issue primarily affects the Conv2D kernel and occurs on both ROCm and HIP backends.
Attempted to Access Memory Beyond the Largest Legal Address: This error is present on both ROCm and HIP backends.
Out of Memory Allocation: There is an issue when trying to allocate a 135GB buffer. However, this problem is resolved when switching to the ROCm backend.
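For the out-of-memory finding, a back-of-envelope check shows how far out of range 135 GB is. The shapes below are assumptions based on a standard SDXL VAE decode at batch 16, 1024x1024, fp16 (latents 16x4x128x128 in, image 16x3x1024x1024 out); they are not taken from the model IR:

```python
def tensor_bytes(shape, dtype_bytes=2):  # fp16 = 2 bytes per element
    n = 1
    for d in shape:
        n *= d
    return n * dtype_bytes

latents = tensor_bytes((16, 4, 128, 128))    # assumed decoder input: 2 MiB
image = tensor_bytes((16, 3, 1024, 1024))    # assumed decoder output: 96 MiB

print(f"latents: {latents / 2**20:.1f} MiB")
print(f"image:   {image / 2**20:.1f} MiB")
```

Even generously sized intermediate activations would land orders of magnitude below 135 GB, which suggests a miscomputed buffer size rather than a genuinely huge allocation.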
What happened?
Compiled vmfb fails to execute at runtime. I see the memory allocated, but no actual GPU execution appears to happen; the runtime seems to just hang indefinitely. A similar model has been run with just the batch size changed (both smaller and larger batch sizes) without issue, but for some reason batch size 16 has this problem.
Steps to reproduce your issue
IR for model in question: https://sharkpublic.blob.core.windows.net/sharkpublic/ian/stable_diffusion_xl_base_1_0_bs16_1024x1024_fp16_vae_decode_decomp_.mlir
compiled via this command:
run via:
What component(s) does this issue relate to?
Runtime
Version information
9ffe4735c8ed54a622e40b9a16df37657c0417b4
Additional context
No response