microsoft / Olive

Olive: Simplify ML Model Finetuning, Conversion, Quantization, and Optimization for CPUs, GPUs and NPUs.
https://microsoft.github.io/Olive/
MIT License

[Bug]: Optimization of UNet fails on 6950 XT #517

Open captroper opened 1 year ago

captroper commented 1 year ago

What happened?

This appears to be the same issue as #510 and #301, though I'm not certain. I ran the following commands:

I've attached the log as well as a DxDiag. The run errors out while optimizing the UNet, saying "failed to run olive on gpu-dml" followed by "887a0006 the gpu will not respond to more commands".

DxDiag.txt ErrorLog.txt

Version?

0.3.1

guotuofeng commented 1 year ago

The following error message seems to be related to the DirectML EP.

onnxruntime.capi.onnxruntime_pybind11_state.Fail: [ONNXRuntimeError] : 1 : FAIL : D:\a_work\1\s\onnxruntime\core\providers\dml\DmlExecutionProvider\src\ExecutionProvider.cpp(896)\onnxruntime_pybind11_state.pyd!00007FFE31C80201: (caller: 00007FFE31C80C2F) Exception(2) tid(3c14) 887A0006 The GPU will not respond to more commands, most likely because of an invalid command passed by the calling application.

guotuofeng commented 12 months ago

@jstoecker, do you have any insight?

guotuofeng commented 12 months ago

This seems similar to https://github.com/microsoft/Olive/issues/510

jstoecker commented 11 months ago

This is DXGI_ERROR_DEVICE_HUNG during inference/evaluation, which typically happens when some GPU work is taking excessively long. The recent AMD driver optimizations for stable diffusion / multi-head attention target the RDNA 3 architecture (e.g., the 7000 series, like the Radeon RX 7900 XTX) but not the RDNA 2 (6000 series). Still, we can try to repro this on an RDNA card to see if anything jumps out.
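To help narrow this down, here is a minimal sketch (not from this thread) that runs the optimized UNet once through ONNX Runtime with the DirectML EP, outside of Olive's evaluation loop, to see whether the device hang reproduces on plain inference. The model path and the dummy-input handling are assumptions; adjust them to your exported model.

```python
# Repro sketch (assumed paths/shapes): run the optimized UNet once on the
# DirectML execution provider to check for DXGI_ERROR_DEVICE_HUNG.
import numpy as np
import onnxruntime as ort

sess = ort.InferenceSession(
    "models/optimized/unet/model.onnx",   # assumed path to the optimized UNet
    providers=["DmlExecutionProvider"],   # DirectML EP
)

# Build zero-filled dummy inputs matching the model's declared shapes.
# Dynamic dimensions are replaced with 1, which may not match the UNet's
# real latent/sequence sizes; substitute your actual shapes if it complains.
dtype_map = {
    "tensor(float16)": np.float16,
    "tensor(float)": np.float32,
    "tensor(int64)": np.int64,
}
feeds = {}
for inp in sess.get_inputs():
    shape = [d if isinstance(d, int) else 1 for d in inp.shape]
    feeds[inp.name] = np.zeros(shape, dtype=dtype_map.get(inp.type, np.float32))

outputs = sess.run(None, feeds)
print("ran OK:", [o.shape for o in outputs])
```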

CellerX commented 11 months ago

The 6800 XT has the same error.

vibbix commented 9 months ago

Same error on my 6900 XT as well, on 0.4.0.

Jerry-zirui commented 2 months ago

The same error occurred on an AMD Ryzen 7 7840U w/ Radeon 780M Graphics. I increased the dedicated GPU memory as mentioned in #510, but the error still occurs.

Jay19751103 commented 1 month ago

The GPU queue does not disable TDR (see https://github.com/microsoft/onnxruntime/issues/20094). You can manually disable TDR by setting the TdrLevel registry key and testing again; see https://learn.microsoft.com/en-us/windows-hardware/drivers/display/tdr-registry-keys for how to set it. We have measured that a single command list can sometimes contain many jobs, whether converting the model or running it, on lower-end GPUs or with large models. Also remember to enlarge your virtual memory so you don't run out; 200 GB is better for SDXL.
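As a concrete illustration of the TdrLevel setting described in the docs linked above, here is a hedged sketch that writes the value with Python's `winreg` module. It assumes an elevated (administrator) Python process, takes effect only after a reboot, and disabling TDR is for testing only (a genuinely hung GPU will then freeze the display), so re-enable it afterwards.

```python
# Sketch only: set TdrLevel = 0 (TdrLevelOff) to disable GPU timeout detection
# for testing, per the TDR registry keys documentation linked above.
# Run from an elevated (administrator) Python process; reboot to apply.
import winreg

KEY_PATH = r"System\CurrentControlSet\Control\GraphicsDrivers"

with winreg.CreateKeyEx(
    winreg.HKEY_LOCAL_MACHINE, KEY_PATH, 0, winreg.KEY_SET_VALUE
) as key:
    winreg.SetValueEx(key, "TdrLevel", 0, winreg.REG_DWORD, 0)

print("TdrLevel set to 0; reboot for the change to take effect.")
```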