Open Jay19751103 opened 3 months ago
The problem here is that if a GPU task on the gfx queue runs for more than 2 s, it triggers TDR. I believe onnxruntime cannot guarantee that tasks finish in under 2 s on every GPU. Choosing the gfx queue is therefore not a good idea, especially for iGPUs. As far as we know, the disable-TDR flag only works for the compute queue. So I think this decision should be improved.
This issue has been automatically marked as stale due to inactivity and will be closed in 30 days if no further activity occurs. If further support is needed, please provide an update and/or more details.
Describe the issue
When VRAM is fully occupied, the system becomes slow. Changing the following code to force the compute queue avoids the TDR problem:
auto is_feature_level_1_0_core = (feature_levels.MaxSupportedFeatureLevel == D3D_FEATURE_LEVEL_1_0_CORE);
if (is_feature_level_1_0_core) {
    return D3D12_COMMAND_LIST_TYPE_COMPUTE;
}
return D3D12_COMMAND_LIST_TYPE_COMPUTE;
// return D3D12_COMMAND_LIST_TYPE_DIRECT;
Checking the documentation, it looks like MCDM devices use the compute queue only, and the specified flag D3D12_COMMAND_QUEUE_FLAG_DISABLE_GPU_TIMEOUT does not work. Could we change it to the compute queue?
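For discussion, here is a minimal sketch of how the selection could be made switchable instead of hard-coding the compute queue. It is not the onnxruntime implementation. The enums are local stand-ins for the D3D12 types so the snippet compiles without the Windows SDK, and the function name and the `prefer_compute_queue` switch are hypothetical, not an existing option.

```cpp
// Stand-ins for D3D_FEATURE_LEVEL and D3D12_COMMAND_LIST_TYPE (assumption:
// real code would use the definitions from <d3dcommon.h> and <d3d12.h>).
enum class FeatureLevel { Core_1_0, Level_11_0, Level_12_0 };
enum class CommandListType { Direct, Compute };

// Hypothetical selection helper: compute-only (MCDM-class) devices always get
// the compute queue; other devices get it only when the caller opts in.
constexpr CommandListType SelectCommandListType(FeatureLevel max_supported,
                                                bool prefer_compute_queue) {
  return (max_supported == FeatureLevel::Core_1_0 || prefer_compute_queue)
             ? CommandListType::Compute
             : CommandListType::Direct;
}
```

With `prefer_compute_queue` set to `false` this matches the existing behaviour (direct queue unless the device reports D3D_FEATURE_LEVEL_1_0_CORE). Setting it to `true` corresponds to the change proposed above.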
The following is the description from the Microsoft documentation:
D3D_FEATURE_LEVEL_1_0_CORE
Value: (0x1000)
Allows Microsoft Compute Driver Model (MCDM) devices to be used, or more feature-rich devices (such as traditional GPUs) that support a superset of the functionality. MCDM is the overall driver model for compute-only; it's a scaled-down peer of the larger scoped Windows Device Driver Model (WDDM).
To reproduce
Install Stable Diffusion with SDXL. It currently has a bug where the second run recompiles and fills system VRAM. When the second generation runs the VAE decoder, it triggers a TDR.
Urgency
No response
Platform
Windows
OS Version
Windows 11
ONNX Runtime Installation
Built from Source
ONNX Runtime Version or Commit ID
9e19684944adfda4a414fc91a67259894fce2898
ONNX Runtime API
Python
Architecture
X64
Execution Provider
DirectML
Execution Provider Library Version
No response