microsoft / onnxruntime

ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator
https://onnxruntime.ai
MIT License

onnxruntime-directml causes TDR #20094

Open Jay19751103 opened 3 months ago

Jay19751103 commented 3 months ago

Describe the issue

When VRAM is fully occupied, the system becomes slow. Making the following change, which forces work onto the compute queue, avoids the TDR (Timeout Detection and Recovery) problem:

    auto is_feature_level_1_0_core = (feature_levels.MaxSupportedFeatureLevel == D3D_FEATURE_LEVEL_1_0_CORE);
    if (is_feature_level_1_0_core) {
        return D3D12_COMMAND_LIST_TYPE_COMPUTE;
    }
    return D3D12_COMMAND_LIST_TYPE_COMPUTE;
    // return D3D12_COMMAND_LIST_TYPE_DIRECT;

According to the documentation, it looks like MCDM devices use the compute queue only, and the specified flag D3D12_COMMAND_QUEUE_FLAG_DISABLE_GPU_TIMEOUT does not work. Could we change it to use the compute queue?

The following is the description from the Microsoft documentation:

D3D_FEATURE_LEVEL_1_0_CORE
Value: 0x1000
Allows Microsoft Compute Driver Model (MCDM) devices to be used, or more feature-rich devices (such as traditional GPUs) that support a superset of the functionality. MCDM is the overall driver model for compute-only; it's a scaled-down peer of the larger scoped Windows Device Driver Model (WDDM).
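As a rough sketch of how the requested change could be made optional rather than hard-coded, the selection logic might look like the following. It uses stand-in enums so it compiles without the Windows SDK; the numeric values mirror `D3D_FEATURE_LEVEL` and `D3D12_COMMAND_LIST_TYPE`, but `ChooseCommandListType` and its `prefer_compute` parameter are hypothetical names for illustration and are not part of onnxruntime.

```cpp
#include <cstdint>

// Stand-in enums so this sketch builds without d3d12.h.
// Values mirror the Windows SDK definitions.
enum FeatureLevel : std::uint32_t {
    kFeatureLevel_1_0_Core = 0x1000,  // D3D_FEATURE_LEVEL_1_0_CORE (MCDM devices)
    kFeatureLevel_11_0 = 0xb000,      // D3D_FEATURE_LEVEL_11_0
    kFeatureLevel_12_0 = 0xc000,      // D3D_FEATURE_LEVEL_12_0
};

enum CommandListType : std::uint32_t {
    kDirect = 0,   // D3D12_COMMAND_LIST_TYPE_DIRECT
    kCompute = 2,  // D3D12_COMMAND_LIST_TYPE_COMPUTE
};

// Hypothetical helper (not onnxruntime code): compute-only devices always
// get a compute queue; other devices get one only when the caller asks.
inline CommandListType ChooseCommandListType(FeatureLevel max_supported_level,
                                             bool prefer_compute) {
    const bool compute_only = (max_supported_level == kFeatureLevel_1_0_Core);
    return (compute_only || prefer_compute) ? kCompute : kDirect;
}
```

With `prefer_compute` set to `true` this behaves like the patched snippet above; with `false` it keeps the original direct-queue behavior for full GPUs.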

To reproduce

Install Stable Diffusion with SDXL. It currently has a bug where the second run triggers a recompile, which fills system VRAM. When the second generation runs the vae-decoder, a TDR occurs.

Urgency

No response

Platform

Windows

OS Version

Windows 11

ONNX Runtime Installation

Built from Source

ONNX Runtime Version or Commit ID

9e19684944adfda4a414fc91a67259894fce2898

ONNX Runtime API

Python

Architecture

X64

Execution Provider

DirectML

Execution Provider Library Version

No response

peterer0625 commented 3 months ago

The problem here is that if a GPU task on the graphics (gfx) queue runs for more than 2 s, it triggers a TDR. I believe onnxruntime cannot guarantee that tasks finish in under 2 s on every GPU, so choosing the gfx queue is not a good idea, especially for iGPUs. As far as we know, the flag that disables TDR only works for the compute queue. So I think this decision should be revisited.
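To make that reasoning concrete, here is a small model of it, not real D3D12 code. It assumes, as stated in this thread, that the timeout-disable flag is honored only on compute queues, and it uses the Windows default TDR delay of 2 seconds. `QueueDesc`, `MakeQueueDesc`, and `MayTriggerTdr` are hypothetical names; the enum values mirror `D3D12_COMMAND_LIST_TYPE` and `D3D12_COMMAND_QUEUE_FLAGS`.

```cpp
#include <cstdint>

// Stand-in enums mirroring the Windows SDK values.
enum CommandListType : std::uint32_t {
    kDirect = 0,   // D3D12_COMMAND_LIST_TYPE_DIRECT
    kCompute = 2,  // D3D12_COMMAND_LIST_TYPE_COMPUTE
};

enum CommandQueueFlags : std::uint32_t {
    kQueueFlagNone = 0x0,               // D3D12_COMMAND_QUEUE_FLAG_NONE
    kQueueFlagDisableGpuTimeout = 0x1,  // D3D12_COMMAND_QUEUE_FLAG_DISABLE_GPU_TIMEOUT
};

// Simplified stand-in for D3D12_COMMAND_QUEUE_DESC (hypothetical).
struct QueueDesc {
    CommandListType type;
    CommandQueueFlags flags;
};

// Default Windows TDR delay in seconds.
constexpr double kDefaultTdrDelaySeconds = 2.0;

// Assumption from this thread: the flag only takes effect on compute queues,
// so it is only set there.
inline QueueDesc MakeQueueDesc(CommandListType type, bool allow_long_running) {
    const bool disable_timeout = allow_long_running && (type == kCompute);
    return QueueDesc{type, disable_timeout ? kQueueFlagDisableGpuTimeout
                                           : kQueueFlagNone};
}

// Very rough model: work longer than the TDR delay is at risk unless the
// queue was created with the timeout disabled.
inline bool MayTriggerTdr(const QueueDesc& desc, double estimated_seconds) {
    const bool timeout_disabled = (desc.flags & kQueueFlagDisableGpuTimeout) != 0;
    return !timeout_disabled && estimated_seconds > kDefaultTdrDelaySeconds;
}
```

Under these assumptions, a 3-second vae-decoder dispatch is flagged as a TDR risk on the direct queue but not on a compute queue created with the timeout disabled.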

github-actions[bot] commented 1 month ago

This issue has been automatically marked as stale due to inactivity and will be closed in 30 days if no further activity occurs. If further support is needed, please provide an update and/or more details.