microsoft / onnxruntime

ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator
https://onnxruntime.ai
MIT License
14.77k stars 2.94k forks

[Build] Issues with Multithreading in the New Versions of onnxruntime-directml #22867

Open lianshiye0 opened 4 days ago

lianshiye0 commented 4 days ago

Describe the issue

Issue Description:

In versions 1.17.0 and earlier of onnxruntime-directml, loading an ONNX model onto an AMD GPU with onnxruntime.InferenceSession() creates a single model session. If the program uses multiple threads, those threads can compete for that session, leading to deadlocks and crashes. In these versions, implementing a queue mechanism to avoid the resource contention resolves the issue.

However, from version 1.18.0 onwards, none of these mechanisms (queueing, locks, or semaphores) limits the contention: the problem persists, and the program still deadlocks and crashes in a multithreaded environment.
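For reference, the queue mechanism that resolved the contention on 1.17.0 and earlier can be sketched as below. This is a minimal illustration, not the reporter's actual code: a single worker thread owns all calls to the session, and `run_fn` is a stand-in for `session.run` on a real `InferenceSession`.

```python
import queue
import threading


class SerializedRunner:
    """Serialize all inference calls through one worker thread so that
    only a single thread ever touches the underlying session."""

    def __init__(self, run_fn):
        self._run_fn = run_fn  # e.g. session.run on a real InferenceSession
        self._jobs = queue.Queue()
        self._worker = threading.Thread(target=self._loop, daemon=True)
        self._worker.start()

    def _loop(self):
        while True:
            job = self._jobs.get()
            if job is None:  # shutdown sentinel
                break
            args, kwargs, done, box = job
            try:
                box["result"] = self._run_fn(*args, **kwargs)
            except Exception as exc:  # propagate errors to the caller
                box["error"] = exc
            done.set()

    def run(self, *args, **kwargs):
        # Called from any thread; blocks until the worker has run the job.
        done = threading.Event()
        box = {}
        self._jobs.put((args, kwargs, done, box))
        done.wait()
        if "error" in box:
            raise box["error"]
        return box["result"]

    def close(self):
        self._jobs.put(None)
        self._worker.join()
```

With a real session this would be constructed as `SerializedRunner(session.run)` and then called with the usual `session.run` arguments from any thread.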

Steps to Reproduce:

Use an AMD GPU.

Load an ONNX model using onnxruntime.InferenceSession() in a multithreaded program.

Observe deadlocks and crashes due to multiple threads competing for the model session.

Implement queueing, locks, and thread semaphores to manage resource contention.

Observe that these mechanisms do not resolve the issue in versions 1.18.0 and later.
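The lock-based mitigation mentioned in the steps above can be sketched similarly. Again `run_fn` is a stand-in for `session.run`; the pattern simply serializes every call through one `threading.Lock`, which (per the report) was sufficient on 1.17.0 and earlier but not on 1.18.0 and later.

```python
import threading


class LockedRunner:
    """Guard every inference call with a single lock so that at most one
    thread is inside the underlying run function at a time."""

    def __init__(self, run_fn):
        self._run_fn = run_fn  # e.g. session.run on a real InferenceSession
        self._lock = threading.Lock()

    def run(self, *args, **kwargs):
        with self._lock:
            return self._run_fn(*args, **kwargs)
```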

Expected Behavior: Multithreading mechanisms should effectively manage resource contention, preventing deadlocks and crashes.

Actual Behavior: Resource contention management mechanisms are ineffective in versions 1.18.0 and later, resulting in persistent deadlocks and crashes.

Environment:

ONNX Runtime DirectML Versions: 1.17.0 and earlier (issue resolved with queueing), 1.18.0 and later (issue persists)

Hardware: AMD GPU

Operating System: Windows 10 or Windows 11

Request for Assistance: Given my observations, there seems to be a resource contention issue, but I am not entirely certain of the underlying cause. Could you provide guidance or solutions for resolving this issue in the newer versions of onnxruntime-directml?

Urgency

No response

Target platform

Windows 10 or Windows 11

Build script

session = onnxruntime.InferenceSession(onnx_model_path, providers=['DmlExecutionProvider', 'CPUExecutionProvider'])

Error / output

The program deadlocks and crashes without generating any error messages or logs.

Visual Studio Version

No response

GCC / Compiler Version

No response

lianshiye0 commented 4 days ago

session = onnxruntime.InferenceSession(onnx_model_path, providers=['DmlExecutionProvider', 'CPUExecutionProvider'])

After loading the model onto the GPU, the crash occurs when session.run() is called.