Djdefrag opened 1 month ago
Tagging @PatriceVignola @smk2007 @fdwr for visibility.
Same here on Windows: versions 1.16.0 through 1.17.3 work fine across multiple threads, but 1.18.0 fails with "Windows fatal exception: access violation". My own Windows SEH handler produced the following stack trace:
-----------
Caught unhandled exception...
-----------
Terminating from thread id 10152
Non-C++ exception:
Error: EXCEPTION_ACCESS_VIOLATION
Type: Read
Addr: 0x0
Trace:
40: ?: PyInit_onnxruntime_pybind11_state (onnxruntime_pybind11_state.pyd)
39: ?: PyInit_onnxruntime_pybind11_state (onnxruntime_pybind11_state.pyd)
38: ?: PyInit_onnxruntime_pybind11_state (onnxruntime_pybind11_state.pyd)
37: ?: PyInit_onnxruntime_pybind11_state (onnxruntime_pybind11_state.pyd)
36: ?: PyInit_onnxruntime_pybind11_state (onnxruntime_pybind11_state.pyd)
35: ?: PyInit_onnxruntime_pybind11_state (onnxruntime_pybind11_state.pyd)
34: ?: PyInit_onnxruntime_pybind11_state (onnxruntime_pybind11_state.pyd)
33: ?: PyInit_onnxruntime_pybind11_state (onnxruntime_pybind11_state.pyd)
32: ?: PyInit_onnxruntime_pybind11_state (onnxruntime_pybind11_state.pyd)
31: ?: PyInit_onnxruntime_pybind11_state (onnxruntime_pybind11_state.pyd)
30: ?: PyInit_onnxruntime_pybind11_state (onnxruntime_pybind11_state.pyd)
29: ?: PyInit_onnxruntime_pybind11_state (onnxruntime_pybind11_state.pyd)
28: ?: PyInit_onnxruntime_pybind11_state (onnxruntime_pybind11_state.pyd)
27: ?: PyInit_onnxruntime_pybind11_state (onnxruntime_pybind11_state.pyd)
26: ?: PyInit_onnxruntime_pybind11_state (onnxruntime_pybind11_state.pyd)
25: ?: PyInit_onnxruntime_pybind11_state (onnxruntime_pybind11_state.pyd)
24: ?: PyInit_onnxruntime_pybind11_state (onnxruntime_pybind11_state.pyd)
23: ?: PyInit_onnxruntime_pybind11_state (onnxruntime_pybind11_state.pyd)
22: ?: PyInit_onnxruntime_pybind11_state (onnxruntime_pybind11_state.pyd)
21: ?: PyInit_onnxruntime_pybind11_state (onnxruntime_pybind11_state.pyd)
20: ?: PyInit_onnxruntime_pybind11_state (onnxruntime_pybind11_state.pyd)
19: ?: PyInit_onnxruntime_pybind11_state (onnxruntime_pybind11_state.pyd)
18: ?: pybind11::error_already_set::discard_as_unraisable (onnxruntime_pybind11_state.pyd)
17: ?: PyObject_MakeTpCall (python311.dll)
16: ?: PyObject_Vectorcall (python311.dll)
15: ?: PyEval_EvalFrameDefault (python311.dll)
14: ?: PyFunction_Vectorcall (python311.dll)
13: ?: PyFunction_Vectorcall (python311.dll)
12: ?: PyObject_CallObject (python311.dll)
11: ?: PyEval_EvalFrameDefault (python311.dll)
10: ?: PyFunction_Vectorcall (python311.dll)
9: ?: PyObject_CallObject (python311.dll)
8: ?: PyEval_EvalFrameDefault (python311.dll)
7: ?: PyFunction_Vectorcall (python311.dll)
6: ?: PyFunction_Vectorcall (python311.dll)
5: ?: PyObject_Call (python311.dll)
4: ?: PyInterpreterState_Delete (python311.dll)
3: ?: PyInterpreterState_Delete (python311.dll)
2: ?: recalloc (ucrtbase.dll)
1: ?: BaseThreadInitThunk (KERNEL32.DLL)
0: ?: RtlUserThreadStart (ntdll.dll)
We’ve noted the GPU resource contention issue caused by using multiple threads. This usage pattern is not recommended: every thread requests all of the GPU resources, which can cause contention. In addition, the allocator in the Python API (for both CUDA and DML) is explicitly not thread-safe, because it is initialized as a global singleton that lives outside of the session.
We’re investigating the recent failure and will address it. Meanwhile, please avoid this pattern to prevent GPU contention.
Hi @liuyunms
Sorry to bother you. I'm currently using one InferenceSession per thread, but you say it shouldn't be used this way:
4 threads -> 4 InferenceSessions on the same GPU
Do you mean I should use the same InferenceSession from multiple threads? Is that possible?
4 threads -> 1 InferenceSession on the same GPU
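One way to get "4 threads -> 1 InferenceSession" is to share a single session and serialize calls to run() with a lock. The sketch below is only an illustration, not an official ONNX Runtime pattern; it assumes the wrapped object exposes run(output_names, input_feed) the way onnxruntime.InferenceSession does, and the class name LockedSession is hypothetical.

```python
import threading
from typing import Any, Dict, List, Optional


class LockedSession:
    """Share one session between threads, allowing only one run() at a time.

    Assumption: `session` exposes run(output_names, input_feed) like
    onnxruntime.InferenceSession; nothing here is ONNX Runtime specific.
    """

    def __init__(self, session: Any) -> None:
        self._session = session
        self._lock = threading.Lock()

    def run(self, output_names: Optional[List[str]], input_feed: Dict[str, Any]) -> Any:
        # Serialize calls so the shared session (and any process-wide
        # allocator behind it) is never driven by two threads at once.
        with self._lock:
            return self._session.run(output_names, input_feed)
```

Usage would look roughly like shared = LockedSession(onnxruntime.InferenceSession("model.onnx", providers=["DmlExecutionProvider"])), after which each of the 4 worker threads calls shared.run(None, {"input": batch}); "model.onnx" and "input" are placeholders. The trade-off is that GPU work from the threads no longer overlaps. As far as we understand the DirectML execution provider documentation, only one thread may call Run on a given DML session at a time, which is why the lock is there.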
@PatriceVignola @smk2007 @fdwr
Hi, sorry to bother you, is there any news on this problem? I'm currently testing 1.18.1 and the problem is still present :(
Thank you
Describe the issue
With the new version 1.18, it seems that when several threads each use their own InferenceSession on the same DirectML device, all threads stall without raising any exception or error.
To reproduce
Thread 1
Thread n (where n can be any number)
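The per-thread reproduction code is not preserved here. As a hypothetical aid for confirming the stall (not the reporter's actual code), a small harness can start n threads running the same work function and report which ones are still alive after a timeout. The function name find_stalled_threads and its parameters are illustrative assumptions.

```python
import threading
import time
from typing import Callable, List


def find_stalled_threads(work: Callable[[int], None], n_threads: int,
                         timeout_s: float) -> List[int]:
    """Run work(i) on n_threads threads; return indices still running after timeout_s.

    Threads are daemons so a genuine hang does not block interpreter exit.
    A thread that raises simply ends (the error goes to threading.excepthook)
    and is not reported as stalled.
    """
    threads = [
        threading.Thread(target=work, args=(i,), name=f"worker-{i}", daemon=True)
        for i in range(n_threads)
    ]
    for t in threads:
        t.start()
    deadline = time.monotonic() + timeout_s
    for t in threads:
        # All threads share one deadline instead of each getting timeout_s.
        t.join(max(0.0, deadline - time.monotonic()))
    return [i for i, t in enumerate(threads) if t.is_alive()]
```

Here work(i) would, for example, build its own onnxruntime.InferenceSession with providers=["DmlExecutionProvider"] and call session.run(None, feeds). If the behaviour described above holds, 1.18.0 would return every index, while 1.17.3 should return an empty list.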
Urgency
No response
Platform
Windows
OS Version
10
ONNX Runtime Installation
Released Package
ONNX Runtime Version or Commit ID
1.18.0
ONNX Runtime API
Python
Architecture
X64
Execution Provider
DirectML
Execution Provider Library Version
1.18.0