Open busishengui opened 1 year ago
Can you try the latest version of ORT? We've not seen reports of this behavior. So, detailed instructions on how to repro will be required.
Using the code from the latest main today, I could not reproduce this issue.
This seems similar to #2804 and #10352.
Hello, I'm a member of MaaAssistantArknights, and the same error occurs in our program.
Onnxruntime version: 1.15.1, using the prebuilt package https://github.com/microsoft/onnxruntime/releases/download/v1.15.1/onnxruntime-linux-x64-gpu-1.15.1.tgz
Exception:
terminate called after throwing an instance of 'onnxruntime::OnnxRuntimeException'
what(): /onnxruntime_src/onnxruntime/core/providers/cuda/cuda_call.cc:121 std::conditional_t<THRW, void, onnxruntime::common::Status> onnxruntime::CudaCall(ERRTYPE, const char*, const char*, ERRTYPE, const char*, const char*, int) [with ERRTYPE = cudaError; bool THRW = true; std::conditional_t<THRW, void, onnxruntime::common::Status> = void] /onnxruntime_src/onnxruntime/core/providers/cuda/cuda_call.cc:114 std::conditional_t<THRW, void, onnxruntime::common::Status> onnxruntime::CudaCall(ERRTYPE, const char*, const char*, ERRTYPE, const char*, const char*, int) [with ERRTYPE = cudaError; bool THRW = true; std::conditional_t<THRW, void, onnxruntime::common::Status> = void] CUDA failure 4: driver shutting down ; GPU=2000772548 ; hostname=Cryolitia-nixos ; file=/onnxruntime_src/onnxruntime/core/providers/cuda/cuda_allocator.cc ; line=99 ; expr=cudaFreeHost(p);
core dump:
#0 0x00007f31a856fd7c __pthread_kill_implementation (libc.so.6 + 0x8cd7c)
#1 0x00007f31a85209c6 raise (libc.so.6 + 0x3d9c6)
#2 0x00007f31a85098fa abort (libc.so.6 + 0x268fa)
#3 0x00007f31a56a9a89 _ZN9__gnu_cxx27__verbose_terminate_handlerEv.cold (libstdc++.so.6 + 0xa9a89)
#4 0x00007f31a56b4f8a _ZN10__cxxabiv111__terminateEPFvvE (libstdc++.so.6 + 0xb4f8a)
#5 0x00007f31a56b3ff9 __cxa_call_terminate (libstdc++.so.6 + 0xb3ff9)
#6 0x00007f31a56b4716 __gxx_personality_v0 (libstdc++.so.6 + 0xb4716)
#7 0x00007f31a87c2864 _Unwind_RaiseException_Phase2 (libgcc_s.so.1 + 0x17864)
#8 0x00007f31a87c32bd _Unwind_Resume (libgcc_s.so.1 + 0x182bd)
#9 0x00007f31134e1364 _ZN11onnxruntime8CudaCallI9cudaErrorLb1EEENSt11conditionalIXT0_EvNS_6common6StatusEE4typeET_PKcS9_S7_S9_S9_i (libonnxruntime_providers_cuda.so + 0xe1364)
#10 0x00007f31134dd91b _ZN11onnxruntime19CUDAPinnedAllocator4FreeEPv (libonnxruntime_providers_cuda.so + 0xdd91b)
#11 0x00007f31a7172d7d n/a (libonnxruntime.so.1.15.1 + 0x972d7d)
#12 0x00007f31a7172f3d n/a (libonnxruntime.so.1.15.1 + 0x972f3d)
#13 0x00007f31134eebe2 _ZN11onnxruntime21CUDAExecutionProviderD1Ev (libonnxruntime_providers_cuda.so + 0xeebe2)
#14 0x00007f31134eed1d _ZN11onnxruntime21CUDAExecutionProviderD0Ev (libonnxruntime_providers_cuda.so + 0xeed1d)
#15 0x00007f31a6a72b8a n/a (libonnxruntime.so.1.15.1 + 0x272b8a)
#16 0x00007f31a6a72d7d n/a (libonnxruntime.so.1.15.1 + 0x272d7d)
#17 0x00007f31a7b31ddd _ZN10fastdeploy10OrtBackendD1Ev (libMaaDerpLearning.so + 0x131ddd)
#18 0x00007f31a7b31e69 _ZN10fastdeploy10OrtBackendD0Ev (libMaaDerpLearning.so + 0x131e69)
#19 0x00007f31a7b27105 _ZN10fastdeploy7RuntimeD2Ev (libMaaDerpLearning.so + 0x127105)
#20 0x00007f31a7b273d2 _ZNSt15_Sp_counted_ptrIPN10fastdeploy7RuntimeELN9__gnu_cxx12_Lock_policyE2EE10_M_disposeEv (libMaaDerpLearning.so + 0x1273d2)
#21 0x00007f31a8188859 _ZN10fastdeploy15FastDeployModelD1Ev (libMaaCore.so + 0x188859)
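One reading of the backtrace above is that the exception thrown by CudaCall (THRW = true) tries to propagate out of a destructor chain. Since C++11, destructors are implicitly noexcept, so an exception escaping one ends in std::terminate, which matches the "terminate called after throwing" message. The sketch below illustrates only that language rule and a non-throwing alternative; free_or_throw, SafeOwner, and destroy_safely are hypothetical stand-ins and are not part of ONNX Runtime or CUDA.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>
#include <type_traits>

// Hypothetical stand-in for a cleanup call that throws on failure,
// similar in spirit to CudaCall with THRW = true. Not a real ORT/CUDA API.
void free_or_throw(bool driver_alive) {
    if (!driver_alive) {
        throw std::runtime_error("driver shutting down");
    }
}

// Hypothetical resource owner whose destructor records a cleanup failure
// instead of letting the exception escape.
class SafeOwner {
public:
    SafeOwner(bool driver_alive, std::string& last_error)
        : driver_alive_(driver_alive), last_error_(last_error) {}

    ~SafeOwner() {
        try {
            free_or_throw(driver_alive_);
        } catch (const std::exception& e) {
            // Swallow and record: rethrowing here would call std::terminate.
            last_error_ = e.what();
        }
    }

private:
    bool driver_alive_;
    std::string& last_error_;
};

// Destructors are noexcept by default, which is why an escaping
// exception aborts the process rather than reaching a caller.
static_assert(std::is_nothrow_destructible<SafeOwner>::value,
              "destructors are implicitly noexcept");

// Creates and destroys one owner, returning any recorded cleanup error.
std::string destroy_safely(bool driver_alive) {
    std::string last_error;
    assert(last_error.empty());  // nothing is recorded before destruction
    {
        SafeOwner owner(driver_alive, last_error);
    }  // owner destroyed here
    return last_error;
}
```

Calling destroy_safely(false) returns the recorded message instead of aborting. Whether ONNX Runtime should treat cleanup failures this way is a design question for the maintainers; the sketch only shows why the process aborts rather than returning an error.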
For more technical details:
fastdeploy::Runtime creates an Ort::Session in https://github.com/MaaAssistantArknights/FastDeploy/blob/master/fastdeploy/backends/ort/ort_backend.cc
The error reported is "driver shutting down".
Could this be because each Ort::Session instance owns an instance of the CUDA driver, but the CUDA driver is shut down globally when the first instance is destroyed, so the second instance then tries to shut down an already shut-down CUDA driver?
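The ordering problem suspected above can be sketched without CUDA or ONNX Runtime. FakeDriver and FakeSession below are hypothetical stand-ins, and the sketch assumes the failure comes from GPU-owning objects being destroyed after the driver state is already gone (for example, objects destroyed late during process exit). That is a hypothesis from this thread, not a confirmed root cause.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <vector>

// Hypothetical stand-in for the CUDA runtime state. Not a real CUDA API.
struct FakeDriver {
    bool alive = true;
};

// Records what each session observed when it was destroyed.
static std::vector<std::string> g_log;

// Hypothetical stand-in for an object that owns GPU resources,
// such as a wrapper holding an Ort::Session.
class FakeSession {
public:
    explicit FakeSession(const FakeDriver& driver) : driver_(driver) {}

    ~FakeSession() {
        // A real session would release pinned memory and events here,
        // which fails once the driver is no longer available.
        g_log.push_back(driver_.alive ? "freed cleanly"
                                      : "driver shutting down");
    }

private:
    const FakeDriver& driver_;
};

// The session outlives the driver: cleanup runs too late.
std::string destroy_after_driver() {
    g_log.clear();
    FakeDriver driver;
    auto session = std::make_unique<FakeSession>(driver);
    driver.alive = false;  // driver torn down first
    session.reset();       // session cleanup now observes a dead driver
    assert(!g_log.empty());
    return g_log.back();
}

// The session is released explicitly while the driver is still alive.
std::string destroy_before_driver() {
    g_log.clear();
    FakeDriver driver;
    auto session = std::make_unique<FakeSession>(driver);
    session.reset();       // explicit release before teardown
    driver.alive = false;
    assert(!g_log.empty());
    return g_log.back();
}
```

If this hypothesis holds, a possible workaround on the application side is to release every object that holds an Ort::Session explicitly before the process begins exiting, rather than relying on late automatic destruction.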
I hit the same problem. The program ends with:
terminate called after throwing an instance of 'onnxruntime::OnnxRuntimeException'
what(): /onnxruntime_src/onnxruntime/core/providers/cuda/cuda_call.cc:122 bool onnxruntime::CudaCall(ERRTYPE, const char*, const char*, ERRTYPE, const char*) [with ERRTYPE = cudaError; bool THRW = true] /onnxruntime_src/onnxruntime/core/providers/cuda/cuda_call.cc:116 bool onnxruntime::CudaCall(ERRTYPE, const char*, const char*, ERRTYPE, const char*) [with ERRTYPE = cudaError; bool THRW = true] CUDA failure 4: driver shutting down ; GPU=-2130784471 ; hostname=dev-audioaihcb1 ; expr=cudaEventSynchronize(e);
The onnxruntime version is onnxruntime-linux-x64-gpu-1.12.0.
onnxruntime-linux-x64-gpu-1.16.3 has the same problem.
Debugging with breakpoints on cudaFreeHost and cudaMallocHost
Describe the issue
It works well when running on the GPU, but it crashes when the program terminates:
terminate called after throwing an instance of 'onnxruntime::OnnxRuntimeException'
what(): /onnxruntime_src/onnxruntime/core/providers/cuda/cuda_call.cc:122 bool onnxruntime::CudaCall(ERRTYPE, const char*, const char*, ERRTYPE, const char*) [with ERRTYPE = cudaError; bool THRW = true] /onnxruntime_src/onnxruntime/core/providers/cuda/cuda_call.cc:116 bool onnxruntime::CudaCall(ERRTYPE, const char*, const char*, ERRTYPE, const char*) [with ERRTYPE = cudaError; bool THRW = true] CUDA failure 4: driver shutting down ; GPU=806358777 ; hostname=lv-voice-rt-02 ; expr=cudaEventSynchronize(e);
To reproduce
Urgency
No response
Platform
Linux
OS Version
centos
ONNX Runtime Installation
Released Package
ONNX Runtime Version or Commit ID
1.12.1
ONNX Runtime API
C++
Architecture
X64
Execution Provider
CUDA
Execution Provider Library Version
CUDA 11.4