microsoft / onnxruntime

ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator
https://onnxruntime.ai

cudaMemcpyAsync throws exception in GPUDataTransfer #19076

Open · laxnpander opened this issue 8 months ago

laxnpander commented 8 months ago

Describe the issue

Hey all,

I have an issue running the following model: https://github.com/fabio-sim/LightGlue-ONNX More specifically, this ONNX file: https://github.com/fabio-sim/LightGlue-ONNX/releases/download/v1.0.0/superpoint_lightglue_end2end_fused.onnx

Verbose log: verbose_log.txt

CUDA throws an exception while asynchronously copying data. According to the verbose log, it always seems to happen at the kernel with idx 2478. The stack trace looks as follows:

<unknown> 0x00007fffd9970935
<unknown> 0x00007fffd9a5d86a
<unknown> 0x00007fffd9b914cb
<unknown> 0x00007fffd9b91d61
<unknown> 0x00007fffd9cb9130
<unknown> 0x00007fffd9931a33
<unknown> 0x00007fffd9931f41
<unknown> 0x00007fffd9932ea8
<unknown> 0x00007fffd9b000d1
<unknown> 0x00007fffdb644459
<unknown> 0x00007fffdb6176fd
cudaMemcpyAsync 0x00007fffdb6696a5
onnxruntime::GPUDataTransfer::CopyTensorAsync(onnxruntime::Tensor const&, onnxruntime::Tensor&, onnxruntime::Stream&) const 0x00007fff9fd1b0dd
onnxruntime::IDataTransfer::CopyTensors(std::vector<onnxruntime::IDataTransfer::SrcDstPair, std::allocator<onnxruntime::IDataTransfer::SrcDstPair> > const&) const 0x00007ffff6dbbe63
onnxruntime::ProviderHostImpl::IDataTransfer__CopyTensors(onnxruntime::IDataTransfer const*, std::vector<onnxruntime::IDataTransfer::SrcDstPair, std::allocator<onnxruntime::IDataTransfer::SrcDstPair> > const&) 0x00007ffff66406a8
onnxruntime::IDataTransfer::CopyTensors(std::vector<onnxruntime::IDataTransfer::SrcDstPair, std::allocator<onnxruntime::IDataTransfer::SrcDstPair> > const&) const 0x00007fff9ff35bc7
onnxruntime::DataTransferManager::CopyTensors(std::vector<onnxruntime::IDataTransfer::SrcDstPair, std::allocator<onnxruntime::IDataTransfer::SrcDstPair> > const&) const 0x00007ffff6dbf95d
onnxruntime::utils::ExecuteGraphImpl(onnxruntime::SessionState const&, onnxruntime::FeedsFetchesManager const&, gsl::span<OrtValue const, 18446744073709551615ul>, std::vector<OrtValue, std::allocator<OrtValue> >&, std::unordered_map<unsigned long, std::function<onnxruntime::common::Status (onnxruntime::TensorShape const&, OrtDevice const&, OrtValue&, bool&)>, std::hash<unsigned long>, std::equal_to<unsigned long>, std::allocator<std::pair<unsigned long const, std::function<onnxruntime::common::Status (onnxruntime::TensorShape const&, OrtDevice const&, OrtValue&, bool&)> > > > const&, ExecutionMode, bool const&, onnxruntime::logging::Logger const&, onnxruntime::DeviceStreamCollection*, bool, onnxruntime::Stream*) 0x00007ffff6e65802
onnxruntime::utils::ExecuteGraph(onnxruntime::SessionState const&, onnxruntime::FeedsFetchesManager&, gsl::span<OrtValue const, 18446744073709551615ul>, std::vector<OrtValue, std::allocator<OrtValue> >&, ExecutionMode, bool const&, onnxruntime::logging::Logger const&, onnxruntime::DeviceStreamCollectionHolder&, bool, onnxruntime::Stream*) 0x00007ffff6e66e8b
onnxruntime::utils::ExecuteGraph(onnxruntime::SessionState const&, onnxruntime::FeedsFetchesManager&, gsl::span<OrtValue const, 18446744073709551615ul>, std::vector<OrtValue, std::allocator<OrtValue> >&, ExecutionMode, OrtRunOptions const&, onnxruntime::DeviceStreamCollectionHolder&, onnxruntime::logging::Logger const&) 0x00007ffff6e671f3
onnxruntime::InferenceSession::Run(OrtRunOptions const&, gsl::span<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, 18446744073709551615ul>, gsl::span<OrtValue const, 18446744073709551615ul>, gsl::span<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, 18446744073709551615ul>, std::vector<OrtValue, std::allocator<OrtValue> >*, std::vector<OrtDevice, std::allocator<OrtDevice> > const*) [clone .localalias] 0x00007ffff668ac8a
onnxruntime::InferenceSession::Run(OrtRunOptions const&, gsl::span<char const* const, 18446744073709551615ul>, gsl::span<OrtValue const* const, 18446744073709551615ul>, gsl::span<char const* const, 18446744073709551615ul>, gsl::span<OrtValue*, 18446744073709551615ul>) 0x00007ffff668bab2
OrtApis::Run(OrtSession*, OrtRunOptions const*, char const* const*, OrtValue const* const*, unsigned long, char const* const*, unsigned long, OrtValue**) 0x00007ffff6613fff
Ort::detail::SessionImpl::Run onnxruntime_cxx_inline.h:967
spear::ort::Inference::run Inference.h:314
main superpoint_lightglue_main.cpp:67
__libc_start_call_main 0x00007ffff5c29d90
__libc_start_main_impl 0x00007ffff5c29e40
_start 0x0000555555558d55

The test report also shows some failures. Not sure if they are related:

[----------] Global test environment tear-down
[==========] 3957 tests from 279 test suites ran. (260820 ms total)
[  PASSED  ] 3935 tests.
[  SKIPPED ] 2 tests, listed below:
[  SKIPPED ] MatMulFpQ4.MatMul2DSym
[  SKIPPED ] MatMulFpQ4.MatMul2DBlkZp
[  FAILED  ] 20 tests, listed below:
[  FAILED  ] QOrderedTest.Attention_WithData_ROW_ORDER
[  FAILED  ] QOrderedTest.LongformerAttention_1x128x2x16_window_32
[  FAILED  ] QOrderedTest.MatMul_COL_16x64x32
[  FAILED  ] QOrderedTest.MatMul_COL_16x64x32_perchannel
[  FAILED  ] QOrderedTest.MatMul_bias_COL_16x64x32
[  FAILED  ] QOrderedTest.MatMul_bias_COL_16x64x32_perchannel
[  FAILED  ] QOrderedTest.MatMul_addC_COL_16x64x32
[  FAILED  ] QOrderedTest.MatMul_addC_COL_16x64x32_perchannel
[  FAILED  ] QOrderedTest.MatMul_bias_addC_COL_16x64x32
[  FAILED  ] QOrderedTest.MatMul_bias_addC_COL_16x64x32_perchannel
[  FAILED  ] QOrderedTest.MatMul_COL_16x64x32_b3_1
[  FAILED  ] QOrderedTest.MatMul_bias_COL_16x64x32_b2_1
[  FAILED  ] QOrderedTest.MatMul_bias_COL_16x64x32_b2_1_perchannel
[  FAILED  ] QOrderedTest.MatMul_addC_COL_16x64x32_b2_1
[  FAILED  ] QOrderedTest.MatMul_addC_COL_16x64x32_b2_1_perchannel
[  FAILED  ] QOrderedTest.MatMul_addC_broadcastC_COL_16x64x32_b2_1
[  FAILED  ] QOrderedTest.MatMul_addC_bias_COL_16x64x32_b2_1
[  FAILED  ] QOrderedTest.MatMul_addC_bias_COL_16x64x32_b2_1_perchannel
[  FAILED  ] QOrderedTest.MatMul_bias_addC_broadcastC_COL_16x64x32_b2_1
[  FAILED  ] QOrderedTest.MatMul_bias_addC_broadcastC_COL_16x64x32_b2_1_perchannel

To reproduce

Load the model into onnxruntime, set two images as input, and run inference via the C++ API.
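
Roughly, the reproduction looks like this (a minimal sketch, not the exact code from Inference.h; input/output names are queried from the model, but the input layout, two 1x1xHxW float images in [0, 1], is an assumption about the SuperPoint+LightGlue end-to-end export):

```cpp
#include <onnxruntime_cxx_api.h>
#include <vector>

int main() {
  Ort::Env env(ORT_LOGGING_LEVEL_VERBOSE, "lightglue");
  Ort::SessionOptions opts;
  OrtCUDAProviderOptions cuda_opts{};            // device 0, default settings
  opts.AppendExecutionProvider_CUDA(cuda_opts);
  Ort::Session session(env, "superpoint_lightglue_end2end_fused.onnx", opts);

  // Query the actual input/output names instead of hard-coding them.
  Ort::AllocatorWithDefaultOptions alloc;
  std::vector<Ort::AllocatedStringPtr> holders;
  std::vector<const char*> in_names, out_names;
  for (size_t i = 0; i < session.GetInputCount(); ++i) {
    holders.push_back(session.GetInputNameAllocated(i, alloc));
    in_names.push_back(holders.back().get());
  }
  for (size_t i = 0; i < session.GetOutputCount(); ++i) {
    holders.push_back(session.GetOutputNameAllocated(i, alloc));
    out_names.push_back(holders.back().get());
  }

  // Two dummy grayscale images; the real code feeds actual image data.
  std::vector<int64_t> shape{1, 1, 480, 640};
  std::vector<float> img0(1 * 480 * 640, 0.5f), img1(img0);
  Ort::MemoryInfo mem = Ort::MemoryInfo::CreateCpu(OrtArenaAllocator, OrtMemTypeDefault);
  std::vector<Ort::Value> inputs;
  inputs.push_back(Ort::Value::CreateTensor<float>(mem, img0.data(), img0.size(),
                                                   shape.data(), shape.size()));
  inputs.push_back(Ort::Value::CreateTensor<float>(mem, img1.data(), img1.size(),
                                                   shape.data(), shape.size()));

  // The crash reported above happens during this call, inside
  // GPUDataTransfer::CopyTensorAsync via cudaMemcpyAsync.
  auto outputs = session.Run(Ort::RunOptions{nullptr}, in_names.data(), inputs.data(),
                             inputs.size(), out_names.data(), out_names.size());
  return 0;
}
```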

Urgency

No

Platform

Linux

OS Version

Ubuntu 22.04

ONNX Runtime Installation

Built from Source

ONNX Runtime Version or Commit ID

1.16.3

ONNX Runtime API

C++

Architecture

X64

Execution Provider

CUDA

Execution Provider Library Version

CUDA 11.8

xadupre commented 8 months ago

Since the unit tests are failing, this does not seem related to your model. Maybe the compilation options you are using do not work for your machine? Can you share your machine specifications and your command line?

laxnpander commented 8 months ago

@xadupre Yeah, sure. I am on Ubuntu 22.04. My build command is as follows:

./build.sh --config Release --use_cuda --cudnn_home /usr/local/cuda --cuda_home /usr/local/cuda --build_shared_lib --skip_tests

My first build was with CUDA 12.3, the latest NVIDIA driver for this setup (I think 545), and onnxruntime 1.17 built straight from source. The error was as shown above. Then I thought maybe the whole setup is just too recent, so I rolled back to NVIDIA driver 525, CUDA 11.8, and onnxruntime 1.16. Unfortunately, the result is the same. Not really sure where to start debugging. Any hints?

yuslepukhin commented 7 months ago

cuda-memcheck comes to mind; see if any race comes to the surface.
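
For example (assuming the executable built from superpoint_lightglue_main.cpp is named superpoint_lightglue_main; on CUDA 12.x, compute-sanitizer replaces cuda-memcheck):

cuda-memcheck --tool racecheck ./superpoint_lightglue_main
compute-sanitizer --tool memcheck ./superpoint_lightglue_main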

xadupre commented 7 months ago

I would try to compile in Debug mode to see if the crash still appears. If it does, it should be easier to find the exact line causing the crash, or to see whether an error is detected before the crash.
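
For instance, the same command you used with the configuration switched:

./build.sh --config Debug --use_cuda --cudnn_home /usr/local/cuda --cuda_home /usr/local/cuda --build_shared_lib --skip_tests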

laxnpander commented 7 months ago

> I would try to compile in Debug mode to see if the crash still appears. If it does, it should be easier to find the exact line causing the crash, or to see whether an error is detected before the crash.

That... sounds completely obvious, haha. Here is the stack trace of the debug build:

__pthread_kill_implementation 0x00007ffff4e969fc
__pthread_kill_internal 0x00007ffff4e969fc
__GI___pthread_kill 0x00007ffff4e969fc
__GI_raise 0x00007ffff4e42476
__GI_abort 0x00007ffff4e287f3
__assert_fail_base 0x00007ffff4e2871b
__GI___assert_fail 0x00007ffff4e39e96
onnxruntime::PlannerImpl::DecrementUseCount allocation_planner.cc:237
onnxruntime::PlannerImpl::ComputeSingleStreamReusePlan allocation_planner.cc:1443
onnxruntime::PlannerImpl::ComputeReusePlan allocation_planner.cc:1323
onnxruntime::PlannerImpl::CreatePlan allocation_planner.cc:2141
onnxruntime::SequentialPlanner::CreatePlan allocation_planner.cc:2198
onnxruntime::SessionState::FinalizeSessionStateImpl session_state.cc:1403
onnxruntime::SessionState::FinalizeSessionState session_state.cc:1186
onnxruntime::InferenceSession::Initialize inference_session.cc:1714
InitializeSession onnxruntime_c_api.cc:764
OrtApis::CreateSession onnxruntime_c_api.cc:780
Ort::Session::Session onnxruntime_cxx_inline.h:1020
__gnu_cxx::new_allocator::construct<…> new_allocator.h:162
std::allocator_traits::construct<…> alloc_traits.h:516
std::_Sp_counted_ptr_inplace::_Sp_counted_ptr_inplace<…> shared_ptr_base.h:519
std::__shared_count::__shared_count<…> shared_ptr_base.h:650
std::__shared_ptr::__shared_ptr<…> shared_ptr_base.h:1342
std::shared_ptr::shared_ptr<…> shared_ptr.h:409
std::allocate_shared<…> shared_ptr.h:863
std::make_shared<…> shared_ptr.h:879
spear::ort::Inference::loadOnnxNetwork Inference.h:128
spear::ort::Inference::Inference Inference.h:96
main superpoint_lightglue_main.cpp:37
__libc_start_call_main 0x00007ffff4e29d90
__libc_start_main_impl 0x00007ffff4e29e40
_start 0x0000555555558d15

So apparently there is already an issue during session creation: the assert fires in the allocation planner, inside InferenceSession::Initialize, before any inference is run.