microsoft / onnxruntime

ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator
https://onnxruntime.ai
MIT License

ONNXRuntime 1.18 crashing with TensorRT EP when dealing with big inputs #21001

Closed: sansrem closed this issue 3 weeks ago

sansrem commented 2 months ago

Describe the issue

Testing ONNXRuntime 1.18 with the TensorRT EP, using either TensorRT 10.0.1 or 8.5.3. For the TensorRT 10.0.1 tests we used the onnxruntime-linux-x64-gpu-1.18.0.tgz package directly; for the TensorRT 8.5.3 tests we recompiled ONNXRuntime 1.18 against TensorRT 8.5.3.

With TensorRT 10.0.1 our model crashes when dealing with 2 input images of 4K UHDTV size (3840x2167), with this error in the shell:

Error [Non-zero status code returned while running TRTKernel_graph_torch_jit_5378504288688145163_0 node. Name:'TensorrtExecutionProvider_TRTKernel_graph_torch_jit_5378504288688145163_0_0' Status Message: TensorRT EP failed to create engine from network.]

and this call stack:

#5  0x00007fc7f0c30cf0 in () at /lib64/libpthread.so.0

#6  0x00007fbe6b9d8102 in onnxruntime::TensorrtExecutionProvider::CreateNodeComputeInfoFromGraph(onnxruntime::GraphViewer const&, onnxruntime::Node const&, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, unsigned long, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, unsigned long> > >&, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, unsigned long, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, unsigned long> > >&, std::vector<onnxruntime::NodeComputeInfo, std::allocator<onnxruntime::NodeComputeInfo> >&)::{lambda(void*, OrtApi const*, OrtKernelContext*)#3}::operator()(void*, OrtApi const*, OrtKernelContext*) const [clone .isra.2141] () at PATH/libonnxruntime_providers_tensorrt.so

#7  0x00007fbe6b9dae50 in std::_Function_handler<onnxruntime::common::Status (void*, OrtApi const*, OrtKernelContext*), onnxruntime::TensorrtExecutionProvider::CreateNodeComputeInfoFromGraph(onnxruntime::GraphViewer const&, onnxruntime::Node const&, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, unsigned long, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, unsigned long> > >&, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, unsigned long, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, unsigned long> > >&, std::vector<onnxruntime::NodeComputeInfo, std::allocator<onnxruntime::NodeComputeInfo> >&)::{lambda(void*, OrtApi const*, OrtKernelContext*)#3}>::_M_invoke(std::_Any_data const&, void*&&, OrtApi const*&&, OrtKernelContext*&&) () at PATH/libonnxruntime_providers_tensorrt.so

#8  0x00007fc7cd2923c1 in onnxruntime::FunctionKernel::Compute(onnxruntime::OpKernelContext*) const () at PATH/libonnxruntime.so.1.18.0

#9  0x00007fc7cd33272f in onnxruntime::ExecuteKernel(onnxruntime::StreamExecutionContext&, unsigned long, unsigned long, bool const&, onnxruntime::SessionScope&) () at PATH/libonnxruntime.so.1.18.0

#10 0x00007fc7cd32a5ef in onnxruntime::LaunchKernelStep::Execute(onnxruntime::StreamExecutionContext&, unsigned long, onnxruntime::SessionScope&, bool const&, bool&) () at PATH/libonnxruntime.so.1.18.0

#11 0x00007fc7cd335723 in onnxruntime::RunSince(unsigned long, onnxruntime::StreamExecutionContext&, onnxruntime::SessionScope&, bool const&, unsigned long) () at PATH/libonnxruntime.so.1.18.0

#12 0x00007fc7cd3308d1 in onnxruntime::ExecuteThePlan(onnxruntime::SessionState const&, gsl::span<int const, 18446744073709551615ul>, gsl::span<OrtValue const, 18446744073709551615ul>, gsl::span<int const, 18446744073709551615ul>, std::vector<OrtValue, std::allocator<OrtValue> >&, std::unordered_map<unsigned long, std::function<onnxruntime::common::Status (onnxruntime::TensorShape const&, OrtDevice const&, OrtValue&, bool&)>, std::hash<unsigned long>, std::equal_to<unsigned long>, std::allocator<std::pair<unsigned long const, std::function<onnxruntime::common::Status (onnxruntime::TensorShape const&, OrtDevice const&, OrtValue&, bool&)> > > > const&, onnxruntime::logging::Logger const&, onnxruntime::DeviceStreamCollection const*, bool const&, bool, bool) () at PATH/libonnxruntime.so.1.18.0

#13 0x00007fc7cd303ccf in onnxruntime::utils::ExecuteGraphImpl(onnxruntime::SessionState const&, onnxruntime::FeedsFetchesManager const&, gsl::span<OrtValue const, 18446744073709551615ul>, std::vector<OrtValue, std::allocator<OrtValue> >&, std::unordered_map<unsigned long, std::function<onnxruntime::common::Status (onnxruntime::TensorShape const&, OrtDevice const&, OrtValue&, bool&)>, std::hash<unsigned long>, std::equal_to<unsigned long>, std::allocator<std::pair<unsigned long const, std::function<onnxruntime::common::Status (onnxruntime::TensorShape const&, OrtDevice const&, OrtValue&, bool&)> > > > const&, ExecutionMode, bool const&, onnxruntime::logging::Logger const&, onnxruntime::DeviceStreamCollection*, bool, onnxruntime::Stream*) () at PATH/libonnxruntime.so.1.18.0

#14 0x00007fc7cd30659c in onnxruntime::utils::ExecuteGraph(onnxruntime::SessionState const&, onnxruntime::FeedsFetchesManager&, gsl::span<OrtValue const, 18446744073709551615ul>, std::vector<OrtValue, std::allocator<OrtValue> >&, ExecutionMode, bool const&, onnxruntime::logging::Logger const&, onnxruntime::DeviceStreamCollectionHolder&, bool, onnxruntime::Stream*) () at PATH/libonnxruntime.so.1.18.0

#15 0x00007fc7cd30696a in onnxruntime::utils::ExecuteGraph(onnxruntime::SessionState const&, onnxruntime::FeedsFetchesManager&, gsl::span<OrtValue const, 18446744073709551615ul>, std::vector<OrtValue, std::allocator<OrtValue> >&, ExecutionMode, OrtRunOptions const&, onnxruntime::DeviceStreamCollectionHolder&, onnxruntime::logging::Logger const&) () at PATH/libonnxruntime.so.1.18.0

#16 0x00007fc7ccb5500a in onnxruntime::InferenceSession::Run(OrtRunOptions const&, gsl::span<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, 18446744073709551615ul>, gsl::span<OrtValue const, 18446744073709551615ul>, gsl::span<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, 18446744073709551615ul>, std::vector<OrtValue, std::allocator<OrtValue> >*, std::vector<OrtDevice, std::allocator<OrtDevice> > const*) [clone .localalias.2030] () at PATH/libonnxruntime.so.1.18.0

#17 0x00007fc7ccb558e0 in onnxruntime::InferenceSession::Run(OrtRunOptions const&, gsl::span<char const* const, 18446744073709551615ul>, gsl::span<OrtValue const* const, 18446744073709551615ul>, gsl::span<char const* const, 18446744073709551615ul>, gsl::span<OrtValue*, 18446744073709551615ul>) () at PATH/libonnxruntime.so.1.18.0

#18 0x00007fc7ccae253c in OrtApis::Run(OrtSession*, OrtRunOptions const*, char const* const*, OrtValue const* const*, unsigned long, char const* const*, unsigned long, OrtValue**) () at PATH/libonnxruntime.so.1.18.0

Running the same model with ONNXRuntime 1.18 and TensorRT 8.5.3 works fine with these inputs (3840x2167), still works at 6K (6531x3100), and crashes at 8K (7680x4320).

Running with TensorRT 10.0.1 on a machine with lower compute capability (for example, one where nvidia-smi --query-gpu=compute_cap --format=csv returns 6.1), ONNXRuntime crashes with the same error/call stack with 2 HD images (1920x1080).

So here are the observations:
1. ONNXRuntime should not crash in any of these cases; it should return an error.
2. In our case moving to TensorRT 10 is not an option: it crashes on older machines, and it cannot handle the same image sizes as TensorRT 8.5.3.

To reproduce

Running a model that takes large images as input with the TensorRT EP will make the software crash.

Urgency

No response

Platform

Linux

OS Version

Rocky Linux 8.7/9.3

ONNX Runtime Installation

Built from Source

ONNX Runtime Version or Commit ID

1.18

ONNX Runtime API

C++

Architecture

X86

Execution Provider

TensorRT

Execution Provider Library Version

CUDA 11.8 TensorRT 10.0.1 or 8.5.3

chilo-ms commented 2 months ago

The error message "TensorRT EP failed to create engine from network" indicates something went wrong when the TRT EP called TRT's buildSerializedNetwork() API. Since it happens when dealing with large images, I suspect it's due to OOM.

Could you increase trt_max_workspace_size to see if that helps? The default is 1 GB.

Also, quick question, can you repro the issue using trtexec?

sansrem commented 2 months ago

Hi,

I tried with trt_max_workspace_size (https://onnxruntime.ai/docs/execution-providers/TensorRT-ExecutionProvider.html#trt_max_workspace_size) set to 2G, 4G, and 8G with the same result, also getting this additional warning when it is set greater than 1G:

2024-06-13 07:26:15.840575541 [W:onnxruntime:CF, tensorrt_execution_provider.cc:1479 TensorrtExecutionProvider] [TensorRT EP] TensorRT option trt_max_workspace_size must be a positive integer value. Set it to 1073741824 (1GB)

I'm not really familiar with trtexec; I tried just specifying the ONNX model and it failed with:

[06/13/2024-17:20:39] [E] [TRT] ModelImporter.cpp:732: ERROR: builtin_op_importers.cpp:4531 In function importSlice: [8] Assertion failed: (axes.allValuesKnown()) && "This version of TensorRT does not support dynamic axes."
[06/13/2024-17:20:39] [E] Failed to parse onnx file
[06/13/2024-17:20:39] [I] Finish parsing network model
[06/13/2024-17:20:39] [E] Parsing model failed
[06/13/2024-17:20:39] [E] Failed to create engine from model or file.
[06/13/2024-17:20:39] [E] Engine set up failed

I used TensorRT 8.5.3 in this case.
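
(For reference, a typical trtexec invocation for this kind of test looks roughly like the following; the input tensor name and shape here are placeholders, not taken from the actual model:

    trtexec --onnx=model.onnx --shapes=input0:1x6x3840x2176 --workspace=8192

On TensorRT 8.5, --workspace is given in MiB; TensorRT 10 drops that flag in favor of --memPoolSize=workspace:8192M.)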


chilo-ms commented 2 months ago

> I tried with trt_max_workspace_size set to 2G, 4G, and 8G with the same result, also getting this additional warning when it is set greater than 1G

Hmm, that's strange. Could you share the code that sets trt_max_workspace_size? Please see the example code here: https://onnxruntime.ai/docs/execution-providers/TensorRT-ExecutionProvider.html#click-below-for-c-api-example
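
(For illustration, a minimal C++ sketch of setting it via the legacy OrtTensorRTProviderOptions struct; the model path and field values are placeholders, not necessarily how your integration is wired:

    #include <onnxruntime_cxx_api.h>

    int main() {
      Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "trt_workspace_test");
      Ort::SessionOptions session_options;

      OrtTensorRTProviderOptions trt_options{};          // zero-initialize, then set the fields we care about
      trt_options.device_id = 0;
      trt_options.trt_max_partition_iterations = 1000;
      trt_options.trt_min_subgraph_size = 1;
      trt_options.trt_max_workspace_size = 8ULL << 30;   // 8 GB, in bytes
      session_options.AppendExecutionProvider_TensorRT(trt_options);

      Ort::Session session(env, "model.onnx", session_options);  // placeholder model path
      return 0;
    }
)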

As for trtexec, some models are not fully TRT-eligible, and it seems that's the case for your model, so trtexec won't be able to run them. How about trtexec with TRT 10? Could you share the proxy model so that we can repro on our side? Or could you point to a public model that reproduces the issue?

sansrem commented 2 months ago

Found the problem on my side for trt_max_workspace_size; re-validated with 2G, 4G, and 8G.

Still getting:

2024-06-14 11:34:30.389829469 [W:onnxruntime:CF, tensorrt_execution_provider.h:84 log] [2024-06-14 15:34:30 WARNING] Skipping tactic 0x0000000000000000 due to exception autotuning: CUDA error 2 allocating 6370102777-byte buffer: out of memory
2024-06-14 11:34:30.480769226 [E:onnxruntime:CF, tensorrt_execution_provider.h:82 log] [2024-06-14 15:34:30 ERROR] 4: [optimizer.cpp::computeCosts::3726] Error Code 4: Internal Error (Could not find any implementation for node {ForeignNode[onnx::Cast_507[Constant]...Concat_372]} due to insufficient workspace. See verbose log for requested sizes.)
2024-06-14 11:34:30.520078719 [E:onnxruntime:CF, tensorrt_execution_provider.h:82 log] [2024-06-14 15:34:30 ERROR] 2: [builder.cpp::buildSerializedNetwork::751] Error Code 2: Internal Error (Assertion engine != nullptr failed.)
2024-06-14 11:34:30.520215247 [E:onnxruntime:, sequential_executor.cc:516 ExecuteKernel] Non-zero status code returned while running TRTKernel_graph_torch_jit_16074816800397161377_0 node. Name:'TensorrtExecutionProvider_TRTKernel_graph_torch_jit_16074816800397161377_0_0' Status Message: TensorRT EP failed to create engine from network.

We already sent the model (FLMFRIFE_Untrained.onnx) to a member of the ONNXRuntime team: Scott McKay.


jywu-msft commented 2 months ago

we'll sync with @skottmckay to get the model

geraldstanje commented 2 months ago

what is trt_max_workspace_size?

sansrem commented 2 months ago

With this code

static const size_t nbGig = getenv("ENV_TRT_WRKS_SZ") ? atoi(getenv("ENV_TRT_WRKS_SZ")) : 1;
trto.trt_max_workspace_size = nbGig * 1073741824;  // workspace limit in bytes (nbGig GB)

I tried with ENV_TRT_WRKS_SZ = 1, 2, 4, and 8 with the same result.


chilo-ms commented 2 months ago

> what is trt_max_workspace_size?

The value of trt_max_workspace_size determines the size limit of the memory pool TensorRT can use while building the engine. See the TRT docs:
https://docs.nvidia.com/deeplearning/tensorrt/api/c_api/namespacenvinfer1.html#a125336eeaa69c11d9aca0535449f0391
https://docs.nvidia.com/deeplearning/tensorrt/api/c_api/classnvinfer1_1_1_i_builder_config.html#a0a88a9b43bbe47c839ba65de9b40779f
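
(Roughly, the TRT EP forwards this limit to TensorRT's builder configuration; a sketch of the underlying TensorRT call, assuming an existing IBuilderConfig, not the EP's exact code:

    #include <cstddef>
    #include <NvInfer.h>

    void SetWorkspaceLimit(nvinfer1::IBuilderConfig* config, std::size_t max_workspace_bytes) {
      // Caps the scratch memory TensorRT may allocate while building and auto-tuning the engine.
      config->setMemoryPoolLimit(nvinfer1::MemoryPoolType::kWORKSPACE, max_workspace_bytes);
    }
)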

> 2024-06-14 11:34:30.480769226 [E:onnxruntime:CF, tensorrt_execution_provider.h:82 log] [2024-06-14 15:34:30 ERROR] 4: [optimizer.cpp::computeCosts::3726] Error Code 4: Internal Error (Could not find any implementation for node {ForeignNode[onnx::Cast_507[Constant]...Concat_372]} due to insufficient workspace.

The error message shows "insufficient workspace". It seems 8G is not enough. Could you set the value to the maximum GPU memory size?

BTW, you can also monitor GPU memory usage while running inference to see the actual consumption.
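
(For example, running something like nvidia-smi --query-gpu=memory.used,memory.total --format=csv -l 1 in a second shell prints the usage once per second while the engine builds.)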

chilo-ms commented 2 months ago

I did get the model from Scott, but I encountered a different issue which seems related to a Concat node's axis attribute:

trtexec 10.0.1 and 8.6 -> This version of TensorRT does not support dynamic axes
TRT EP -> Error Code 4: Miscellaneous (IConcatenationLayer Concat_75: Concat_75: axis 3 dimensions must be equal for concatenation on axis 1.)

Will check with Scott. Or could you share the model again to make sure I'm using the same model as you?

geraldstanje commented 2 months ago

@chilo-ms Does trt_max_workspace_size depend on the GPU memory? The Nvidia T4 has 16 GB of GDDR6 memory, so can I set trt_max_workspace_size to 16 GB?

chilo-ms commented 2 months ago

> @chilo-ms Does trt_max_workspace_size depend on the GPU memory? The Nvidia T4 has 16 GB of GDDR6 memory, so can I set trt_max_workspace_size to 16 GB?

yes, give it a try.

chilo-ms commented 1 month ago

Update here.

I saw a similar OOM message when the workspace size was 2G, running input with two 4K images (1x6x3840x2176). Then I increased the workspace size to 16G ('trt_max_workspace_size': 17179869184) and the TRT EP can successfully run the model with the two 4K inputs.
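
(For anyone setting the same 16G limit through the C API's string-based V2 provider options, a minimal sketch with error handling trimmed; the helper name is illustrative:

    #include <onnxruntime_cxx_api.h>

    void AppendTrtWith16G(Ort::SessionOptions& session_options) {
      const OrtApi& api = Ort::GetApi();
      OrtTensorRTProviderOptionsV2* trt_options = nullptr;
      Ort::ThrowOnError(api.CreateTensorRTProviderOptions(&trt_options));
      const char* keys[] = {"trt_max_workspace_size"};
      const char* values[] = {"17179869184"};  // 16 GB in bytes
      Ort::ThrowOnError(api.UpdateTensorRTProviderOptions(trt_options, keys, values, 1));
      Ort::ThrowOnError(api.SessionOptionsAppendExecutionProvider_TensorRT_V2(session_options, trt_options));
      api.ReleaseTensorRTProviderOptions(trt_options);
    }
)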

jywu-msft commented 3 weeks ago

Closing this since @chilo-ms provided the last update on increasing the workspace size.