Describe the bug
Hi, I am trying to use ONNX Runtime for inference on the Jetson TX2, but I am running into a challenging performance issue. I am trying to deploy a SiamRPN network on the Jetson and could not use TensorRT, since NVIDIA does not support one of the layers. ONNX Runtime does support it, but the performance is not as expected.
DESCRIPTION OF USAGE
I built ONNX Runtime with CUDA support and used the C++ / C ONNX Runtime API for inference.
The inputs and outputs are bound to GPU memory using:
auto memory_info = Ort::MemoryInfo("Cuda", OrtAllocatorType::OrtArenaAllocator, 0, OrtMemTypeDefault);
and passed this to
Ort::Value::CreateTensor(memory_info, ...)
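Putting those two calls together, a minimal sketch of how a GPU-resident tensor can be created; the buffer pointer, shape, and element count here are hypothetical placeholders, and the device buffer is assumed to be allocated elsewhere (e.g. with cudaMalloc):

```cpp
#include <onnxruntime_cxx_api.h>

#include <array>
#include <cstddef>
#include <cstdint>

// Sketch: wrap a pre-allocated CUDA buffer as an Ort::Value without copying.
// `gpu_buffer` is assumed to point at device memory; the NCHW shape below
// is only an example.
Ort::Value MakeCudaTensor(float* gpu_buffer) {
  auto memory_info = Ort::MemoryInfo(
      "Cuda", OrtAllocatorType::OrtArenaAllocator, /*device_id=*/0,
      OrtMemTypeDefault);
  static std::array<int64_t, 4> shape{1, 3, 255, 255};  // example shape
  const size_t element_count = 1 * 3 * 255 * 255;
  return Ort::Value::CreateTensor<float>(
      memory_info, gpu_buffer, element_count, shape.data(), shape.size());
}
```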
I created an OrtCUDAProviderOptions struct, filled in the relevant fields (device_id, arena_extend_strategy, cudnn_conv_algo_search = EXHAUSTIVE), and added it to the session options:
session_options.AppendExecutionProvider_CUDA(options);
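For reference, the provider setup described above looks roughly like this in the 1.7 C++ API; the field values are the ones mentioned, everything else is left at its default:

```cpp
#include <onnxruntime_cxx_api.h>

// Sketch: configure the CUDA execution provider as described above.
void ConfigureCudaProvider(Ort::SessionOptions& session_options) {
  OrtCUDAProviderOptions options{};
  options.device_id = 0;
  options.arena_extend_strategy = 0;  // default arena growth strategy
  options.cudnn_conv_algo_search = OrtCudnnConvAlgoSearch::EXHAUSTIVE;
  session_options.AppendExecutionProvider_CUDA(options);
}
```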
and ran it with a call to
void Run(const RunOptions& run_options, const char* const* input_names, const Value* input_values, size_t input_count, const char* const* output_names, Value* output_values, size_t output_count);
Input and output buffers are allocated on the GPU, and execution completes successfully.
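A sketch of a single inference call with pre-bound GPU tensors; the session, the name arrays, and the Ort::Values are assumed to be set up as described above:

```cpp
#include <onnxruntime_cxx_api.h>

#include <cstddef>

// Sketch: one inference call where inputs/outputs already wrap GPU buffers,
// so no host<->device copies are requested by the caller.
void RunOnce(Ort::Session& session,
             const char* const* input_names, const Ort::Value* inputs,
             size_t input_count,
             const char* const* output_names, Ort::Value* outputs,
             size_t output_count) {
  Ort::RunOptions run_options;
  session.Run(run_options, input_names, inputs, input_count,
              output_names, outputs, output_count);
}
```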
PROBLEM OBSERVED
The execution time is very slow (slightly slower than CPU). To debug this I used the ONNX Runtime profiler and observed the following:
I ignored the first iteration, where some initial caching may occur. Across the nine following iterations, operator execution seems fast, but the model_run phase continues and blocks execution for a long time:
538 ms for model_run
8 ms for SequentialExecutor::Execute
I cannot figure out why model_run takes so much time after SequentialExecutor::Execute completes.
After verification, all operations appear to run on the GPU, and the inputs and outputs should be on the GPU.
Do you have any idea what could be going wrong in this scenario?
Urgency
This performance issue is blocking execution and deployment.
System information
OS: Linux4Tegra
ONNX Runtime version: 1.7.0, built from source with CUDA support
Python version: 3.6
GCC version: 7.5.0
CUDA/cuDNN version: CUDA 10.2 / cuDNN 8.0
GPU model and memory: Jetson TX2 (NVIDIA Pascal, 256 CUDA cores)
To Reproduce
Expected behavior
Faster inference time.
Thanks for your attention. Don't hesitate to contact me for additional information.