microsoft / onnxruntime

ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator
https://onnxruntime.ai
MIT License

onnxruntime Jetson TX2 CUDA #8771

Open ArmandZampieri opened 3 years ago

ArmandZampieri commented 3 years ago

Describe the bug
Hi, I am trying to use onnxruntime for inference on the Jetson TX2, but I have run into a challenging performance issue. I am trying to deploy a SiamRPN network on the Jetson and could not use TensorRT for this, since NVIDIA does not support one of the layers. onnxruntime does support it, but the performance is not as expected.

DESCRIPTION OF USAGE

I built onnxruntime with CUDA support and used the C++/C onnxruntime API for inference. The I/O is bound to GPU memory by creating

auto memory_info = Ort::MemoryInfo("Cuda", OrtAllocatorType::OrtArenaAllocator, 0, OrtMemTypeDefault);

and passing this to Ort::Value::CreateTensor(memory_info, ...).
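For reference, a minimal sketch of that binding, assuming a single float input of shape {1, 3, 255, 255} (a hypothetical SiamRPN-style shape; the real shapes are not given in this report) already resident in GPU memory:

```cpp
// Sketch: wrap a pre-allocated CUDA buffer as an Ort::Value without copying.
// d_input is assumed to point to device memory on CUDA device 0.
#include <onnxruntime_cxx_api.h>
#include <array>

Ort::Value MakeGpuTensor(float* d_input) {
  // Describe the buffer as living on CUDA device 0.
  Ort::MemoryInfo memory_info("Cuda", OrtAllocatorType::OrtArenaAllocator,
                              /*device_id=*/0, OrtMemTypeDefault);
  std::array<int64_t, 4> shape{1, 3, 255, 255};  // hypothetical shape
  const size_t element_count = 1 * 3 * 255 * 255;
  // ORT reads/writes the device buffer in place; no host round-trip here.
  return Ort::Value::CreateTensor<float>(memory_info, d_input, element_count,
                                         shape.data(), shape.size());
}
```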

I used an OrtCUDAProviderOptions struct and filled in the important fields (device_id, arena_extend_strategy, cudnn_conv_algo_search = EXHAUSTIVE), added it to the session options with session_options.AppendExecutionProvider_CUDA(options);, and ran inference with a call to

void Run(const RunOptions& run_options, const char* const* input_names, const Value* input_values, size_t input_count, const char* const* output_names, Value* output_values, size_t output_count);
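Putting the pieces together, the setup described above looks roughly like this. This is a sketch, not the author's exact code: the model path "siamrpn.onnx", the input/output names, and the device pointers (d_template, d_search, d_cls, d_reg) are all hypothetical.

```cpp
#include <onnxruntime_cxx_api.h>

Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "siamrpn");

// CUDA execution provider options as described above.
OrtCUDAProviderOptions cuda_options{};
cuda_options.device_id = 0;
cuda_options.arena_extend_strategy = 0;  // kNextPowerOfTwo
cuda_options.cudnn_conv_algo_search = OrtCudnnConvAlgoSearchExhaustive;

Ort::SessionOptions session_options;
session_options.AppendExecutionProvider_CUDA(cuda_options);

Ort::Session session(env, "siamrpn.onnx", session_options);  // hypothetical path

// Run with GPU-bound Ort::Values (see MakeGpuTensor above; in practice each
// tensor gets its own shape). Names below are hypothetical placeholders.
const char* input_names[]  = {"template", "search"};
const char* output_names[] = {"cls", "reg"};
Ort::Value inputs[]  = {MakeGpuTensor(d_template), MakeGpuTensor(d_search)};
Ort::Value outputs[] = {MakeGpuTensor(d_cls), MakeGpuTensor(d_reg)};
session.Run(Ort::RunOptions{nullptr}, input_names, inputs, 2,
            output_names, outputs, 2);
```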

Input and output buffers are allocated on the GPU, and execution completes successfully.

PROBLEM OBSERVED
The execution time is very slow (a little slower than on CPU). To debug this I used the onnxruntime profiler and observed the attached profiler screenshot.
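For anyone trying to reproduce the trace: profiling was presumably enabled along these lines (a sketch; the exact setup and file prefix are not shown in this report):

```cpp
// Write a Chrome-tracing JSON profile; the prefix below is hypothetical.
session_options.EnableProfiling("siamrpn_profile");
// ... create the session and run the iterations ...
// The resulting *.json file can be inspected in chrome://tracing.
```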

I ignored the first iteration, where some initial caching may occur. On the 9 subsequent runs, the individual operators execute quickly, but the model_run phase continues and blocks execution for a long time afterwards: 538 ms for model_run versus 8 ms for SequentialExecutor::Execute.

I cannot find the reason why so much time is spent after SequentialExecutor::Execute completes. After verification, all operations appear to run on the GPU, and the inputs and outputs should already be in GPU memory.

Do you have any idea of what could be going wrong in this scenario?

Urgency
This performance issue is blocking execution and deployment.

System information

To Reproduce

Expected behavior
Faster inference time.

Thanks for your attention. Don't hesitate to contact me for additional information.

stale[bot] commented 2 years ago

This issue has been automatically marked as stale due to inactivity and will be closed in 7 days if no further activity occurs. If further support is needed, please provide an update and/or more details.