microsoft / onnxruntime

ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator
https://onnxruntime.ai
MIT License

onnxruntime Jetson TX2 CUDA #8771

Open ArmandZampieri opened 3 years ago

ArmandZampieri commented 3 years ago

Describe the bug
Hi, I am trying to use onnxruntime for inference on the Jetson TX2, but I have run into a challenging performance issue. I am trying to deploy a SiamRPN network on the Jetson and could not use TensorRT for this, since NVIDIA does not support one of the layers. onnxruntime does support it, but the performance is not as expected.

DESCRIPTION OF USAGE

I built onnxruntime with CUDA support and used the C++/C onnxruntime API for inference. The I/O is bound to GPU memory by creating

auto memory_info = Ort::MemoryInfo("Cuda", OrtAllocatorType::OrtArenaAllocator, 0, OrtMemTypeDefault);

and passing this to Ort::Value::CreateTensor(memory_info, ...).
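For reference, a minimal sketch of that binding, assuming a single float input of shape {1, 3, 255, 255} (a hypothetical SiamRPN-style shape; the real shapes are not given in this report) already resident in GPU memory:

```cpp
// Sketch: wrap a pre-allocated CUDA buffer as an Ort::Value without copying.
// d_input is assumed to point to device memory on CUDA device 0.
#include <onnxruntime_cxx_api.h>
#include <array>

Ort::Value MakeGpuTensor(float* d_input) {
  // Describe the buffer as living on CUDA device 0.
  Ort::MemoryInfo memory_info("Cuda", OrtAllocatorType::OrtArenaAllocator,
                              /*device_id=*/0, OrtMemTypeDefault);
  std::array<int64_t, 4> shape{1, 3, 255, 255};  // hypothetical shape
  const size_t element_count = 1 * 3 * 255 * 255;
  // ORT reads/writes the device buffer in place; no host round-trip here.
  return Ort::Value::CreateTensor<float>(memory_info, d_input, element_count,
                                         shape.data(), shape.size());
}
```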

I used an OrtCUDAProviderOptions struct and filled in the important fields (device_id, arena_extend_strategy, cudnn_conv_algo_search = EXHAUSTIVE), added it to the session options with session_options.AppendExecutionProvider_CUDA(options);, and ran inference with a call to

void Run(const RunOptions& run_options, const char* const* input_names, const Value* input_values, size_t input_count, const char* const* output_names, Value* output_values, size_t output_count);
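Putting the pieces together, the setup described above looks roughly like this. This is a sketch, not the author's exact code: the model path "siamrpn.onnx", the input/output names, and the device pointers (d_template, d_search, d_cls, d_reg) are all hypothetical.

```cpp
#include <onnxruntime_cxx_api.h>

Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "siamrpn");

// CUDA execution provider options as described above.
OrtCUDAProviderOptions cuda_options{};
cuda_options.device_id = 0;
cuda_options.arena_extend_strategy = 0;  // kNextPowerOfTwo
cuda_options.cudnn_conv_algo_search = OrtCudnnConvAlgoSearchExhaustive;

Ort::SessionOptions session_options;
session_options.AppendExecutionProvider_CUDA(cuda_options);

Ort::Session session(env, "siamrpn.onnx", session_options);  // hypothetical path

// Run with GPU-bound Ort::Values (see MakeGpuTensor above; in practice each
// tensor gets its own shape). Names below are hypothetical placeholders.
const char* input_names[]  = {"template", "search"};
const char* output_names[] = {"cls", "reg"};
Ort::Value inputs[]  = {MakeGpuTensor(d_template), MakeGpuTensor(d_search)};
Ort::Value outputs[] = {MakeGpuTensor(d_cls), MakeGpuTensor(d_reg)};
session.Run(Ort::RunOptions{nullptr}, input_names, inputs, 2,
            output_names, outputs, 2);
```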

Input and output buffers are allocated on the GPU, and execution completes successfully.

PROBLEM OBSERVED
The execution time is very slow (a little slower than on CPU). To debug this I used the onnxruntime profiler and observed the attached profiler screenshot.
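For anyone trying to reproduce the trace: profiling was presumably enabled along these lines (a sketch; the exact setup and file prefix are not shown in this report):

```cpp
// Write a Chrome-tracing JSON profile; the prefix below is hypothetical.
session_options.EnableProfiling("siamrpn_profile");
// ... create the session and run the iterations ...
// The resulting *.json file can be inspected in chrome://tracing.
```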

I ignored the first iteration, where some initial caching may occur. On the 9 subsequent runs, the individual operators execute quickly, but the model_run phase continues and blocks execution for a long time afterwards: 538 ms for model_run versus 8 ms for SequentialExecutor::Execute.

I cannot find the reason why so much time is spent after SequentialExecutor::Execute completes. After verification, all operations appear to run on the GPU, and the inputs and outputs should already be in GPU memory.

Do you have any idea of what could be going wrong in this scenario?

Urgency
This performance issue is blocking execution and deployment.

System information

To Reproduce

Expected behavior
Faster inference time.

Thanks for your attention. Don't hesitate to contact me for additional information.

stale[bot] commented 2 years ago

This issue has been automatically marked as stale due to inactivity and will be closed in 7 days if no further activity occurs. If further support is needed, please provide an update and/or more details.