microsoft / onnxruntime

ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator
https://onnxruntime.ai
MIT License

[Performance] running on xavier gpu but cpu usage high #14676

Open mengxia1994 opened 1 year ago

mengxia1994 commented 1 year ago

Describe the issue

Running on Xavier GPU with the CUDA provider, but CPU usage is high. Using the `top` command I can see the GPU is used successfully (8~9 fps), but the process also uses 150% of one CPU core. I want to know why, and how to solve it. If I use cpulimit to keep it under 50%, inference slows down (3 fps). Can IOBinding help? Is it because some of the ops are not supported on CUDA, so data has to be computed and copied on the CPU? The ONNX model is generated and runs successfully with opset 16, without custom plugins. I searched the issues and found that most related ones are about memory, not CPU usage.

To reproduce

My code is like this (advice on the code is also welcome):

    this->env = Ort::Env(OrtLoggingLevel::ORT_LOGGING_LEVEL_WARNING, "lane");
    Ort::SessionOptions session_options;
    OrtSessionOptionsAppendExecutionProvider_CUDA(session_options, 0);
    session_options.SetGraphOptimizationLevel(GraphOptimizationLevel::ORT_ENABLE_ALL);
    Ort::AllocatorWithDefaultOptions allocator;
    this->session = new Ort::Session(this->env, model_path.c_str(), session_options);

    auto input_tensors = Ort::Value::CreateTensor<float>(
        memory_info, (float*)Transed_t.data, this->width * this->height * 3,
        this->input_rgb_dims.data(), this->input_rgb_dims.size());

    std::vector<Ort::Value> output_tensors = this->session->Run(
        Ort::RunOptions{nullptr},
        input_node_names.data(), &input_tensors, 1,           // 1 input
        output_node_names.data(), output_node_names.size());  // 5 outputs

Urgency

Very urgent.

Platform

Linux

OS Version

4.9.201

ONNX Runtime Installation

Built from Source

ONNX Runtime Version or Commit ID

1.12

ONNX Runtime API

C++

Architecture

ARM64

Execution Provider

CUDA

Execution Provider Library Version

cuda 10.2 (jetpack 4.5)

Model File

No response

Is this a quantized model?

No

mengxia1994 commented 1 year ago

By the way, I have to use ONNX Runtime because TensorRT 7.1 gives incorrect (inaccurate) inference results, and the TensorRT version is limited by JetPack. TensorRT 7.2 works with a custom plugin.

tianleiwu commented 1 year ago

@mengxia1994,

It's possible that some operators are executed on the CPU. Follow this example to run profiling: https://onnxruntime.ai/docs/api/python/auto_examples/plot_profiling.html

You can search for "CPUExecutionProvider" in the output JSON to find which nodes run on the CPU, or use a script to compute statistics and find the operators that spend the most time on the CPU, e.g.:

     python -m onnxruntime.transformers.profiler --input profile_2021-10-25_12-02-41.json

The source code of the tool doing statistics is here: https://github.com/microsoft/onnxruntime/blob/37033975bbc8fee7f32073b217654f308c529ccd/onnxruntime/python/tools/transformers/profiler.py#L269-L318

mengxia1994 commented 1 year ago

> It's possible that some operators are executed on the CPU. Follow this example to run profiling: https://onnxruntime.ai/docs/api/python/auto_examples/plot_profiling.html
>
> You can search for "CPUExecutionProvider" in the output JSON to find which nodes run on the CPU, or use a script to compute statistics and find the operators that spend the most time on the CPU, e.g.:
>
>     python -m onnxruntime.transformers.profiler --input profile_2021-10-25_12-02-41.json
>
> The source code of the statistics tool is here: https://github.com/microsoft/onnxruntime/blob/37033975bbc8fee7f32073b217654f308c529ccd/onnxruntime/python/tools/transformers/profiler.py#L269-L318

Thank you for your reply, I will try it. Assuming I locate the CPUExecutionProvider nodes, what can I do next to optimize?

tianleiwu commented 1 year ago

> Assuming I locate the CPUExecutionProvider nodes, what can I do next to optimize?

The solution is case by case: some cases only need exporting the model with a different opset version; some need the operator implemented in CUDA; some need graph fusion so that contrib ops replace a group of nodes.

wendwosenbb commented 10 months ago

@tianleiwu could you please elaborate on what those cases are? I am having an issue with ONNX Runtime using a lot of CPU even though CUDAExecutionProvider is set. The model is actually running on the GPU, but it also uses significant CPU. I am resource-constrained on the CPU side (other processes are using it) but have a decent GPU (GeForce RTX 2060 Mobile, 6144 MiB). I appreciate the help.

mengxia1994 commented 10 months ago

Hello, I have received your email; I will read it and reply as soon as possible. Thank you! (Mengxia's auto-reply)

wendwosenbb commented 10 months ago

@mengxia1994 have you solved this problem? Could you please give me some suggestions?