Open mengxia1994 opened 1 year ago
By the way, I have to use onnxruntime because the TensorRT 7.1 inference result is not accurate, and I am limited to that version by JetPack. TensorRT 7.2 works with a custom plugin.
@mengxia1994,
It's possible that some operators are executed on the CPU. Follow this example to run profiling: https://onnxruntime.ai/docs/api/python/auto_examples/plot_profiling.html
You can search for "CPUExecutionProvider" in the output JSON to find which nodes are executed on the CPU, or use a script to aggregate statistics and find the operators that spend the most time on the CPU, like:
python -m onnxruntime.transformers.profiler --input profile_2021-10-25_12-02-41.json
The source code of the statistics tool is here: https://github.com/microsoft/onnxruntime/blob/37033975bbc8fee7f32073b217654f308c529ccd/onnxruntime/python/tools/transformers/profiler.py#L269-L318
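The CPU-time tally the profiler tool produces can also be sketched with the standard library alone. This is a minimal sketch, assuming the event layout that onnxruntime's profiler emits (a JSON list of events whose `args` dict carries `op_name` and `provider`); the sample events below are synthetic, hypothetical data for illustration, not output from a real run:

```python
import json
from collections import Counter

def cpu_time_by_op(profile_path):
    """Tally time (microseconds) spent per operator type on the CPU EP.

    Assumes each profiling event is a dict whose "args" holds
    "op_name" and "provider", as in onnxruntime's profile JSON.
    """
    with open(profile_path) as f:
        events = json.load(f)
    totals = Counter()
    for ev in events:
        args = ev.get("args", {})
        if args.get("provider") == "CPUExecutionProvider":
            totals[args.get("op_name", "?")] += ev.get("dur", 0)
    return totals.most_common()

# Tiny synthetic profile to illustrate the expected shape (hypothetical data):
sample = [
    {"cat": "Node", "name": "conv1_kernel_time", "dur": 120,
     "args": {"op_name": "Conv", "provider": "CUDAExecutionProvider"}},
    {"cat": "Node", "name": "nms_kernel_time", "dur": 900,
     "args": {"op_name": "NonMaxSuppression", "provider": "CPUExecutionProvider"}},
]
with open("profile_sample.json", "w") as f:
    json.dump(sample, f)

print(cpu_time_by_op("profile_sample.json"))  # prints [('NonMaxSuppression', 900)]
```

The operators at the top of that list are the ones worth attacking first.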
Thank you for your reply, I will try it. Assuming I have located the CPUExecutionProvider nodes, what can I do next to optimize?
> Assuming I have located the CPUExecutionProvider nodes, what can I do next to optimize?
The solution is case by case. Some models only need to be exported with a different opset version; some need the operator implemented in CUDA; some need graph fusion so that contrib ops replace a group of nodes.
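The common first step behind all three cases is finding out which op types the CUDA EP will not take, so they fall back to the CPU. A conceptual sketch of that check, using a small hypothetical supported-op set (the real CUDA EP registry is much larger, and the profiler output is the authoritative source):

```python
# Conceptual sketch: which op types in a model will fall back to the CPU EP.
# The supported-op set below is a hypothetical sample, NOT the real CUDA EP
# registry; in practice, profiling the session tells you what actually ran where.

def cpu_fallback_ops(model_op_types, cuda_supported_ops):
    """Return op types the CUDA EP cannot take, deduplicated, in model order."""
    seen = set()
    fallback = []
    for op in model_op_types:
        if op not in cuda_supported_ops and op not in seen:
            seen.add(op)
            fallback.append(op)
    return fallback

cuda_ops = {"Conv", "Relu", "MaxPool", "Gemm", "Add"}            # hypothetical subset
model_ops = ["Conv", "Relu", "NonMaxSuppression", "Gemm", "TopK"]
print(cpu_fallback_ops(model_ops, cuda_ops))  # prints ['NonMaxSuppression', 'TopK']
```

Once a fallback op is identified, the remedy depends on why it fell back: a newer opset may map it to a supported kernel, a CUDA kernel may need to be written, or a fusion pass may fold the offending subgraph into a single contrib op.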
@tianleiwu could you please elaborate on what those cases mean? I am having an issue with onnxruntime using a lot of CPU even though CUDAExecutionProvider has been set. The model is actually running on the GPU, but it also uses the CPU significantly. I am resource-constrained on the CPU side (other processes are taking it) but have a decent GPU (GeForce RTX 2060 Mobile, 6144 MiB). I appreciate the help.
Hello, I have received your email and will review it and reply as soon as possible, thank you! (Mengxia's auto-reply)
@mengxia1994 have you solved this problem? Could you please give me some suggestions?
Describe the issue
Running on the Xavier GPU with the CUDA provider, but CPU usage is high. Using the command `top`, I can see it uses the GPU successfully (8~9 fps), but it also uses 150% of one CPU core. I want to know why, and how to solve it. If I use cpulimit to cap it under 50%, inference becomes slower (3 fps). Can IOBinding help? Is it because some of the ops are not supported on CUDA, so it has to compute and copy data on the CPU? The ONNX model can be generated and inferenced successfully with opset 16, without custom plugins. I searched the issues and found that most related issues are about memory, not CPU usage.
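One way to narrow down the "copy data on CPU" hypothesis is a back-of-envelope estimate of the per-frame host-to-device copy volume. A minimal sketch, assuming a hypothetical 1280x720 float RGB input (substitute the model's real input shape) and the 9 fps reported above:

```python
# Back-of-envelope: host->device copy volume for a float RGB input tensor.
# The 1280x720 resolution is a hypothetical example, not taken from the model.
width, height, channels, bytes_per_float = 1280, 720, 3, 4
frame_bytes = width * height * channels * bytes_per_float
fps = 9  # throughput reported in the issue
mb_per_s = frame_bytes * fps / 1e6
print(f"{frame_bytes / 1e6:.1f} MB/frame, {mb_per_s:.0f} MB/s at {fps} fps")
# prints 11.1 MB/frame, 100 MB/s at 9 fps
```

A copy volume of that order is modest for a Jetson-class memory bus, so sustained 150% CPU load more likely comes from operators assigned to the CPU EP (which profiling can confirm) or from image pre-processing, than from the input/output copies themselves; IOBinding mainly helps when outputs feed the next GPU stage and round trips can be skipped.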
To reproduce
my code is like (code advice also welcome):

```cpp
this->env = Ort::Env(OrtLoggingLevel::ORT_LOGGING_LEVEL_WARNING, "lane");
Ort::SessionOptions session_options;
OrtSessionOptionsAppendExecutionProvider_CUDA(session_options, 0);
session_options.SetGraphOptimizationLevel(GraphOptimizationLevel::ORT_ENABLE_ALL);
Ort::AllocatorWithDefaultOptions allocator;
this->session = new Ort::Session(this->env, model_path.c_str(), session_options);

auto input_tensors = Ort::Value::CreateTensor<float>(
    memory_info, (float*)Transed_t.data, this->width * this->height * 3,
    this->input_rgb_dims.data(), this->input_rgb_dims.size());

std::vector<Ort::Value> output_tensors = this->session->Run(
    Ort::RunOptions{nullptr},
    input_node_names.data(), &input_tensors, 1,          // 1 input tensor
    output_node_names.data(), output_node_names.size()); // 5 outputs
```
Urgency
Very urgent.
Platform
Linux
OS Version
4.9.201
ONNX Runtime Installation
Built from Source
ONNX Runtime Version or Commit ID
1.12
ONNX Runtime API
C++
Architecture
ARM64
Execution Provider
CUDA
Execution Provider Library Version
cuda 10.2 (jetpack 4.5)
Model File
No response
Is this a quantized model?
No