mjay2016 opened this issue 3 years ago
Which opset is your model using ?
I'm using onnxruntime version 1.7, i.e. opset 13.
Hi @mjay2016 : Are you able to build from source and try with the Pad kernel registration fix ? Do you see these warnings go away and does the perf improve ?
Yes @hariharans29. I was able to run the new version and there are no more warnings for the Pad kernel. Thanks.
Thanks! Is the perf better now ?
@hariharans29 Nope. No improvement in the performance. Any suggestions on how to debug further to see where the issue is?
Do you know which kernels are the most expensive ? Using nvprof to profile (using random data to run the model will do) and sharing that info would be a good start to debugging the issue.
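For reference, a minimal sketch of building a random-filled input tensor with the C API for such a profiling run could look like the following (the 1x3x224x224 float shape and the MakeRandomInput name are placeholders, and status checks are omitted; substitute your model's real input shape):
#include <onnxruntime_c_api.h>
#include <random>
#include <vector>

// Builds a CPU tensor filled with random floats for profiling purposes.
OrtValue* MakeRandomInput(const OrtApi* ort_api) {
    std::vector<int64_t> shape = {1, 3, 224, 224};   // replace with the model's actual input shape
    size_t element_count = 1 * 3 * 224 * 224;
    static std::vector<float> data(element_count);   // static so the buffer outlives this call
    std::mt19937 gen(42);
    std::uniform_real_distribution<float> dist(0.0f, 1.0f);
    for (auto& v : data) v = dist(gen);

    OrtMemoryInfo* memory_info = nullptr;
    ort_api->CreateCpuMemoryInfo(OrtArenaAllocator, OrtMemTypeDefault, &memory_info);

    OrtValue* input_tensor = nullptr;
    // The tensor wraps the buffer above; the data is not copied.
    ort_api->CreateTensorWithDataAsOrtValue(
        memory_info, data.data(), element_count * sizeof(float),
        shape.data(), shape.size(),
        ONNX_TENSOR_ELEMENT_DATA_TYPE_FLOAT, &input_tensor);
    return input_tensor;
}
Running the inference binary under nvprof (e.g. nvprof ./your_app, with your_app as a placeholder) then prints the per-kernel GPU time summary at exit.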
I will try this and get back to you.
@hariharans29
Here is the code snippet where I'm measuring the runtime:
std::chrono::steady_clock::time_point begin = std::chrono::steady_clock::now();
for (int i = 0; i < 2; i++)
{
    std::cout << "Inference " << i << std::endl;
    // Run the session: 1 input tensor, 1 output tensor.
    OrtErrorHanlder(ort_api_handle->Run(ort_session, NULL, input_names, (const OrtValue* const*)&input_tensor, 1, output_names, 1, &output_tensor));
}
std::chrono::steady_clock::time_point end = std::chrono::steady_clock::now();
// The value printed below is the average over the 2 runs.
std::cout << "Average Inference Latency: "
          << std::chrono::duration_cast<std::chrono::milliseconds>(end - begin).count() / static_cast<float>(2)
          << " ms" << std::endl;
Here is the nvprof log. I've listed only the operations that I see taking the most time. Let me know if further debugging is required.
Type             Time(%)   Time       Calls   Avg        Min        Max        Name
GPU activities:  44.09%    921.22ms   42      21.934ms   2.5311ms   80.672ms   void onnxruntime::cuda::_PadKernel<float, int=1>
                 5.38%     112.44ms   14      8.0313ms   1.0356ms   23.598ms   volta_scudnn_128x64_relu_interior_nn_v1
                 5.24%     109.41ms   9       12.157ms   10.855ms   21.809ms   volta_sgemm_128x128_nn
                 5.12%     107.03ms   10      10.703ms   3.3354ms   24.753ms   void implicit_convolve_sgemm
                 3.41%     71.236ms   18      3.9575ms   2.1426ms   4.2289ms   void implicit_convolve_sgemm
                 3.31%     69.251ms   40      1.7313ms   1.3258ms   3.8783ms   volta_scudnn_winograd_128x128_ldg1_ldg4_relu_tile148t_nt_v1
                 1.61%     33.682ms   9       3.7424ms   2.8026ms   11.218ms   void cudnn::winograd_nonfused::winogradForwardData4x4
                 1.18%     24.580ms   9       2.7312ms   2.4574ms   4.9061ms   void cudnn::winograd_nonfused::winogradForwardOutput4x4
                 1.10%     22.980ms   13      1.7677ms   609.50us   7.5991ms   void cudnn::cnn::im2col4d_kernel
                 0.66%     13.817ms   1       13.817ms   13.817ms   13.817ms   void explicit_convolve_sgemm
Thanks. It looks like Pad's performance is poor and that needs to be addressed, but it does not seem to be the only problem, since Pad appears to account for only about 80ms out of the 400ms. We would have to investigate possible improvements in other kernels as well. Can you please share your model? We will try to get to it soon.
Hi @mjay2016,
Pad's performance is now improved (https://github.com/microsoft/onnxruntime/pull/8408). Based on your nvprof output, I would expect that the perf of your model should be much better after this change. Can you please re-build from master and give it a shot ?
Hi
Seeing different execution times for a PyTorch model and its ONNX counterpart with onnxruntime on an Nvidia GPU. Running inference on the PyTorch model takes 76ms, while onnxruntime takes 400ms.
Getting the following message while inferring the model with onnxruntime. I believe the performance impact is because the operation is not found and onnxruntime is executing it on the CPU. If this is the problem, how do I fix it?
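For context, a minimal sketch of how the CPU fallback can be confirmed with the C API: creating the environment with verbose logging makes onnxruntime log the execution-provider assignment of each node (the CreateVerboseEnv name and the "placement-check" log id are placeholders; status check omitted):
#include <onnxruntime_c_api.h>

// With verbose logging, onnxruntime logs which execution provider
// (CUDA vs. CPU) each node in the graph is assigned to.
OrtEnv* CreateVerboseEnv() {
    const OrtApi* ort_api = OrtGetApiBase()->GetApi(ORT_API_VERSION);
    OrtEnv* env = nullptr;
    ort_api->CreateEnv(ORT_LOGGING_LEVEL_VERBOSE, "placement-check", &env);
    return env;
}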
System Details: