microsoft / onnxruntime

ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator
https://onnxruntime.ai
MIT License

CUDA kernel not found in registries for Op type: Pad #7779

Open mjay2016 opened 3 years ago

mjay2016 commented 3 years ago

Hi

I'm seeing different execution times for the PyTorch model and the ONNX model with onnxruntime on an Nvidia GPU. Running inference on the PyTorch model takes 76ms, while onnxruntime takes 400ms.

I get the following messages while running inference on the model with onnxruntime. I believe the performance impact is because these ops are not found for the CUDA provider, so onnxruntime executes them on the CPU. If this is the problem, how do I fix it?

2021-05-20 15:28:09.391103916 [W:onnxruntime:Default, cuda_execution_provider.cc:1983 GetCapability] CUDA kernel not found in registries for Op type: Pad node name: Pad_13
2021-05-20 15:28:09.391193212 [W:onnxruntime:Default, cuda_execution_provider.cc:1983 GetCapability] CUDA kernel not found in registries for Op type: Pad node name: Pad_40
2021-05-20 15:28:09.391268284 [W:onnxruntime:Default, cuda_execution_provider.cc:1983 GetCapability] CUDA kernel not found in registries for Op type: Pad node name: Pad_67
2021-05-20 15:28:09.391325247 [W:onnxruntime:Default, cuda_execution_provider.cc:1983 GetCapability] CUDA kernel not found in registries for Op type: Pad node name: Pad_94
2021-05-20 15:28:09.391381860 [W:onnxruntime:Default, cuda_execution_provider.cc:1983 GetCapability] CUDA kernel not found in registries for Op type: Pad node name: Pad_121
2021-05-20 15:28:09.391446115 [W:onnxruntime:Default, cuda_execution_provider.cc:1983 GetCapability] CUDA kernel not found in registries for Op type: Pad node name: Pad_149
2021-05-20 15:28:09.391502132 [W:onnxruntime:Default, cuda_execution_provider.cc:1983 GetCapability] CUDA kernel not found in registries for Op type: Pad node name: Pad_176
2021-05-20 15:28:09.391586135 [W:onnxruntime:Default, cuda_execution_provider.cc:1983 GetCapability] CUDA kernel not found in registries for Op type: Pad node name: Pad_204
2021-05-20 15:28:09.391647495 [W:onnxruntime:Default, cuda_execution_provider.cc:1983 GetCapability] CUDA kernel not found in registries for Op type: Pad node name: Pad_231
2021-05-20 15:28:09.391717329 [W:onnxruntime:Default, cuda_execution_provider.cc:1983 GetCapability] CUDA kernel not found in registries for Op type: Pad node name: Pad_259
2021-05-20 15:28:09.391778485 [W:onnxruntime:Default, cuda_execution_provider.cc:1983 GetCapability] CUDA kernel not found in registries for Op type: Pad node name: Pad_286
2021-05-20 15:28:09.391854389 [W:onnxruntime:Default, cuda_execution_provider.cc:1983 GetCapability] CUDA kernel not found in registries for Op type: Pad node name: Pad_315
2021-05-20 15:28:09.391921141 [W:onnxruntime:Default, cuda_execution_provider.cc:1983 GetCapability] CUDA kernel not found in registries for Op type: Pad node name: Pad_343
2021-05-20 15:28:09.391982832 [W:onnxruntime:Default, cuda_execution_provider.cc:1983 GetCapability] CUDA kernel not found in registries for Op type: Pad node name: Pad_370

System Details:

  1. Onnxruntime: 1.7. Installed as pip install onnxruntime-gpu
  2. Python 3.8
  3. Ubuntu 20
  4. CUDA 11.1 and NVIDIA RTX 3090
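
For context, the session setup looks roughly like the sketch below (illustrative only: the model path, log id, and device id are placeholders, the cuda_provider_factory.h header name is per the 1.7 GPU package, and error handling is abbreviated). Creating the environment with verbose logging prints the execution provider assigned to each node, which is how the CPU fallback shows up:

    #include <stdio.h>
    #include <onnxruntime_c_api.h>
    #include <cuda_provider_factory.h>  // OrtSessionOptionsAppendExecutionProvider_CUDA (ORT 1.7 GPU package)

    int main(void)
    {
        const OrtApi* ort = OrtGetApiBase()->GetApi(ORT_API_VERSION);

        // Verbose logging prints per-node provider placement, making CPU fallbacks visible.
        OrtEnv* env = NULL;
        ort->CreateEnv(ORT_LOGGING_LEVEL_VERBOSE, "placement_check", &env);

        OrtSessionOptions* opts = NULL;
        ort->CreateSessionOptions(&opts);
        // Register the CUDA EP on device 0; nodes without a CUDA kernel stay on the CPU EP.
        OrtSessionOptionsAppendExecutionProvider_CUDA(opts, 0);

        OrtSession* session = NULL;
        OrtStatus* status = ort->CreateSession(env, "model.onnx", opts, &session);
        if (status != NULL)
        {
            printf("CreateSession failed: %s\n", ort->GetErrorMessage(status));
            ort->ReleaseStatus(status);
            return 1;
        }

        // ... run inference, then release session, options, and env ...
        return 0;
    }
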
hariharans29 commented 3 years ago

Which opset is your model using?

mjay2016 commented 3 years ago

> Which opset is your model using?

I'm using onnxruntime version 1.7, i.e. opset 13.

hariharans29 commented 3 years ago

Hi @mjay2016: Are you able to build from source and try it with the Pad kernel registration fix? Do the warnings go away, and does the perf improve?
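
(For reference, the Linux GPU build from source is roughly the command below; the CUDA/cuDNN paths are placeholders and the authoritative flags are in the build docs:)

    ./build.sh --config Release --use_cuda --cuda_home /usr/local/cuda --cudnn_home /usr/local/cuda --build_wheel --parallel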

mjay2016 commented 3 years ago

Yes @hariharans29. I was able to run the new version and there are no more warnings for the Pad kernel. Thanks.

hariharans29 commented 3 years ago

Thanks! Is the perf better now?

mjay2016 commented 3 years ago

@hariharans29 Nope, no improvement in the performance. Any suggestions on how to debug further to see where the issue is?

hariharans29 commented 3 years ago

Do you know which kernels are the most expensive? Using nvprof to profile (running the model with random data will do) and sharing that info would be a good start to debugging the issue.
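
(Something along these lines; the binary name is a placeholder:)

    nvprof ./ort_inference_app                      # per-kernel GPU time summary
    nvprof --print-gpu-trace ./ort_inference_app    # per-launch detail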

mjay2016 commented 3 years ago

> Do you know which kernels are the most expensive? Using nvprof to profile (running the model with random data will do) and sharing that info would be a good start to debugging the issue.

I will try this and get back to you.

mjay2016 commented 3 years ago

@hariharans29

Here is the code snippet where I'm measuring the runtime:

    // Time two back-to-back Run() calls and report the mean per-call latency.
    std::chrono::steady_clock::time_point begin = std::chrono::steady_clock::now();
    for (int i = 0; i < 2; i++)
    {
        std::cout << "Inference " << i << std::endl;
        OrtErrorHanlder(ort_api_handle->Run(ort_session, NULL, input_names,
                                            (const OrtValue* const*)&input_tensor, 1,
                                            output_names, 1, &output_tensor));
    }
    std::chrono::steady_clock::time_point end = std::chrono::steady_clock::now();
    std::cout << "Minimum Inference Latency: "
              << std::chrono::duration_cast<std::chrono::milliseconds>(end - begin).count() / static_cast<float>(2)
              << " ms" << std::endl;

Here is the nvprof log; I've listed only the operations that I see taking the most time. Let me know if any further debugging is required.

                Type  Time(%)      Time  Calls       Avg       Min       Max  Name
     GPU activities:   44.09%  921.22ms     42  21.934ms  2.5311ms  80.672ms  void onnxruntime::cuda::_PadKernel<float, int=1>
                         5.38%  112.44ms     14  8.0313ms  1.0356ms  23.598ms  volta_scudnn_128x64_relu_interior_nn_v1
                         5.24%  109.41ms      9  12.157ms  10.855ms  21.809ms  volta_sgemm_128x128_nn
                         5.12%  107.03ms     10  10.703ms  3.3354ms  24.753ms  void implicit_convolve_sgemm
                         3.41%  71.236ms     18  3.9575ms  2.1426ms  4.2289ms  void implicit_convolve_sgemm
                         3.31%  69.251ms     40  1.7313ms  1.3258ms  3.8783ms  volta_scudnn_winograd_128x128_ldg1_ldg4_relu_tile148t_nt_v1
                         1.61%  33.682ms      9  3.7424ms  2.8026ms  11.218ms  void cudnn::winograd_nonfused::winogradForwardData4x4
                         1.18%  24.580ms      9  2.7312ms  2.4574ms  4.9061ms  void cudnn::winograd_nonfused::winogradForwardOutput4x4
                         1.10%  22.980ms     13  1.7677ms  609.50us  7.5991ms  void cudnn::cnn::im2col4d_kernel
                         0.66%  13.817ms      1  13.817ms  13.817ms  13.817ms  void explicit_convolve_sgemm

hariharans29 commented 3 years ago

Thanks. It looks like Pad's performance is poor and needs to be addressed, but that is not the only problem, since Pad seems to account for only 80ms out of the 400ms. We would have to investigate possible improvements in other kernels as well. Can you please share your model? We will try to get to it soon.

hariharans29 commented 3 years ago

Hi @mjay2016,

Pad's performance is now improved (https://github.com/microsoft/onnxruntime/pull/8408). Based on your nvprof output, I would expect the perf of your model to be much better after this change. Can you please re-build from master and give it a shot?