microsoft / onnxruntime

ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator
https://onnxruntime.ai
MIT License

onnxruntime inference is around 5 times slower than pytorch when using GPU #10303

Open · nssrivathsa opened this issue 2 years ago

nssrivathsa commented 2 years ago

Describe the bug: Inference with onnxruntime is about 5x slower than the PyTorch model on GPU, but about 2.5x faster on CPU.

System information

To Reproduce

import os
import time

import onnx
import onnxruntime
import torch
import torchvision.models as models

batch_size = 1
total_samples = 1000
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

def convert_to_onnx(resnet):
   resnet.eval()
   dummy_input = (torch.randn(batch_size, 3, 224, 224, device=device)).to(device=device)
   input_names = [ 'input' ]
   output_names = [ 'output' ]
   torch.onnx.export(resnet, 
               dummy_input,
               "resnet18.onnx",
               verbose=True,
               opset_version=13,
               input_names=input_names,
               output_names=output_names,
               export_params=True,
               do_constant_folding=True,
               dynamic_axes={
                  'input': {0: 'batch_size'},  # variable length axes
                  'output': {0: 'batch_size'}}        
               )

def infer_pytorch(resnet):
   print('Pytorch Inference')
   print('==========================')
   print()

   x = torch.randn((batch_size, 3, 224, 224))
   x = x.to(device=device)

   latency = []
   for i in range(total_samples):
      t0 = time.time()
      resnet.eval()
      with torch.no_grad():
         out = resnet(x)
      latency.append(time.time() - t0)

   print('Number of runs:', len(latency))
   print("Average PyTorch {} Inference time = {} ms".format(device.type, format(sum(latency) * 1000 / len(latency), '.2f')))  

def to_numpy(tensor):
   return tensor.detach().cpu().numpy() if tensor.requires_grad else tensor.cpu().numpy()

def infer_onnxruntime():
   print('Onnxruntime Inference')
   print('==========================')
   print()

   onnx_model = onnx.load("resnet18.onnx")
   onnx.checker.check_model(onnx_model)

   # Input
   x = torch.randn((batch_size, 3, 224, 224))
   x = x.to(device=device)
   x = to_numpy(x)

   so = onnxruntime.SessionOptions()
   so.execution_mode = onnxruntime.ExecutionMode.ORT_SEQUENTIAL
   so.graph_optimization_level = onnxruntime.GraphOptimizationLevel.ORT_ENABLE_ALL

   exproviders = ['CUDAExecutionProvider', 'CPUExecutionProvider']

   model_onnx_path = os.path.join(".", "resnet18.onnx")
   ort_session = onnxruntime.InferenceSession(model_onnx_path, so, providers=exproviders)

   options = ort_session.get_provider_options()
   cuda_options = options['CUDAExecutionProvider']
   cuda_options['cudnn_conv_use_max_workspace'] = '1'
   ort_session.set_providers(['CUDAExecutionProvider'], [cuda_options])

   #IOBinding
   input_names = ort_session.get_inputs()[0].name
   output_names = ort_session.get_outputs()[0].name
   io_binding = ort_session.io_binding()

   io_binding.bind_cpu_input(input_names, x)
   io_binding.bind_output(output_names, device)

   #warm up run
   ort_session.run_with_iobinding(io_binding)
   ort_outs = io_binding.copy_outputs_to_cpu()

   latency = []

   for i in range(total_samples):
      t0 = time.time()
      ort_session.run_with_iobinding(io_binding)
      latency.append(time.time() - t0)
      ort_outs = io_binding.copy_outputs_to_cpu()
   print('Number of runs:', len(latency))
   print("Average onnxruntime {} Inference time = {} ms".format(device.type, format(sum(latency) * 1000 / len(latency), '.2f')))   

if __name__ == '__main__':
   torch.cuda.empty_cache()
   resnet = (models.resnet18(pretrained=True)).to(device=device)
   convert_to_onnx(resnet)
   infer_onnxruntime()
   infer_pytorch(resnet)

Current behavior

If run on CPU,

Average onnxruntime cpu Inference time = 18.48 ms
Average PyTorch cpu Inference time = 51.74 ms

but, if run on GPU, I see

Average onnxruntime cuda Inference time = 47.89 ms
Average PyTorch cuda Inference time = 8.94 ms

If I change graph optimizations to onnxruntime.GraphOptimizationLevel.ORT_DISABLE_ALL, I see some improvement in inference time on GPU, but it's still slower than PyTorch.

I had read about similar issues here and ensured that I do the IO binding so that the inputs are on the GPU. When converting the resnet to ONNX, I see traces like %193 : Float(64, 3, 7, 7, strides=[147, 49, 7, 1], requires_grad=0, device=cuda:0), so the nodes of the model are on the GPU.

Further, during the processing for onnxruntime, I print device usage stats and I see this -

Using device: cuda:0
GPU Device name: Quadro M2000M
Memory Usage:
Allocated: 0.1 GB
Cached:    0.1 GB

So, the GPU device is being used. Further, I have used the resnet18.onnx model from the Model Zoo to see if it is a converted model issue, but I get the same results.

So, I cannot seem to figure this out any further, and I have been stuck here for quite a few days. Could somebody please point out what the issue could be?

Thanks

adepierre commented 2 years ago

I think the problem is that you measure CPU time instead of GPU time in the pytorch case. Cuda is asynchronous, so CPU timers like time.time() won't measure the right inference time. You can try the solution provided in this pytorch discussion and see if it's more consistent with onnx time.

nssrivathsa commented 2 years ago

I think the problem is that you measure CPU time instead of GPU time in the pytorch case. Cuda is asynchronous, so CPU timers like time.time() won't measure the right inference time. You can try the solution provided in this pytorch discussion and see if it's more consistent with onnx time.

Thanks for the reply. I tried what was mentioned in the link you provided by modifying my infer_pytorch(resnet) function as:

   start = torch.cuda.Event(enable_timing=True)
   end = torch.cuda.Event(enable_timing=True)
   start.record()
   for i in range(total_samples):
      resnet.eval()
      with torch.no_grad():
         out = resnet(x)
   end.record()
   torch.cuda.synchronize()
   print('Number of runs:', total_samples)
   print("Average PyTorch {} Inference time = {} ms".format(device.type, start.elapsed_time(end)/total_samples))  

Unfortunately, I still don't see much of a difference in the PyTorch inference time:

   Average PyTorch cuda Inference time = 9.928244140625 ms

The onnxruntime inference is still the same (around 48 ms), so it's still around 5 times slower.

adepierre commented 2 years ago

I think you need to put the record and the synchronize inside your loop. Something like that


with torch.no_grad():
    resnet.eval()
    for i in range(total_samples):
        start.record()
        out = resnet(x)
        end.record()
        torch.cuda.synchronize()
        total_time += start.elapsed_time(end)

nssrivathsa commented 2 years ago

I think you need to put the record and the synchronize inside your loop. Something like that

with torch.no_grad():
    resnet.eval()
    for i in range(total_samples):
        start.record()
        out = resnet(x)
        end.record()
        torch.cuda.synchronize()
        total_time += start.elapsed_time(end)

I changed it the way you suggested. I still get: Average PyTorch cuda Inference time = 9.747972169876098 ms

adepierre commented 2 years ago

Ok so it's probably not that

edgchen1 commented 2 years ago

Is the input actually on GPU? In the code you provided, I see: io_binding.bind_cpu_input(input_names, x)

Some other things to try: ORT has some support for profiling: https://onnxruntime.ai/docs/performance/tune-performance.html#profiling-and-performance-report

You can also use CUDA profiling tools like nvprof to get more detailed GPU usage info.
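
For reference, a minimal sketch of enabling the built-in ORT profiler, assuming the resnet18.onnx export and the 'input' name from the script above:

import numpy as np
import onnxruntime

so = onnxruntime.SessionOptions()
so.enable_profiling = True  # write a JSON trace with per-operator timings

sess = onnxruntime.InferenceSession(
    "resnet18.onnx", so,
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"])

x = np.random.randn(1, 3, 224, 224).astype(np.float32)
sess.run(None, {"input": x})  # "input" matches the input_names used in the export above

profile_file = sess.end_profiling()  # returns the path of the generated JSON trace
print("Profile written to:", profile_file)  # the trace can be inspected in chrome://tracing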

tianleiwu commented 2 years ago

@nssrivathsa, it is not a fair comparison since the ORT latency includes copying the input tensor from CPU to GPU, while PyTorch's does not. Could you bind a GPU input and measure the latency again?
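
One way to do that is to copy the input into a GPU-resident OrtValue once and bind it, so the timed runs exclude the host-to-device copy. A minimal sketch, assuming the resnet18.onnx export and the 'input'/'output' names from the script above:

import numpy as np
import onnxruntime

sess = onnxruntime.InferenceSession(
    "resnet18.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"])

x = np.random.randn(1, 3, 224, 224).astype(np.float32)
x_gpu = onnxruntime.OrtValue.ortvalue_from_numpy(x, "cuda", 0)  # one-time copy to GPU memory

io_binding = sess.io_binding()
io_binding.bind_ortvalue_input("input", x_gpu)  # input is already resident on the GPU
io_binding.bind_output("output", "cuda")        # keep the output on the GPU as well

sess.run_with_iobinding(io_binding)             # timed runs now exclude the input copy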

nssrivathsa commented 2 years ago

Is the input actually on GPU? In the code you provided, I see: io_binding.bind_cpu_input(input_names, x)

Some other things to try: ORT has some support for profiling: https://onnxruntime.ai/docs/performance/tune-performance.html#profiling-and-performance-report

You can also use CUDA profiling tools like nvprof to get more detailed GPU usage info.

@edgchen1 1) Yes, the input is on the GPU. io_binding.bind_cpu_input copies the input from CPU to GPU as per 'Scenario 1' under IOBinding in this - http://www.xavierdupre.fr/app/onnxruntime/helpsphinx/api_summary.html. Nevertheless, I have modified it as

data = onnxruntime.OrtValue.ortvalue_from_numpy(x, device.type, 0)
io_binding.bind_input(input_names, device, 0, np.float32, [batch_size, 3, 224, 224], data.data_ptr())

and I still see no difference in the timing

2) I did go through the profiling stuff from ONNX and generated the runtime JSON file, but could not infer much from it once I opened it in Google Chrome. Is there any help available on how to interpret the results?

nssrivathsa commented 2 years ago

@nssrivathsa, it is not a fair comparison since the ORT latency includes copying the input tensor from CPU to GPU, while PyTorch's does not. Could you bind a GPU input and measure the latency again?

@tianleiwu

As you can see from my reply above, io_binding.bind_cpu_input should also do what you are suggesting. Nevertheless, once I modify my code as -

data = onnxruntime.OrtValue.ortvalue_from_numpy(x, device.type, 0)
io_binding.bind_input(input_names, device, 0, np.float32, [batch_size, 3, 224, 224], data.data_ptr())

still the timings on GPU for 1000 samples are -

Average onnxruntime cuda Inference time = 44.55 ms
Average PyTorch cuda Inference time = 9.780553787231446 ms

edgchen1 commented 2 years ago

Yes, the input is on the GPU. io_binding.bind_cpu_input copies the input from CPU to GPU as per 'Scenario 1' under IOBinding in this - http://www.xavierdupre.fr/app/onnxruntime/helpsphinx/api_summary.html.

The official docs are here: https://onnxruntime.ai/docs/api/python/api_summary.html

https://onnxruntime.ai/docs/api/python/api_summary.html#onnxruntime.IOBinding.bind_cpu_input The input is on CPU so it will need to be copied to the GPU.

I did go through the profiling stuff from onnx and generated the runtime json file but could not infer much from it once I opened it from Google Chrome. Is there any help available on how to interpret the results?

@RandySheriffH is there some documentation about CUDA kernel profiling in ORT?

You can look for host to device memcpy's with nvprof to see whether the input is actually getting copied from CPU to GPU. https://docs.nvidia.com/cuda/profiler-users-guide/index.html#nvprof-overview Visual profiler can be used to view the nvprof results. https://docs.nvidia.com/cuda/profiler-users-guide/index.html#visual

nssrivathsa commented 2 years ago

I could not find the issue with profiling, as everything seems to take more time. But I observed that running it on a machine with a more advanced GPU makes onnxruntime faster than PyTorch. For example, on a Tesla T4 or Tesla V100 GPU, onnxruntime is around 4 times faster than PyTorch.

tianleiwu commented 2 years ago

@nssrivathsa, any update on this? Are you able to find the cause? If you have nvprof results of ORT and PyTorch, you can compare the profiling results to see which CUDA kernel or memory copy ORT spends more time on.

nssrivathsa commented 2 years ago

@tianleiwu No, I am still not able to find the cause. But, like I mentioned, it does not happen when I run it on more advanced GPUs.

yunjiangster commented 2 years ago

@nssrivathsa Even on a Tesla T4, I am seeing a 2x slowdown using ONNX versus PyTorch + FP16.

nssrivathsa commented 2 years ago

@yunjiangster I haven't used any quantization yet; the script is the same as what I have shared above, and with that I see improvements when I use it with a Tesla T4 on AWS.

thomas-beznik commented 2 years ago

We're having the same issues with our models - seeing a ~2x slow-down when running our models on GPU with ONNX Runtime vs. PyTorch. This is very problematic, and forces us to search for another solution for putting our models in production... Any help / update on this issue would be greatly appreciated! I'm happy to assist in the debugging if it can help, thanks!

FYI, using the above example seems to work for us though; we are seeing similar speeds between the ONNX and PyTorch models. In our case, we are using a 3D UNet model (see here), with similar options as above to convert to ONNX. What could be the causes of such a slow-down? Could it be due to some unsupported operations, for example? I can attach the model graph if it can help.

CanyonWind commented 2 years ago

Seeing similar problems as well -- we saw ORT with the CUDA EP running 2.3x slower than native PyTorch inference. In our case, the model is a large, complicated diffusion model. I have a hunch that it's caused by the exported ONNX graph having too many glue operators.

Could people from Microsoft step in and confirm whether that's the case? Thanks

thomas-beznik commented 2 years ago

We're having the same issues with our models - seeing a ~2x slow-down when running our models on GPU with ONNX Runtime vs. PyTorch. This is very problematic, and forces us to search for another solution for putting our models in production... Any help / update on this issue would be greatly appreciated! I'm happy to assist in the debugging if it can help, thanks!

FYI, using the above example seems to work for us though; we are seeing similar speeds between the ONNX and PyTorch models. In our case, we are using a 3D UNet model (see here), with similar options as above to convert to ONNX. What could be the causes of such a slow-down? Could it be due to some unsupported operations, for example? I can attach the model graph if it can help.

I've made a new issue for our problem as it might not be fully related to this one, since I cannot reproduce the same slow-down with the ResNet model

tianleiwu commented 2 years ago

@CanyonWind and @thomas-beznik, I recommend the nvprof tool to profile your model. It will tell you which kernels use the most time. For example, in an older version of the Hugging Face stable diffusion pipeline, the UNet model has ScatterND, which is very slow in ORT. It is possible to modify the modeling code to avoid those slow operators.

For diffusion models with Conv operators, I suggest trying the cudnn_conv_use_max_workspace CUDA provider option, as sketched below. Sometimes it can make a big difference.
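
A minimal sketch of passing these CUDA EP options at session creation; the model path is a placeholder, and the options assume a recent onnxruntime-gpu release:

import onnxruntime

cuda_options = {
    "cudnn_conv_use_max_workspace": "1",    # allow cuDNN a larger workspace for faster conv algorithms
    "cudnn_conv_algo_search": "EXHAUSTIVE", # benchmark conv algorithms instead of using heuristics
}

sess = onnxruntime.InferenceSession(
    "model.onnx",  # placeholder path for your exported model
    providers=[("CUDAExecutionProvider", cuda_options), "CPUExecutionProvider"])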

In my benchmark of the Hugging Face stable diffusion pipeline, the original ONNX FP32 model can be 2x slower. If you combine FP16 and conv algo search, ORT can be 25% faster than PyTorch:

Latency (seconds per query)    T4     V100   A100
PyTorch FP16                   12.8   5.1    3.1
ORT FP32                       28.4   10.1   7.2
ORT FP16                       13.1   6.4    4.3
ORT FP16 + Conv Algo Search    9.6    3.8    2.4

tianleiwu commented 2 years ago

@nssrivathsa, I ran your script on a V100 GPU with PyTorch 1.12.1+cu116, onnxruntime-gpu 1.12.1, the latest CUDA 11.7, and the latest cuDNN 8.5.0.96. (The output screenshot is attached in the original issue.)

It seems that ORT is much faster than PyTorch on V100.

Here is the script:

import os
import torch
import onnx
import torchvision.models as models
import onnxruntime
import time
import numpy

batch_size = 128
total_samples = 1000
device = torch.device('cuda:0')

def convert_to_onnx(resnet):
   resnet.eval()
   dummy_input = (torch.randn(batch_size, 3, 224, 224, device=device)).to(device=device)
   input_names = [ 'input' ]
   output_names = [ 'output' ]
   torch.onnx.export(resnet, 
               dummy_input,
               "resnet18.onnx",
               verbose=True,
               opset_version=13,
               input_names=input_names,
               output_names=output_names,
               export_params=True,
               do_constant_folding=True,
               dynamic_axes={
                  'input': {0: 'batch_size'},  # variable length axes
                  'output': {0: 'batch_size'}}
               )

def infer_pytorch(resnet):
   print('Pytorch Inference')
   print('==========================')
   print()

   x = torch.randn((batch_size, 3, 224, 224))
   x = x.to(device=device)

   latency = []
   for i in range(total_samples):
      t0 = time.time()
      resnet.eval()
      with torch.no_grad():
         out = resnet(x)
      latency.append(time.time() - t0)

   print('Number of runs:', len(latency))
   print("Average PyTorch {} Inference time = {} ms".format(device.type, format(sum(latency) * 1000 / len(latency), '.2f')))  

def to_numpy(tensor):
   return tensor.detach().cpu().numpy() if tensor.requires_grad else tensor.cpu().numpy()

def infer_onnxruntime():
   print('Onnxruntime Inference')
   print('==========================')
   print()

   onnx_model = onnx.load("resnet18.onnx")
   onnx.checker.check_model(onnx_model)

   # Input
   x = torch.randn((batch_size, 3, 224, 224))
   x = x.to(device=device).contiguous()

   so = onnxruntime.SessionOptions()
   exproviders = ["CUDAExecutionProvider", "CPUExecutionProvider"]

   model_onnx_path = os.path.join(".", "resnet18.onnx")
   ort_session = onnxruntime.InferenceSession(model_onnx_path, so, providers=exproviders)

   #IOBinding see https://onnxruntime.ai/docs/api/python/api_summary.html for more info
   input_name = ort_session.get_inputs()[0].name
   output_name = ort_session.get_outputs()[0].name
   io_binding = ort_session.io_binding()

   io_binding.bind_input(
    name=input_name,
    device_type='cuda',
    device_id=0,
    element_type=numpy.float32,
    shape=tuple(x.shape),
    buffer_ptr=x.data_ptr(),
   )

   io_binding.bind_output(output_name, 'cuda')

   #warm up run
   ort_session.run_with_iobinding(io_binding)
   ort_outs = io_binding.copy_outputs_to_cpu()

   latency = []

   for i in range(total_samples):
      t0 = time.time()
      ort_session.run_with_iobinding(io_binding)
      latency.append(time.time() - t0)
      ort_outs = io_binding.copy_outputs_to_cpu()
   print('Number of runs:', len(latency))
   print("Average onnxruntime {} Inference time = {} ms".format(device.type, format(sum(latency) * 1000 / len(latency), '.2f')))   

if __name__ == '__main__':
   torch.cuda.empty_cache()
   resnet = (models.resnet18(pretrained=True)).to(device=device)
   convert_to_onnx(resnet)
   infer_onnxruntime()
   infer_pytorch(resnet)

davidmezzetti commented 1 year ago

Similar issue described here - https://github.com/microsoft/onnxruntime/issues/12880#issuecomment-1330442188

ma-xu commented 1 year ago

@tianleiwu Hi, thanks for the script. However, when batch size is large, like 128, ONNX is still much slower than PyTorch.

tianleiwu commented 1 year ago

@ma-xu, in what environment (PyTorch, CUDA, cuDNN version, and GPU) did you find ONNX much slower than PyTorch?

ONNX Runtime and PyTorch both use cuDNN as the backend for convolution, so I would expect the latency to be close for large batch sizes. If you find a significant difference at a large batch size, that usually indicates some integration issue.
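
A quick sanity check for such integration issues is to confirm that the GPU build is installed and that the CUDA EP is actually registered for the session; a minimal sketch, assuming the resnet18.onnx model from above:

import onnxruntime

print(onnxruntime.get_device())  # prints "GPU" for a GPU-enabled build of onnxruntime

sess = onnxruntime.InferenceSession(
    "resnet18.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"])
print(sess.get_providers())      # CUDAExecutionProvider should be listed ahead of the CPU EP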

I did a test with batch size 128 on V100 with CUDA 11.7, PyTorch 1.13.1+cu117, and ORT 1.14.0. ORT is still faster:

Onnxruntime Inference
==========================

Number of runs: 1000
Average onnxruntime cuda Inference time = 24.17 ms
Pytorch Inference
==========================

Number of runs: 1000
Average PyTorch cuda Inference time = 30.39 ms

ma-xu commented 1 year ago

@tianleiwu Thanks for your reply. It may be caused by some ONNX errors; I uploaded the output here. Hope this can help.

My configuration is:
   python: 3.9.12 (main, Jun 1 2022, 11:38:51) [GCC 7.5.0]
   pytorch: 1.12.1
   timm: 0.6.12
   cuda: 11.3
   cudnn: 8302
   onnx: 1.13.0
   onnxruntime: 1.13.1
   tensorrt: 8.5.2.2
   gpu_name: NVIDIA A100-SXM4-40GB
   cpu_arch: X86_64
   cpu_brand: AMD EPYC 7742 64-Core Processor

benchmark_out.txt

cminnoy commented 1 year ago

I have a model exported from TensorFlow to ONNX. The ONNX runtime using DirectML is also very slow on the RDNA2 platform. It is about 3 times faster than CPU, but that is still really slow. You can even see that the GPU is not utilised properly, as the power consumption of the card doesn't go up.

HarperIcey commented 3 months ago

I also encountered this problem when deploying my model. When running inference with onnxruntime-gpu, the running time is slower than with the torch model. Has there been any research progress? I really need your help, thank you very much!