Open nssrivathsa opened 2 years ago
I think the problem is that you measure CPU time instead of GPU time in the PyTorch case. CUDA is asynchronous, so CPU timers like time.time() won't measure the right inference time. You can try the solution provided in this PyTorch discussion and see if it's more consistent with the ONNX time.
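As a quick illustration of that point, here is a minimal sketch (not code from this thread; model and x are placeholder names for a CUDA-resident network and input) showing why the GPU queue has to be drained before reading a CPU clock:

import time
import torch

def timed_gpu_inference(model, x, runs=100):
    # CUDA launches are asynchronous, so drain the queue before and after timing;
    # otherwise time.time() mostly measures how fast the kernels were launched.
    model.eval()
    with torch.no_grad():
        torch.cuda.synchronize()
        t0 = time.time()
        for _ in range(runs):
            _ = model(x)
        torch.cuda.synchronize()
    return (time.time() - t0) * 1000 / runs  # average milliseconds per run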
Thanks for the reply.
I tried what was mentioned in the link you provided by modifying my infer_pytorch(resnet) function as:
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)

start.record()
for i in range(total_samples):
    resnet.eval()
    with torch.no_grad():
        out = resnet(x)
end.record()
torch.cuda.synchronize()

print('Number of runs:', total_samples)
print("Average PyTorch {} Inference time = {} ms".format(device.type, start.elapsed_time(end)/total_samples))
Unfortunately, I still don't see much of a difference in the PyTorch inference time:
Average PyTorch cuda Inference time = 9.928244140625 ms
The onnxruntime inference time is still the same (around 48 ms), so it's still around 5 times slower.
I think you need to put the record and the synchronize inside your loop, something like this:
with torch.no_grad():
    resnet.eval()
    for i in range(total_samples):
        start.record()
        out = resnet(x)
        end.record()
        torch.cuda.synchronize()
        total_time += start.elapsed_time(end)
I changed it the way you suggested and still get:
Average PyTorch cuda Inference time = 9.747972169876098 ms
OK, so it's probably not that.
Is the input actually on GPU? In the code you provided, I see:
io_binding.bind_cpu_input(input_names, x)
Some other things to try: ORT has some support for profiling: https://onnxruntime.ai/docs/performance/tune-performance.html#profiling-and-performance-report
You can also use CUDA profiling tools like nvprof to get more detailed GPU usage info.
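For reference, a minimal sketch of turning the ORT profiler on from Python (not part of the original comment; the model path is illustrative). The resulting JSON trace can then be loaded into a Chrome tracing viewer:

import onnxruntime

so = onnxruntime.SessionOptions()
so.enable_profiling = True  # write a profiling JSON file for this session
sess = onnxruntime.InferenceSession("resnet18.onnx", so,
                                    providers=["CUDAExecutionProvider", "CPUExecutionProvider"])

# ... run inference as usual ...

profile_file = sess.end_profiling()  # returns the path of the generated JSON trace
print("Profile written to", profile_file)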
@nssrivathsa, it is not a fair comparison, since the ORT latency includes copying the input tensor from CPU to GPU while the PyTorch latency does not. Could you bind a GPU input and measure the latency again?
@edgchen1 1) Yes, the input is on the GPU. io_binding.bind_cpu_input puts the input from the CPU to GPU as per 'Scenario 1' under IOBinding here - http://www.xavierdupre.fr/app/onnxruntime/helpsphinx/api_summary.html. Nevertheless, I have modified it as
data = onnxruntime.OrtValue.ortvalue_from_numpy(x, device.type, 0)
io_binding.bind_input(input_names, device, 0, np.float32, [batch_size, 3, 224, 224], data.data_ptr())
and I still see no difference in the timing.
2) I did go through the ORT profiling and generated the runtime JSON file, but could not infer much from it once I opened it in Google Chrome. Is there any help available on how to interpret the results?
@tianleiwu
As you can see from my reply above, io_binding.bind_cpu_input should also do what you are suggesting. Nevertheless, once I modify my code as -
data = onnxruntime.OrtValue.ortvalue_from_numpy(x, device.type, 0)
io_binding.bind_input(input_names, device, 0, np.float32, [batch_size, 3, 224, 224], data.data_ptr())
the timings on GPU for 1000 samples are still -
Average onnxruntime cuda Inference time = 44.55 ms
Average PyTorch cuda Inference time = 9.780553787231446 ms
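As an aside, a minimal sketch of an equivalent binding that avoids raw pointers by binding a GPU-resident OrtValue directly (illustrative only; ort_session, x_numpy, and the tensor names 'input'/'output' stand in for the real ones):

import onnxruntime

# Copy the numpy input to GPU memory once, then reuse it across runs.
x_ort = onnxruntime.OrtValue.ortvalue_from_numpy(x_numpy, 'cuda', 0)

io_binding = ort_session.io_binding()
io_binding.bind_ortvalue_input('input', x_ort)  # input already lives on the GPU
io_binding.bind_output('output', 'cuda')        # keep the output on the GPU too

ort_session.run_with_iobinding(io_binding)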
Yes, the input is on the GPU. io_binding.bind_cpu_input puts the input from the CPU to GPU as per 'Scenario 1' in this - http://www.xavierdupre.fr/app/onnxruntime/helpsphinx/api_summary.html under IOBinding.
The official docs are here: https://onnxruntime.ai/docs/api/python/api_summary.html
https://onnxruntime.ai/docs/api/python/api_summary.html#onnxruntime.IOBinding.bind_cpu_input The input is on CPU so it will need to be copied to the GPU.
I did go through the profiling stuff from onnx and generated the runtime json file but could not infer much from it once I opened it from Google Chrome. Is there any help available on how to interpret the results?
@RandySheriffH is there some documentation about CUDA kernel profiling in ORT?
You can look for host-to-device memcpys with nvprof to see whether the input is actually getting copied from CPU to GPU: https://docs.nvidia.com/cuda/profiler-users-guide/index.html#nvprof-overview. The Visual Profiler can be used to view the nvprof results: https://docs.nvidia.com/cuda/profiler-users-guide/index.html#visual
I could not find the issue with profiling, as everything seems to take more time. But I observed that running it on a machine with a more advanced GPU makes onnxruntime faster than PyTorch. For example, on a Tesla T4 or Tesla V100 GPU, onnxruntime is around 4 times faster than PyTorch.
@nssrivathsa, any update on this? Were you able to find the cause? If you have nvprof results for both ORT and PyTorch, you can compare them to see which CUDA kernel or memory copy ORT spends more time on.
@tianleiwu No, I am still not able to find the cause. But, like I mentioned, it does not happen when I run it on more advanced GPUs.
@nssrivathsa Even on a Tesla T4, I am seeing a 2x slowdown using ONNX versus PyTorch + fp16.
@yunjiangster I haven't used any quantization yet. The script is the same as what I have shared above, and with that I see improvements when I use it with a Tesla T4 on AWS.
We're having the same issues with our models - seeing a ~2x slow-down between running our models on GPU with PyTorch vs. with ONNX Runtime. This is very problematic and forces us to search for another solution for putting our models in production... Any help / update on this issue would be greatly appreciated! I'm happy to assist in the debugging if it can help, thanks!
FYI, using the above example seems to work for us though - we are seeing similar speeds between the ONNX and PyTorch models. In our case, we are using a 3D UNet model (see here), with similar options as above to convert to ONNX. What could be the causes of such a slow-down? Could it be due to some unsupported operations, for example? I can attach the model graph if this can help.
Seeing similar problems as well -- ORT + the CUDA EP was 2.3x slower compared to native PyTorch inference. In our case, the model is a large, complicated diffusion model. I have a hunch that it's caused by the exported ONNX graph having too many glue operators. Could people from Microsoft step in and confirm whether that's the case? Thanks
I've made a new issue for our problem as it might not be fully related to this one, since I cannot reproduce the same slow-down with the ResNet model.
@CanyonWind & @thomas-beznik, I recommend the nvprof tool to profile your model. It will tell you which kernel uses the most kernel time. For example, in an older version of the Hugging Face stable diffusion pipeline, the UNet model had ScatterND, which is very slow in ORT. It is possible to modify the modeling code to avoid those slow operators.
For a diffusion model with Conv operators, I suggest trying cudnn_conv_use_max_workspace in the CUDA provider options, like this (a sketch follows). Sometimes it can make a big difference.
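A minimal sketch of what passing those cuDNN options through the CUDA execution provider can look like (the model path is illustrative):

import onnxruntime

cuda_options = {
    "cudnn_conv_use_max_workspace": "1",     # allow a larger cuDNN workspace so faster conv algorithms qualify
    "cudnn_conv_algo_search": "EXHAUSTIVE",  # benchmark conv algorithms instead of relying on heuristics
}
sess = onnxruntime.InferenceSession(
    "model.onnx",
    providers=[("CUDAExecutionProvider", cuda_options), "CPUExecutionProvider"],
)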
In my benchmark of the Hugging Face stable diffusion pipeline, the original ONNX FP32 model could be 2x slower than PyTorch. If you combine FP16 and conv algo search, ORT could be 25% faster than PyTorch (an FP16 conversion sketch follows the table):
Latency (Seconds per Query) | T4 | V100 | A100 |
---|---|---|---|
PyTorch FP16 | 12.8 | 5.1 | 3.1 |
ORT FP32 | 28.4 | 10.1 | 7.2 |
ORT FP16 | 13.1 | 6.4 | 4.3 |
ORT FP16 + Conv Algo Search | 9.6 | 3.8 | 2.4 |
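For the FP16 rows, one way to obtain an FP16 ONNX model is the float16 converter; a minimal sketch, assuming the onnxconverter-common package is installed and with illustrative file names:

import onnx
from onnxconverter_common import float16

model_fp32 = onnx.load("model_fp32.onnx")
model_fp16 = float16.convert_float_to_float16(model_fp32)  # cast weights and insert the required Cast nodes
onnx.save(model_fp16, "model_fp16.onnx")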
@nssrivathsa, I ran your script on a V100 GPU with PyTorch 1.12.1+cu116, onnxruntime-gpu 1.12.1, the latest CUDA 11.7, and the latest cuDNN 8.5.0.96.
It seems that ORT is much faster than PyTorch on V100.
Here is the script:
import os
import torch
import onnx
import torchvision.models as models
import onnxruntime
import time
import numpy

batch_size = 128
total_samples = 1000
device = torch.device('cuda:0')

def convert_to_onnx(resnet):
    resnet.eval()
    dummy_input = (torch.randn(batch_size, 3, 224, 224, device=device)).to(device=device)
    input_names = ['input']
    output_names = ['output']
    torch.onnx.export(resnet,
                      dummy_input,
                      "resnet18.onnx",
                      verbose=True,
                      opset_version=13,
                      input_names=input_names,
                      output_names=output_names,
                      export_params=True,
                      do_constant_folding=True,
                      dynamic_axes={'input': {0: 'batch_size'},    # variable length axes
                                    'output': {0: 'batch_size'}})

def infer_pytorch(resnet):
    print('Pytorch Inference')
    print('==========================')
    print()

    x = torch.randn((batch_size, 3, 224, 224))
    x = x.to(device=device)

    latency = []
    for i in range(total_samples):
        t0 = time.time()
        resnet.eval()
        with torch.no_grad():
            out = resnet(x)
        latency.append(time.time() - t0)

    print('Number of runs:', len(latency))
    print("Average PyTorch {} Inference time = {} ms".format(device.type, format(sum(latency) * 1000 / len(latency), '.2f')))

def to_numpy(tensor):
    return tensor.detach().cpu().numpy() if tensor.requires_grad else tensor.cpu().numpy()

def infer_onnxruntime():
    print('Onnxruntime Inference')
    print('==========================')
    print()

    onnx_model = onnx.load("resnet18.onnx")
    onnx.checker.check_model(onnx_model)

    # Input
    x = torch.randn((batch_size, 3, 224, 224))
    x = x.to(device=device).contiguous()

    so = onnxruntime.SessionOptions()
    exproviders = ["CUDAExecutionProvider", "CPUExecutionProvider"]

    model_onnx_path = os.path.join(".", "resnet18.onnx")
    ort_session = onnxruntime.InferenceSession(model_onnx_path, so, providers=exproviders)

    # IOBinding, see https://onnxruntime.ai/docs/api/python/api_summary.html for more info
    input_name = ort_session.get_inputs()[0].name
    output_name = ort_session.get_outputs()[0].name
    io_binding = ort_session.io_binding()
    io_binding.bind_input(
        name=input_name,
        device_type='cuda',
        device_id=0,
        element_type=numpy.float32,
        shape=tuple(x.shape),
        buffer_ptr=x.data_ptr(),
    )
    io_binding.bind_output(output_name, 'cuda')

    # warm up run
    ort_session.run_with_iobinding(io_binding)
    ort_outs = io_binding.copy_outputs_to_cpu()

    latency = []
    for i in range(total_samples):
        t0 = time.time()
        ort_session.run_with_iobinding(io_binding)
        latency.append(time.time() - t0)
    ort_outs = io_binding.copy_outputs_to_cpu()

    print('Number of runs:', len(latency))
    print("Average onnxruntime {} Inference time = {} ms".format(device.type, format(sum(latency) * 1000 / len(latency), '.2f')))

if __name__ == '__main__':
    torch.cuda.empty_cache()
    resnet = (models.resnet18(pretrained=True)).to(device=device)
    convert_to_onnx(resnet)
    infer_onnxruntime()
    infer_pytorch(resnet)
Similar issue described here - https://github.com/microsoft/onnxruntime/issues/12880#issuecomment-1330442188
@tianleiwu Hi, thanks for the script. However, when the batch size is large, like 128, ONNX is still much slower than PyTorch.
@ma-xu, in what environment (PyTorch, CUDA, cuDNN version and GPU) did you find ONNX much slower than PyTorch?
ONNX Runtime and PyTorch both use cuDNN as the backend for convolution, so I would expect the latencies to be close for large batch sizes. If you find a significant difference for a large batch size, that usually indicates some integration issue.
I did a test with batch size 128 on V100 with CUDA 11.7, PyTorch 1.13.1+cu117, and ORT 1.14.0. ORT is still faster:
Onnxruntime Inference
==========================
Number of runs: 1000
Average onnxruntime cuda Inference time = 24.17 ms
Pytorch Inference
==========================
Number of runs: 1000
Average PyTorch cuda Inference time = 30.39 ms
@tianleiwu Thanks for your reply. It may be caused by some ONNX errors; I uploaded the output here. Hope this can help.
My configuration is:
system_verision_python 3.9.12 (main, Jun 1 2022, 11:38:51) [GCC 7.5.0]
system_verision_pytorch 1.12.1
system_verision_timm 0.6.12
system_verision_cuda 11.3
system_verision_cudnn 8302
system_verision_onnx 1.13.0
system_verision_onnxruntime 1.13.1
system_verision_tensorrt 8.5.2.2
system_gpu_name NVIDIA A100-SXM4-40GB
system_cpu_arch X86_64
system_cpu_brand_raw AMD EPYC 7742 64-Core Processor
I have a model exported from TensorFlow to ONNX. ONNX Runtime using DirectML is also very slow on the RDNA2 platform. It is about 3 times faster than CPU, but that is still really slow. You can even see that the GPU is not utilised properly, as the power consumption of the card won't go up.
I also encountered this problem when deploying a model. When running inference with onnxruntime-gpu, the running time is slower than with the torch model. Have you made any progress on this? I really need your help, thank you very much!
Describe the bug
Inference with onnxruntime is 5x slower than the PyTorch model on GPU, but 2.5x faster on CPU.
System information
To Reproduce
Current behavior
If run on CPU,
but, if run on GPU, I see
If I change graph optimizations to onnxruntime.GraphOptimizationLevel.ORT_DISABLE_ALL, I see some improvements in inference time on GPU, but it's still slower than PyTorch.
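For context, a minimal sketch of where that setting goes (the session creation details are illustrative):

import onnxruntime

so = onnxruntime.SessionOptions()
# ORT_ENABLE_ALL is the default; ORT_DISABLE_ALL turns graph optimizations off.
so.graph_optimization_level = onnxruntime.GraphOptimizationLevel.ORT_DISABLE_ALL
sess = onnxruntime.InferenceSession("resnet18.onnx", so,
                                    providers=["CUDAExecutionProvider", "CPUExecutionProvider"])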
I had read about similar issues here and ensured that I do the IO binding so that the inputs are on GPU. When converting the ResNet to ONNX, I see traces like
%193 : Float(64, 3, 7, 7, strides=[147, 49, 7, 1], requires_grad=0, device=cuda:0),
so the nodes of the model are on GPU. Further, during the processing for onnxruntime, I print device usage stats and I see this -
So, the GPU device is being used. Further, I have used the resnet18.onnx model from the Model Zoo to see if it is a converted-model issue, but I get the same results.
So, I cannot seem to figure this out any further and I have been stuck here for quite a few days. Could somebody please point out what could be the issue here?
Thanks