microsoft / onnxruntime

ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator
https://onnxruntime.ai
MIT License

[Performance] CPU inference much slower with GPU runtime package #19451

Open oliver-bernhardt opened 5 months ago

oliver-bernhardt commented 5 months ago

Describe the issue

Hey, we are planning to add GPU inference (using Microsoft.ML.OnnxRuntime.Gpu 1.17.0) as an option in our C# software. However, when switching from the CPU ONNX Runtime package to the GPU one, we noticed a significant drop in inference speed while still running on the CPU execution provider. I know this package is meant to be used with CUDA, but we cannot expect all of our users to have a CUDA-compatible GPU, so the CPU runtime should still be a viable option for them. I simply fail to understand why there is a 20% increase in CPU inference time when using the same version of ONNX Runtime but simply switching to the CUDA-compatible package.

To reproduce

Run any inference in C# using the default CPU execution provider, once with Microsoft.ML.OnnxRuntime 1.17.0 and once with Microsoft.ML.OnnxRuntime.Gpu 1.17.0. The difference was very obvious for me and occurred with every model I tried.
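For illustration, a minimal sketch of the kind of comparison described above: the same code is built once against Microsoft.ML.OnnxRuntime and once against Microsoft.ML.OnnxRuntime.Gpu, with the default CPU execution provider in both cases. The model path, input name, element type, and shape are placeholders, not values from this issue.

```csharp
using System;
using System.Diagnostics;
using Microsoft.ML.OnnxRuntime;
using Microsoft.ML.OnnxRuntime.Tensors;

class CpuProviderTiming
{
    static void Main()
    {
        // Default SessionOptions => default CPU execution provider,
        // regardless of which NuGet package (CPU or GPU) is referenced.
        using var session = new InferenceSession("model.onnx");

        // Placeholder input; adjust name, element type and shape to the real model.
        var tensor = new DenseTensor<float>(new[] { 1, 3, 224, 224 });
        var inputs = new[] { NamedOnnxValue.CreateFromTensor("input", tensor) };

        const int runs = 100_000;
        var sw = Stopwatch.StartNew();
        for (int i = 0; i < runs; i++)
        {
            using var results = session.Run(inputs);
        }
        sw.Stop();

        Console.WriteLine($"{runs} runs in {sw.Elapsed.TotalSeconds:F1} s " +
                          $"({sw.Elapsed.TotalMilliseconds / runs:F3} ms per run)");
    }
}
```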

Urgency

No response

Platform

Windows

OS Version

Windows 11

ONNX Runtime Installation

Released Package

ONNX Runtime Version or Commit ID

1.17.0

ONNX Runtime API

C#

Architecture

X64

Execution Provider

Default CPU

Execution Provider Library Version

No response

Model File

No response

Is this a quantized model?

Yes

oliver-bernhardt commented 5 months ago

Can someone comment on whether this is known and reproducible? Am I doing something wrong?

tianleiwu commented 5 months ago

I did not observe this in the average latency of 1000 inference runs on a model with Python. Usually, we skip a few warm-up runs before measurement. @oliver-bernhardt, how do you measure the inference time?
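To make that measurement pattern concrete, here is a hedged C# sketch of it (skip a few warm-up runs, then average over many measured runs). The `session` and `inputs` arguments are assumed to be set up as in the repro sketch above, and the run counts are placeholders.

```csharp
using System.Collections.Generic;
using System.Diagnostics;
using Microsoft.ML.OnnxRuntime;

static class LatencyMeasurement
{
    // Average per-run latency in milliseconds, excluding warm-up runs,
    // so one-time initialization cost is not counted in the measurement.
    public static double AverageLatencyMs(
        InferenceSession session,
        IReadOnlyCollection<NamedOnnxValue> inputs,
        int warmupRuns = 10,
        int measuredRuns = 1000)
    {
        for (int i = 0; i < warmupRuns; i++)
            session.Run(inputs).Dispose();   // discard warm-up results

        var sw = Stopwatch.StartNew();
        for (int i = 0; i < measuredRuns; i++)
            session.Run(inputs).Dispose();
        sw.Stop();

        return sw.Elapsed.TotalMilliseconds / measuredRuns;
    }
}
```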

oliver-bernhardt commented 5 months ago

I am working in C#, so maybe there is a difference there. I not only measured inference directly; we also have a general framework for benchmarking our software, and after introducing the change we saw a 30% increase in overall runtime.

I usually run a couple of million inferences, and there you can see a clear difference (we are talking about a couple of minutes of total inference time, and an increase of almost a minute in total runtime just by changing the ONNX Runtime package from CPU to GPU).

yuslepukhin commented 5 months ago

Can you give an example of the model so we are on the same page? A sample program is usually helpful to catch any issues with the code as well. Also, are you using the new OrtValue API or the old one? The C# API is not much aware of execution providers. One thing that is present, though, is copying input data to the GPU and results back from the GPU.

oliver-bernhardt commented 5 months ago

Hi, no, I was not aware of the OrtValue API. I was using code like this:

```csharp
byte[,] data = ...;
NamedOnnxValue.CreateFromTensor("input", data.ToTensor());
```

I will check out OrtValue and come back with what I find.
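For context, this is roughly what the OrtValue-based input path looks like with the 1.16+ C# API. It is only a sketch: the model path, input/output names, element type, and shape are placeholders rather than values from this issue.

```csharp
using System;
using System.Linq;
using Microsoft.ML.OnnxRuntime;

class OrtValueSketch
{
    static void Main()
    {
        using var session = new InferenceSession("model.onnx");
        using var runOptions = new RunOptions();

        // A flat managed buffer plus an explicit shape; OrtValue wraps the
        // memory directly instead of going through NamedOnnxValue/DenseTensor.
        float[] data = new float[1 * 3 * 224 * 224];   // placeholder input data
        long[] shape = { 1, 3, 224, 224 };             // placeholder shape
        using var input = OrtValue.CreateTensorValueFromMemory(data, shape);

        using var results = session.Run(
            runOptions,
            new[] { "input" },     // placeholder input name
            new[] { input },
            new[] { "output" });   // placeholder output name

        ReadOnlySpan<float> output = results.First().GetTensorDataAsSpan<float>();
        Console.WriteLine($"First output value: {output[0]}");
    }
}
```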

> Can you give an example of the model so we are on the same page? A sample program is usually helpful to catch any issues with the code as well.

I will see what I can do. I am working with proprietary code and models here, so it will not be easy for me to give an example. But it is already good to know that this is not a known issue with the two different packages.

Oli

oliver-bernhardt commented 4 months ago

So far I still have not been able to figure out why CPU execution is up to 30% slower with Microsoft.ML.OnnxRuntime.Gpu 1.17.0 than with Microsoft.ML.OnnxRuntime 1.17.0.

Am I really the only person who observes this?

I would really like to move our application to support GPU execution but it is an absolute no-go for us if clients that don't have a CUDA GPU now suffer a 30% drop in performance.

Is there a way I can change packages client-side on the fly (for example, shipping with both Microsoft.ML.OnnxRuntime and Microsoft.ML.OnnxRuntime.Gpu but picking the desired package at runtime)?
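For illustration only, a minimal sketch of one possible approach, under the assumption that shipping just the GPU package is acceptable: append the CUDA execution provider when it is available and fall back to the default CPU provider otherwise. The helper name is made up, and this does not by itself address the CPU-speed regression described above.

```csharp
using System;
using Microsoft.ML.OnnxRuntime;

static class SessionFactory
{
    // Tries to create a CUDA-backed session; falls back to the default
    // CPU execution provider if CUDA cannot be initialized on this machine.
    public static InferenceSession Create(string modelPath)
    {
        try
        {
            var options = new SessionOptions();
            options.AppendExecutionProvider_CUDA(0);   // device 0
            return new InferenceSession(modelPath, options);
        }
        catch (Exception)   // e.g. OnnxRuntimeException when CUDA/cuDNN is missing
        {
            return new InferenceSession(modelPath);    // default CPU provider
        }
    }
}
```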

HarperIcey commented 2 weeks ago

I also encountered this problem when deploying a model. When running inference with onnxruntime-gpu, the runtime is slower than with the torch model. Have you made any progress on this? I really need your help, thank you very much!