oliver-bernhardt opened 5 months ago
Can someone comment on whether this is known and reproducible? Am I doing something wrong?
I did not observe this when comparing the average latency over 1000 inference runs on a model from Python. Usually we skip a few warm-up runs before measuring. @oliver-bernhardt, how do you measure the inference time?
I am working in C#, so maybe there is a difference there. I not only measured inference directly; we also have a general framework for benchmarking our software, and after introducing the change we saw a 30% increase in overall runtime as well.
I usually run a couple of million inferences, and there the difference is clear: we are talking about a few minutes of total inference time increasing by almost a minute just by switching the ONNX Runtime package from CPU to GPU.
Can you share an example model so we are on the same page? A sample program is usually helpful for catching issues in the code as well. Also, are you using the new OrtValue API or the old one? The C# API is largely unaware of EPs. One thing that is present, though, is copying input data to the GPU and results back from it.
Hi, no, I was not aware of the OrtValue API. I was using code like this:
```csharp
byte[,] data = ...;
NamedOnnxValue.CreateFromTensor("input", data.ToTensor());
```
I will check out OrtValue and come back with what I find.
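For reference, a minimal sketch of what the newer OrtValue path might look like, assuming placeholder names (`model.onnx`, `input`, `output`) and a placeholder float input shape; `OrtValue.CreateTensorValueFromMemory` wraps the managed buffer directly instead of going through a `Tensor<T>`:

```csharp
using Microsoft.ML.OnnxRuntime;

// Placeholder model path and tensor names/shape; substitute your own.
using var session = new InferenceSession("model.onnx");
using var runOptions = new RunOptions();

float[] data = new float[1 * 3 * 224 * 224];
long[] shape = { 1, 3, 224, 224 };

// Wraps the managed array as an OrtValue without an intermediate Tensor<T> copy.
using var input = OrtValue.CreateTensorValueFromMemory(data, shape);

using var results = session.Run(runOptions,
    new[] { "input" }, new[] { input }, new[] { "output" });
float[] output = results[0].GetTensorDataAsSpan<float>().ToArray();
```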
> Can you give an example of the model to be on the same page? Also, a sample program is usually helpful to catch any issues with the code as well.

I'll see what I can do. I am working with proprietary code and models here, so it will not be easy for me to share an example. But it is already good to know that this is not a known issue between the two variants of this package.
Oli
So far I have still not been able to figure out why CPU execution is up to 30% slower with Microsoft.ML.OnnxRuntime.Gpu 1.17.0 compared to Microsoft.ML.OnnxRuntime 1.17.0.
Am I really the only person who observes this?
I would really like to move our application to support GPU execution, but it is an absolute no-go for us if clients that don't have a CUDA GPU now suffer a 30% drop in performance.
Is there a way I can switch packages client-side on the fly? (e.g., shipping with both Microsoft.ML.OnnxRuntime and Microsoft.ML.OnnxRuntime.Gpu and picking the desired package at runtime?)
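One approach worth noting: the GPU package still ships the default CPU execution provider, so rather than swapping packages, you may be able to ship only the GPU package and enable CUDA conditionally. A hedged sketch (the model path is a placeholder, and the assumption is that `AppendExecutionProvider_CUDA` throws an `OnnxRuntimeException` when the CUDA native libraries or a compatible GPU are absent):

```csharp
using Microsoft.ML.OnnxRuntime;

var options = new SessionOptions();
try
{
    // Succeeds only when the CUDA EP binaries and a compatible GPU are present.
    options.AppendExecutionProvider_CUDA(0);
}
catch (OnnxRuntimeException)
{
    // Fall back silently to the default CPU execution provider.
}

using var session = new InferenceSession("model.onnx", options); // placeholder path
```

This does not explain the CPU slowdown itself, but it avoids maintaining two package configurations.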
I also encountered this problem when deploying a model. When running inference with onnxruntime-gpu, it is slower than the PyTorch model. Have you made any progress on this? I really need your help, thank you very much!
Describe the issue
Hey, we are planning to add GPU inference (using Microsoft.ML.OnnxRuntime.Gpu 1.17.0) as an option in our C# software. However, when switching from the CPU ONNX Runtime package to the GPU package, we noticed a significant drop in inference speed even while still running on the CPU provider. I know this package is meant to be used with CUDA, but we cannot expect all of our users to have a CUDA-compatible GPU, so the CPU runtime must remain a viable option for them. I simply fail to understand why there is a 20% increase in CPU inference time when using the same ONNX Runtime version but merely switching to the CUDA-compatible package.
To reproduce
Simply run any inference in C# using the default CPU execution provider, once with Microsoft.ML.OnnxRuntime 1.17.0 and once with Microsoft.ML.OnnxRuntime.Gpu 1.17.0. The difference was very obvious for me and occurred with every model I tried.
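A minimal benchmark harness along the lines described above, with warm-up runs excluded from the measurement as suggested earlier in the thread; the model path, input name, and shape are placeholders:

```csharp
using System;
using System.Diagnostics;
using Microsoft.ML.OnnxRuntime;
using Microsoft.ML.OnnxRuntime.Tensors;

using var session = new InferenceSession("model.onnx");          // placeholder path
var tensor = new DenseTensor<float>(new[] { 1, 3, 224, 224 });    // placeholder shape
var inputs = new[] { NamedOnnxValue.CreateFromTensor("input", tensor) };

// Warm-up runs, excluded from timing.
for (int i = 0; i < 10; i++)
    session.Run(inputs).Dispose();

const int runs = 1000;
var sw = Stopwatch.StartNew();
for (int i = 0; i < runs; i++)
{
    using var results = session.Run(inputs);
}
sw.Stop();
Console.WriteLine($"avg {sw.Elapsed.TotalMilliseconds / runs:F3} ms/run");
```

Running the same program against each package (only the NuGet reference changed) should make the gap measurable.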
Urgency
No response
Platform
Windows
OS Version
Windows 11
ONNX Runtime Installation
Released Package
ONNX Runtime Version or Commit ID
1.17.0
ONNX Runtime API
C#
Architecture
X64
Execution Provider
Default CPU
Execution Provider Library Version
No response
Model File
No response
Is this a quantized model?
Yes