Hyuto opened this issue 4 months ago
Comparing the CPU EP vs the CUDA EP, the extra time is spent copying the input from CPU to GPU and then copying the output from GPU back to CPU. The model itself is very cheap to run. See the nsys profiling output (we can see that the majority of the time is spent copying the output from GPU to CPU):
If it is a preprocessing model, the output should stay in GPU memory so that subsequent processing can continue on the device.
This issue has been automatically marked as stale due to inactivity and will be closed in 30 days if no further activity occurs. If further support is needed, please provide an update and/or more details.
Describe the issue
I'm running inference using onnxruntime with `CUDAExecutionProvider` and profiling turned on; the result is here:

As we can see in the image, the run takes too long: even though model execution takes only 555 us, the `model_run` process takes much longer to finish. This does not happen with `CPUExecutionProvider`, which finishes as soon as model execution is done (although the execution itself is slower there, of course).

To reproduce
Urgency
No response
Platform
Windows
OS Version
22H2
ONNX Runtime Installation
Released Package
ONNX Runtime Version or Commit ID
1.17.1
ONNX Runtime API
Python
Architecture
X64
Execution Provider
Default CPU, CUDA
Execution Provider Library Version
CUDA 11.8
Model File
preprocess.zip
Is this a quantized model?
No