microsoft / onnxruntime

ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator
https://onnxruntime.ai
MIT License

Improve Inference Performance on GPU [Python] #19930

Open · Hyuto opened 4 months ago

Hyuto commented 4 months ago

Describe the issue

I'm running inference with onnxruntime using CUDAExecutionProvider and profiling turned on; the result is shown below:

[screenshot: onnxruntime profiler output showing model_run vs. model execution times]

As the screenshot shows, the overall model_run takes far too long even though the model execution itself takes only 555 µs. This doesn't happen with CPUExecutionProvider, where model_run finishes as soon as model execution is done (at the cost of a longer execution time).

To reproduce

import cv2
import numpy as np
import onnxruntime as ort

sess_options = ort.SessionOptions()
sess_options.log_severity_level = 1  # verbose logging
preprocess = ort.InferenceSession(
    "preprocess.onnx",
    sess_options=sess_options,
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)

img = cv2.imread("IMG_PATH.jpg")
x_input = np.expand_dims(cv2.cvtColor(img, cv2.COLOR_BGR2RGB), 0)  # BGR -> RGB, add batch dim

io_binding = preprocess.io_binding()
for _ in range(10):
    io_binding.bind_cpu_input("image", x_input)  # host input, copied to GPU
    io_binding.bind_output(preprocess.get_outputs()[0].name)  # output defaults to CPU
    preprocess.run_with_iobinding(io_binding)
    output = io_binding.copy_outputs_to_cpu()[0]  # forces a GPU -> host copy
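
For reference, the profile in the screenshot can be captured through the standard SessionOptions profiling API; this is a minimal sketch, not a verbatim part of the repro above:

import onnxruntime as ort

sess_options = ort.SessionOptions()
sess_options.enable_profiling = True  # record a chrome-trace JSON profile
preprocess = ort.InferenceSession(
    "preprocess.onnx",
    sess_options=sess_options,
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)
# ... run inference as above ...
profile_path = preprocess.end_profiling()  # path to the generated trace file
print(profile_path)  # inspect model_run vs. model execution in chrome://tracing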

Urgency

No response

Platform

Windows

OS Version

22H2

ONNX Runtime Installation

Released Package

ONNX Runtime Version or Commit ID

1.17.1

ONNX Runtime API

Python

Architecture

X64

Execution Provider

Default CPU, CUDA

Execution Provider Library Version

CUDA 11.8

Model File

preprocess.zip

Is this a quantized model?

No

tianleiwu commented 3 months ago

Comparing the CPU EP with the CUDA EP, the extra time is spent copying the input from CPU to GPU and then copying the output from GPU to CPU; the model itself is very cheap to run. See the nsys profiling output below (most of the time is spent copying the output from GPU to CPU):

[screenshot: nsys timeline dominated by device-to-host memcpy]

Since this is a preprocessing model, its output should stay in GPU memory so that subsequent processing can continue on the GPU.
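
A minimal sketch of that approach, assuming a single CUDA device (device_id 0) and a hypothetical downstream session named model:

io_binding = preprocess.io_binding()
io_binding.bind_cpu_input("image", x_input)
# Bind the output to CUDA memory so it is not copied back to the host.
io_binding.bind_output(preprocess.get_outputs()[0].name, device_type="cuda", device_id=0)
preprocess.run_with_iobinding(io_binding)

# The bound output is an OrtValue living in GPU memory; feed it directly
# into the next session instead of calling copy_outputs_to_cpu().
gpu_output = io_binding.get_outputs()[0]
next_binding = model.io_binding()  # `model` is a hypothetical downstream InferenceSession
next_binding.bind_ortvalue_input(model.get_inputs()[0].name, gpu_output)
next_binding.bind_output(model.get_outputs()[0].name, device_type="cuda", device_id=0)
model.run_with_iobinding(next_binding)

This keeps the preprocess output on the device, so the only host copy left is whatever you do with the final result.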

github-actions[bot] commented 2 months ago

This issue has been automatically marked as stale due to inactivity and will be closed in 30 days if no further activity occurs. If further support is needed, please provide an update and/or more details.