microsoft / onnxruntime

ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator
https://onnxruntime.ai
MIT License

Why isn't the input placed on CUDA? #16225

Open feng-1985 opened 1 year ago

feng-1985 commented 1 year ago

Describe the issue

In the PyTorch_Bert-Squad_OnnxRuntime_GPU notebook, the output reports 'OnnxRuntime gpu Inference time = 25.28 ms'. If the input data is placed on CUDA, will inference be faster?

To reproduce

no

Urgency

No response

Platform

Linux

OS Version

no

ONNX Runtime Installation

Released Package

ONNX Runtime Version or Commit ID

no

ONNX Runtime API

Python

Architecture

X64

Execution Provider

CUDA

Execution Provider Library Version

No response

tianleiwu commented 1 year ago

Good question.

If your input data is on the CPU (for example, the tokenizer runs on the CPU), the input tensor will need to be copied to the GPU (either in your application, or by ONNX Runtime).

If your input data is already on the GPU, it is recommended to use IO Binding. See https://onnxruntime.ai/docs/api/python/api_summary.html#data-on-device for an example.
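
A minimal IO Binding sketch along those lines (the model path, input name "input_ids", output name "logits", and the input shape/dtype are placeholders; a BERT SQuAD model has more inputs):

import numpy as np
import onnxruntime as ort

session = ort.InferenceSession(
    "model.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)

# Put the input tensor on GPU 0 (copied from NumPy here; in a real
# pipeline the data may already live on the device).
x_gpu = ort.OrtValue.ortvalue_from_numpy(
    np.zeros((1, 128), dtype=np.int64), "cuda", 0
)

binding = session.io_binding()
binding.bind_ortvalue_input("input_ids", x_gpu)   # bind GPU-resident input
binding.bind_output("logits", "cuda")             # keep the output on GPU

session.run_with_iobinding(binding)               # no host<->device copies per run

# Copy back to CPU only when the result is actually needed.
logits = binding.copy_outputs_to_cpu()[0]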

feng-1985 commented 1 year ago

Thanks for the quick response! I get it. Can you help me with another problem related to this notebook example?

How do I use the optimized model? I have seen the following three ways:

First - enable it for debugging only: sess_options.optimized_model_filepath = os.path.join(output_dir, "optimized_model_{}.onnx".format(device_name)) (from the notebook referenced above)

Second - enable model serialization: sess_options.optimized_model_filepath = "<model_output_path\optimized_model.onnx>" (from the documentation)

Third - use it directly for inference:

predictor = ort.InferenceSession(optimized_model,
                                 sess_options=sess_options,
                                 providers=providers)

I don't really understand the first two ways (the third one is our optimization purpose: just optimize the model and use it for inference).

Thanks for your time!

tianleiwu commented 1 year ago

@feng-1985, for the first and second cases, the optimized model is for debugging purposes; it shows you the optimized graph.

Using sess_options.optimized_model_filepath will slow things down since it involves disk I/O. For production, you should avoid saving the optimized model.
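
For reference, a minimal sketch of that debugging use, dumping the graph after ORT's optimizations so it can be inspected (file names are placeholders):

import onnxruntime as ort

sess_options = ort.SessionOptions()
sess_options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
sess_options.optimized_model_filepath = "optimized_model_gpu.onnx"

# Creating the session runs the optimizations and writes the optimized graph to disk.
_ = ort.InferenceSession(
    "model.onnx",
    sess_options,
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)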

The third usage has pros and cons.

Pros: It reduces session creation time if you disable graph optimization, since the model has already been optimized.

Cons: (1) The optimized model is not portable. It might not run well if you change the device or execution provider (like CUDA -> CPU, CPU -> CUDA, or CUDA -> TensorRT). (2) It is not compatible across ONNX Runtime versions. It might not run in an older version of ONNX Runtime since some optimized operators were only added later, and it might hit issues in a newer version since we might change some experimental operators in the com.microsoft domain and break backward compatibility.

Overall, it is not recommended considering the pros and cons.
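
A sketch of that third usage, loading an already-optimized model and skipping ORT's own graph optimization to reduce session creation time (the path is a placeholder, and the saved model must match the execution provider and ORT version, as noted above):

import onnxruntime as ort

sess_options = ort.SessionOptions()
# Skip re-optimizing a model that was already optimized offline.
sess_options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_DISABLE_ALL

predictor = ort.InferenceSession(
    "optimized_model_gpu.onnx",
    sess_options,
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)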

However, it is sometimes necessary to use an optimized model when the optimization is done by a script (like https://onnxruntime.ai/docs/performance/transformers-optimization.html). The optimized graph might be different from the one saved directly from session options.
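
A sketch of that script-based path using the onnxruntime.transformers optimizer (the input/output paths are placeholders, and num_heads/hidden_size below assume a BERT-base model; adjust them for your architecture):

from onnxruntime.transformers import optimizer

# Offline optimization of a BERT-style model, independent of session options.
opt_model = optimizer.optimize_model(
    "bert_model.onnx",
    model_type="bert",
    num_heads=12,
    hidden_size=768,
)
opt_model.save_model_to_file("bert_model_optimized.onnx")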

feng-1985 commented 1 year ago

Thanks for clarifying!