Open feng-1985 opened 1 year ago
Good question.
If your input data is on the CPU (for example, when the tokenizer runs on the CPU), you will need to copy the input tensor to the GPU (either in your application, or by letting ONNX Runtime do it).
If your input data is already on the GPU, it is recommended to use IO Binding. See https://onnxruntime.ai/docs/api/python/api_summary.html#data-on-device for an example.
Thanks for the quick response! I get it. Can you help me with another problem related to this notebook example?
How should I use the optimized model? I have seen the following three ways:
first - enable it for debugging only
sess_options.optimized_model_filepath = os.path.join(output_dir, "optimized_model_{}.onnx".format(device_name))
(from the notebook referred to above)
second - enable model serialization
sess_options.optimized_model_filepath = "<model_output_path\optimized_model.onnx>"
(see the ONNX Runtime documentation)
third - used directly for inference
predictor = ort.InferenceSession(optimized_model,
                                 sess_options=sess_options,
                                 providers=providers)
I don't quite understand the first two ways (the third one is our goal: just optimize the model and use it for inference).
Thanks for your time!
@feng-1985, for the first and second cases, the optimized model is saved for debugging purposes, and it can show you the optimized graph.
Setting sess_options.optimized_model_filepath slows down inference since it involves I/O. For production, you should avoid saving the optimized model.
The third usage has pros and cons.
Pros: it reduces session creation time if you disable graph optimization, since the model has already been optimized.
Cons: (1) The optimized model is not portable. It might not run well if you change the device or execution provider (like CUDA -> CPU, CPU -> CUDA, or CUDA -> TensorRT). (2) It is not compatible across ONNX Runtime versions. It might not run in an older version of ONNX Runtime, since some optimized operators were only added recently, and it might hit issues in a newer version, since we may change experimental operators in the com.microsoft domain and break backward compatibility.
Overall, it is not recommended considering these pros and cons.
However, sometimes you do need to use an optimized model, namely when the optimization is done by a script (like https://onnxruntime.ai/docs/performance/transformers-optimization.html). The optimized graph might be different from the one saved directly through session options.
Thanks for clarifying!
Describe the issue
In the notebook PyTorch_Bert-Squad_OnnxRuntime_GPU, 'OnnxRuntime gpu Inference time = 25.28 ms'. If I put the data on CUDA, will inference speed up?
To reproduce
no
Urgency
No response
Platform
Linux
OS Version
no
ONNX Runtime Installation
Released Package
ONNX Runtime Version or Commit ID
no
ONNX Runtime API
Python
Architecture
X64
Execution Provider
CUDA
Execution Provider Library Version
No response