lucasjinreal opened this issue 2 years ago
ORT has TensorRT as an execution provider. That means if you do

```python
import onnxruntime

execution_providers = ['TensorrtExecutionProvider', 'CUDAExecutionProvider', 'CPUExecutionProvider']
session = onnxruntime.InferenceSession(model_path, sess_options, providers=execution_providers)
```

TensorRT is invoked, and any ops it doesn't support automatically fall back to ORT's CUDA kernels.
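As a quick sanity check (a sketch that assumes the `session` object created above), you can list the providers that were actually registered for the session:

```python
# Nodes that TensorRT cannot handle are partitioned to the CUDA/CPU providers automatically;
# this just confirms the registration order for the session created above.
print(session.get_providers())
# e.g. ['TensorrtExecutionProvider', 'CUDAExecutionProvider', 'CPUExecutionProvider']
```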
@wschin thanks, have you ever tested a quantized GPT ONNX model? I know it has a TensorRT provider. I just want to make sure whether the ops I mentioned above are supported or not.
TRT doesn't support dynamic quantization. It supports static quantization with the QDQ format. Here is an example: https://github.com/microsoft/onnxruntime-inference-examples/tree/main/quantization/nlp/bert/trt. TRT has performance optimizations for BERT models, but not for GPT* yet.
GPT model quantization is planned to be supported in the next TRT major release later this year. If you want to try BERT quantization in ORT-TRT, please follow the steps in the link above; there are a few other examples for CNN models in the quantization directory. Please don't feed models optimized for CPU/CUDA to TRT, since TRT has its own optimization approach and the fused nodes produced for CPU/CUDA won't work in TRT.
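For reference, here is a minimal sketch of what static (QDQ) quantization looks like with onnxruntime's quantization API; this is not the linked BERT example, and the model paths, calibration samples, and `DataReader` class are placeholders:

```python
import numpy as np
from onnxruntime.quantization import CalibrationDataReader, QuantFormat, QuantType, quantize_static

class DataReader(CalibrationDataReader):
    """Feeds a handful of representative inputs for calibration."""
    def __init__(self, samples):
        self._iter = iter(samples)        # each sample: dict of input_name -> numpy array

    def get_next(self):
        return next(self._iter, None)     # None signals that calibration data is exhausted

# Placeholder calibration data; real input names/shapes depend on your model.
calibration_samples = [{"input_ids": np.zeros((1, 128), dtype=np.int64)}]

quantize_static(
    "model.onnx",                          # float32 input model (placeholder path)
    "model_qdq.onnx",                      # quantized output (placeholder path)
    calibration_data_reader=DataReader(calibration_samples),
    quant_format=QuantFormat.QDQ,          # QDQ nodes, the format the TensorRT EP consumes
    activation_type=QuantType.QInt8,
    weight_type=QuantType.QInt8,
    per_channel=True,
)
```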
@stevenlix thanks for the information, that sounds very promising! When will the next TensorRT major release come out?
> TRT doesn't support dynamic quantization. It supports static quantization with the QDQ format. Here is an example: https://github.com/microsoft/onnxruntime-inference-examples/tree/main/quantization/nlp/bert/trt. TRT has performance optimizations for BERT models, but not for GPT* yet.

@stevenlix
Any idea why that is, @yufenglee? Maybe because operations such as MatMulInteger cannot be consumed by TensorRT? Would the PR adding QDQ support for dynamic quantization, https://github.com/microsoft/onnxruntime/pull/12705, allow dynamically quantized models to be used with TRT?
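To see concretely which operators a dynamically quantized model ends up with (and hence what TRT would be asked to consume), here is a rough sketch; the paths are placeholders, and the exact op names depend on the ORT version and quantization settings:

```python
import onnx
from onnxruntime.quantization import QuantType, quantize_dynamic

# Dynamic quantization typically emits MatMulInteger / DynamicQuantizeLinear nodes,
# which the TensorRT EP does not take, so those nodes fall back to other providers.
quantize_dynamic("gpt2.onnx", "gpt2_int8.onnx", weight_type=QuantType.QInt8)

ops = {node.op_type for node in onnx.load("gpt2_int8.onnx").graph.node}
print(ops & {"MatMulInteger", "DynamicQuantizeLinear"})
```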
Hi, onnxruntime provides very useful optimizations for transformer models. Using them I can convert a 1.2 GB model into a 400 MB int8-quantized model.
Is there any plan to support further TensorRT acceleration for inference? Currently it seems some ops aren't supported:
Is there any way to get them to run inference?
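Presumably the conversion mentioned above is something along these lines (a sketch only; the paths, `model_type`, `num_heads`, and `hidden_size` values are placeholders, not taken from this issue):

```python
from onnxruntime.transformers import optimizer
from onnxruntime.quantization import QuantType, quantize_dynamic

# Fuse transformer subgraphs (attention, layer norm, GELU, ...) with the ORT transformers optimizer.
opt_model = optimizer.optimize_model("model.onnx", model_type="gpt2", num_heads=12, hidden_size=768)
opt_model.save_model_to_file("model_opt.onnx")

# Dynamic int8 quantization shrinks the weights roughly 4x (e.g. ~1.2 GB -> ~400 MB).
quantize_dynamic("model_opt.onnx", "model_opt_int8.onnx", weight_type=QuantType.QInt8)

# Note: as mentioned in the comments above, a model with CPU/CUDA fused nodes like this should
# not be fed to the TensorRT EP; for TRT, use static QDQ quantization on the unfused model instead.
```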