lucasjinreal opened this issue 2 years ago
ORT has TensorRT as an execution provider. That means if you do

```python
import onnxruntime

execution_providers = ['TensorrtExecutionProvider', 'CUDAExecutionProvider', 'CPUExecutionProvider']
session = onnxruntime.InferenceSession(model_path, sess_options, providers=execution_providers)
```

TensorRT is invoked, and any ops it doesn't support automatically fall back to ORT's CUDA kernels.
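As a quick sanity check (a sketch that assumes the `session` object created above), you can list the providers that were actually registered for the session:

```python
# Nodes that TensorRT cannot handle are partitioned to the CUDA/CPU providers automatically;
# this just confirms the registration order for the session created above.
print(session.get_providers())
# e.g. ['TensorrtExecutionProvider', 'CUDAExecutionProvider', 'CPUExecutionProvider']
```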
@wschin thanks, have you ever tested a quantized GPT ONNX model? I know it has a TensorRT provider. I just want to make sure whether the ops I mentioned above are supported or not.
TRT doesn't support dynamic quantization. It supports static quantization with the QDQ format. Here is an example: https://github.com/microsoft/onnxruntime-inference-examples/tree/main/quantization/nlp/bert/trt. TRT has performance optimizations for BERT models, but not for GPT* yet.
GPT model quantization is planned to be supported in the next TRT major release later this year. If you want to try BERT quantization in ORT-TRT, please follow the steps in the link above; there are a few other examples for CNN models in the quantization directory. Please don't feed models optimized for CPU/CUDA to TRT, since TRT has its own optimization approach and the fused nodes produced for CPU/CUDA won't work in TRT.
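For reference, here is a minimal sketch of what static (QDQ) quantization looks like with onnxruntime's quantization API; this is not the linked BERT example, and the model paths, calibration samples, and `DataReader` class are placeholders:

```python
import numpy as np
from onnxruntime.quantization import CalibrationDataReader, QuantFormat, QuantType, quantize_static

class DataReader(CalibrationDataReader):
    """Feeds a handful of representative inputs for calibration."""
    def __init__(self, samples):
        self._iter = iter(samples)        # each sample: dict of input_name -> numpy array

    def get_next(self):
        return next(self._iter, None)     # None signals that calibration data is exhausted

# Placeholder calibration data; real input names/shapes depend on your model.
calibration_samples = [{"input_ids": np.zeros((1, 128), dtype=np.int64)}]

quantize_static(
    "model.onnx",                          # float32 input model (placeholder path)
    "model_qdq.onnx",                      # quantized output (placeholder path)
    calibration_data_reader=DataReader(calibration_samples),
    quant_format=QuantFormat.QDQ,          # QDQ nodes, the format the TensorRT EP consumes
    activation_type=QuantType.QInt8,
    weight_type=QuantType.QInt8,
    per_channel=True,
)
```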
@stevenlix thanks for the information, that sounds very promising! When will the next TensorRT major release come out?
> TRT doesn't support dynamic quantization. It supports static quantization with the QDQ format. Here is an example: https://github.com/microsoft/onnxruntime-inference-examples/tree/main/quantization/nlp/bert/trt. TRT has performance optimizations for BERT models, but not for GPT* yet.

@stevenlix
Any idea why that is, @yufenglee? Maybe because operations such as MatMulInteger cannot be consumed by TensorRT? Would the PR adding QDQ support for dynamic quantization, https://github.com/microsoft/onnxruntime/pull/12705, allow dynamically quantized models to be used with TRT?
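To see concretely which operators a dynamically quantized model ends up with (and hence what TRT would be asked to consume), here is a rough sketch; the paths are placeholders, and the exact op names depend on the ORT version and quantization settings:

```python
import onnx
from onnxruntime.quantization import QuantType, quantize_dynamic

# Dynamic quantization typically emits MatMulInteger / DynamicQuantizeLinear nodes,
# which the TensorRT EP does not take, so those nodes fall back to other providers.
quantize_dynamic("gpt2.onnx", "gpt2_int8.onnx", weight_type=QuantType.QInt8)

ops = {node.op_type for node in onnx.load("gpt2_int8.onnx").graph.node}
print(ops & {"MatMulInteger", "DynamicQuantizeLinear"})
```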
Hi, onnxruntime provides very useful optimizations for transformer models. Using them I can convert a 1.2 GB model into a 400 MB int8-quantized model.
Is there any plan to support further TensorRT acceleration for inference? Currently it seems some ops aren't supported:
Is there any way to get them to run inference?
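Presumably the conversion mentioned above is something along these lines (a sketch only; the paths, `model_type`, `num_heads`, and `hidden_size` values are placeholders, not taken from this issue):

```python
from onnxruntime.transformers import optimizer
from onnxruntime.quantization import QuantType, quantize_dynamic

# Fuse transformer subgraphs (attention, layer norm, GELU, ...) with the ORT transformers optimizer.
opt_model = optimizer.optimize_model("model.onnx", model_type="gpt2", num_heads=12, hidden_size=768)
opt_model.save_model_to_file("model_opt.onnx")

# Dynamic int8 quantization shrinks the weights roughly 4x (e.g. ~1.2 GB -> ~400 MB).
quantize_dynamic("model_opt.onnx", "model_opt_int8.onnx", weight_type=QuantType.QInt8)

# Note: as mentioned in the comments above, a model with CPU/CUDA fused nodes like this should
# not be fed to the TensorRT EP; for TRT, use static QDQ quantization on the unfused model instead.
```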