imzhuhl opened 1 year ago
We recommend using dynamic quantization for transformer models on CPU. If you use static quantization, you can limit op_types_to_quantize to MatMul only. https://onnxruntime.ai/docs/performance/model-optimizations/quantization.html#method-selection https://onnxruntime.ai/docs/performance/model-optimizations/quantization.html#transformer-based-models
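A minimal sketch of what that might look like with the onnxruntime.quantization API (the model paths and the calibration reader here are illustrative, not from the issue):

```python
from onnxruntime.quantization import QuantType, quantize_dynamic, quantize_static

# Dynamic quantization: no calibration data needed; the recommended
# path for transformer models on CPU.
quantize_dynamic(
    "model.onnx",                       # hypothetical input path
    "model.dynamic.quant.onnx",
    weight_type=QuantType.QInt8,
)

# If static quantization is used anyway, restrict it to MatMul ops only.
quantize_static(
    "model.onnx",
    "model.static.quant.onnx",
    calibration_data_reader=my_reader,  # hypothetical CalibrationDataReader
    op_types_to_quantize=["MatMul"],
)
```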
Thank you for your reply.
"recommend to use dynamic quantization for transformer models on CPU"
What is the reason for this? Is it because of the current support for transformer-based static quantization not good? Or considering the actual situation, the prediction result of dynamic quantification is better.
The reason I think there is a problem with this static quantization is that the fully connected layers that compute "query", "key", and "value" are not quantized correctly.
Describe the issue
I am testing the inference performance of a model based on multi-head self-attention. After I turned on static quantization, I found that the performance dropped instead of improving. Then I wrote a simple test and found that the self-attention graph looks strange after static quantization.
Here is simple code to reproduce it:
I wrote a simple self-attention module, exported it to an ONNX model and a dynamic quant model, and then used the onnxruntime quantization tools like this:
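The original snippet was not preserved in this thread; below is a minimal sketch of the kind of reproduction described, assuming a PyTorch self-attention module and the onnxruntime.quantization tools (all file names, shapes, and the random calibration reader are illustrative):

```python
import numpy as np
import torch
from onnxruntime.quantization import (CalibrationDataReader, QuantType,
                                      quantize_dynamic, quantize_static)

class SelfAttn(torch.nn.Module):
    """Single-head self-attention, enough to show the quantized graph."""
    def __init__(self, dim=64):
        super().__init__()
        self.q = torch.nn.Linear(dim, dim)
        self.k = torch.nn.Linear(dim, dim)
        self.v = torch.nn.Linear(dim, dim)

    def forward(self, x):
        q, k, v = self.q(x), self.k(x), self.v(x)
        attn = torch.softmax(q @ k.transpose(-2, -1) / (x.shape[-1] ** 0.5), dim=-1)
        return attn @ v

# Export the fp32 model to ONNX.
model = SelfAttn().eval()
dummy = torch.randn(1, 128, 64)
torch.onnx.export(model, dummy, "attn.onnx", input_names=["x"], opset_version=13)

# Dynamic quantization: no calibration data needed.
quantize_dynamic("attn.onnx", "attn.dynamic.onnx", weight_type=QuantType.QInt8)

# Static quantization: needs a calibration data reader.
class RandomReader(CalibrationDataReader):
    def __init__(self, n=10):
        self.data = iter([{"x": np.random.randn(1, 128, 64).astype(np.float32)}
                          for _ in range(n)])

    def get_next(self):
        return next(self.data, None)

quantize_static("attn.onnx", "attn.static.onnx", RandomReader())
```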
That gives me the static quant model. Finally, I run all the models and measure the inference time: the static quant model takes the most time.
In my understanding, onnxruntime optimizes the graph during session initialization. It uses the TransformGraph function to do so, including fusing QDQ (QuantizeLinear/DequantizeLinear) nodes. So I printed the graph after optimization, along with some of the MatMul nodes:
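One way to inspect the graph after ONNX Runtime's optimization passes is to ask the runtime to serialize the transformed model; a sketch, with illustrative file names:

```python
import onnx
import onnxruntime as ort

so = ort.SessionOptions()
so.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
# Ask the runtime to dump the optimized graph to disk.
so.optimized_model_filepath = "attn.static.optimized.onnx"
ort.InferenceSession("attn.static.onnx", so, providers=["CPUExecutionProvider"])

# Print the nodes of the optimized graph.
graph = onnx.load("attn.static.optimized.onnx").graph
for node in graph.node:
    print(node.op_type, node.input, node.output)
```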
You can see that the inputs of the MatMul node are all fp32 tensors, so I think this is an fp32 GEMM operation rather than an int8 GEMM.
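A quick way to verify the element types feeding each MatMul (a sketch using onnx shape inference; the file name is illustrative):

```python
import onnx
from onnx import shape_inference

model = shape_inference.infer_shapes(onnx.load("attn.static.optimized.onnx"))
# Map every value name to its inferred element type.
elem_type = {vi.name: vi.type.tensor_type.elem_type
             for vi in list(model.graph.value_info) + list(model.graph.input)}
elem_type.update({init.name: init.data_type for init in model.graph.initializer})

for node in model.graph.node:
    if node.op_type == "MatMul":
        types = [onnx.TensorProto.DataType.Name(elem_type.get(i, 0)) for i in node.input]
        print(node.name, types)  # expect INT8/UINT8 here if the GEMM is quantized
```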
I have two questions:
Here is the whole graph:
To reproduce
Run the Python code above.
Urgency
No response
Platform
Linux
OS Version
Ubuntu
ONNX Runtime Installation
Built from Source
ONNX Runtime Version or Commit ID
main 61a79436e22892bdd91a905389f12e0aee68132e
ONNX Runtime API
Python
Architecture
X64
Execution Provider
Default CPU
Execution Provider Library Version
No response