Closed: michaelroyzen closed this issue 5 months ago
Despite using a TensorRT config and `force_symmetric_range=True`, it seems the weights are still asymmetrical @fxmarty.
@michaelroyzen I can reproduce the issue. Could you open your model in Netron and inspect the node `/layers.0/self_attn/Add_2_output_0_QuantizeLinear`?
Trying on a dummy model (https://huggingface.co/fxmarty/tiny-random-GemmaForCausalLM), it appears that at least one node is not quantized properly, but others are fine.
I'm not sure whether this is due to using a dummy model, a very small dataset or a bug in ORT quantization. But we have the issue at the same point in the network, so I suspect a bug.
Can you try:
```python
qconfig.operators_to_quantize = ['Conv', 'ConvTranspose', 'Gemm', 'Clip', 'Relu', 'Reshape', 'Transpose', 'Squeeze', 'Unsqueeze', 'Resize', 'MaxPool', 'AveragePool', 'MatMul', 'Split', 'Gather', 'Where', 'InstanceNormalization', 'LayerNormalization']
```
(or a subset of that; the list above simply removes "Softmax")
Also maybe useful to you (for quantization quality): https://github.com/huggingface/optimum/blob/bb21ae7f7d572805f6ecdea8e0f02dc6014d57e8/examples/onnxruntime/quantization/text-classification/run_glue.py#L467-L476
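As a hedged illustration of the kind of calibration tuning available in Optimum (the linked example's exact setup may differ), a percentile-based calibration config can replace plain min-max, which often helps with outlier-heavy activations. Here `calibration_dataset` is assumed to be a dataset prepared with `ORTQuantizer.get_calibration_dataset`:
```python
from optimum.onnxruntime.configuration import AutoCalibrationConfig

# Percentile calibration clips outlier activations instead of taking the
# raw min/max over the calibration set.
calibration_config = AutoCalibrationConfig.percentiles(
    calibration_dataset,  # assumed: a datasets.Dataset of model inputs
    percentile=99.999,
)
```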
After that, all quantization nodes are symmetric, but you may get another error:
```
[02/29/2024-10:03:51] [V] [TRT] After concat removal: 18 layers
[02/29/2024-10:03:51] [V] [TRT] Trying to split Reshape and strided tensor
[02/29/2024-10:03:51] [I] [TRT] Graph optimization time: 1.42899 seconds.
[02/29/2024-10:03:51] [V] [TRT] Building graph using backend strategy 2
[02/29/2024-10:03:51] [I] [TRT] Local timing cache in use. Profiling results in this builder pass will not be stored.
[02/29/2024-10:03:51] [V] [TRT] Constructing optimization profile number 0 [1/1].
[02/29/2024-10:03:51] [V] [TRT] Applying generic optimizations to the graph for inference.
[02/29/2024-10:03:51] [E] Error[2]: Assertion !n->candidateRequirements.empty() failed. No supported formats for /model/layers.0/self_attn/rotary_emb/Unsqueeze_1
[02/29/2024-10:03:51] [E] Error[2]: [optimizer.cpp::getFormatRequirements::3154] Error Code 2: Internal Error (Assertion !n->candidateRequirements.empty() failed. No supported formats for /model/layers.0/self_attn/rotary_emb/Unsqueeze_1)
[02/29/2024-10:03:51] [E] Engine could not be created from network
[02/29/2024-10:03:51] [E] Building engine failed
[02/29/2024-10:03:51] [E] Failed to create engine from model or file.
[02/29/2024-10:03:51] [E] Engine set up failed
&&&& FAILED TensorRT.trtexec [TensorRT v8601] # trtexec --onnx=model_quantized.onnx --saveEngine=model.plan --minShapes=input_ids:1x400,attention_mask:1x400,position_ids:1x400 --optShapes=input_ids:16x400,attention_mask:16x400,position_ids:16x400 --maxShapes=input_ids:32x400,attention_mask:32x400,position_ids:32x400 --verbose --int8
```
(I tried exporting the model with int32 instead of int64 but that does not help :/)
By the way, if you want to use past key values, you would need to run the calibration with dummy ones (of shape [batch_size, num_heads, 0, head_dim]); a sketch is below. Otherwise, the model as you are using it right now does not use the KV cache.
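A minimal sketch of such calibration inputs in NumPy; the layer/head counts are illustrative (read the real values from the model config), and the `past_key_values.{i}.key` / `.value` input names assume the Optimum ONNX export naming:
```python
import numpy as np

# Illustrative dimensions: take the real values from the model config
# (num_hidden_layers, num_key_value_heads, head_dim).
batch_size, num_heads, head_dim, num_layers = 1, 1, 256, 18
seq_len = 8

calibration_inputs = {
    "input_ids": np.ones((batch_size, seq_len), dtype=np.int64),
    "attention_mask": np.ones((batch_size, seq_len), dtype=np.int64),
    "position_ids": np.arange(seq_len, dtype=np.int64)[None, :].repeat(batch_size, axis=0),
}

# Zero-length cache: the sequence axis is 0, so the forward pass behaves
# like a cache-less prefill while keeping the KV-cache inputs bound.
for i in range(num_layers):
    empty = np.zeros((batch_size, num_heads, 0, head_dim), dtype=np.float32)
    calibration_inputs[f"past_key_values.{i}.key"] = empty
    calibration_inputs[f"past_key_values.{i}.value"] = empty
```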
Have you considered using TRT-LLM / https://github.com/huggingface/optimum-nvidia ?
Same error on `nvcr.io/nvidia/tensorrt:24.01-py3` (the `Assertion !n->candidateRequirements.empty() failed. No supported formats for /model/layers.0/self_attn/rotary_emb/Unsqueeze_1`). Sounds like a bug in TRT.
Thank you for looking into this. I think the issue could be in onnxruntime @fxmarty. When setting `qconfig.operators_to_quantize = ['MatMul']`, I get:
```
File ~/.local/lib/python3.8/site-packages/optimum/onnxruntime/quantization.py:417, in ORTQuantizer.quantize(self, quantization_config, save_dir, file_suffix, calibration_tensors_range, use_external_data_format, preprocessor)
    389 quantizer = quantizer_factory(
    390     model=onnx_model,
    391     static=quantization_config.is_static,
    (...)
    413     },
    414 )
    416 LOGGER.info("Quantizing model...")
--> 417 quantizer.quantize_model()
    419 suffix = f"_{file_suffix}" if file_suffix else ""
    420 quantized_model_path = save_dir.joinpath(f"{self.onnx_model_path.stem}{suffix}").with_suffix(".onnx")

File ~/.local/lib/python3.8/site-packages/onnxruntime/quantization/qdq_quantizer.py:263, in QDQQuantizer.quantize_model(self)
    260     self.tensor_to_its_receiving_nodes[tensor_name] = []
    261 self.tensor_to_its_receiving_nodes[tensor_name].append(node)
--> 263 self._quantize_normal_tensors()
    264 self._quantize_sharing_param_tensors()
    265 if self.quantize_bias:

File ~/.local/lib/python3.8/site-packages/onnxruntime/quantization/qdq_quantizer.py:454, in QDQQuantizer._quantize_normal_tensors(self)
    449 data_found, scale_name, zp_name, _, _ = self._get_quantization_params(
    450     tensor_name, used_scale, used_zp
    451 )
    453 if not data_found:
--> 454     raise ValueError(
    455         f"Quantization parameters are not specified for param {tensor_name}. "
    456         "In static mode quantization params for inputs and outputs of nodes to be quantized are required."
    457     )
    459 self._add_qdq_pair_for_activation(tensor_name, scale_name, zp_name, data_type=tensor_info.data_type)
    461 del self.tensors_to_quantize[tensor_name]

ValueError: Quantization parameters are not specified for param /layers.0/self_attn/Add_2_output_0. In static mode quantization params for inputs and outputs of nodes to be quantized are required.
```
That node, `Add_2_output_0`, is the same node that has had problems with compilation in TensorRT, as per my initial comment.
We do use TRT-LLM for generation, but our goal here is to use Mistral/Gemma for embeddings. TRT-LLM only returns the logits, not the last hidden states before the final projection; a sketch of the path we need is below.
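Concretely, the embedding path looks like this in eager PyTorch (the model ID and mean pooling are illustrative assumptions, not our exact setup):
```python
import torch
from transformers import AutoModel, AutoTokenizer

# Run the decoder trunk (no LM head) and pool its last hidden states.
model_id = "mistralai/Mistral-7B-v0.1"  # illustrative
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)  # AutoModel: hidden states, no logits

inputs = tokenizer(["a sentence to embed"], return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state  # [batch, seq, hidden]

# Mean-pool over non-padding tokens to get one vector per sequence.
mask = inputs["attention_mask"].unsqueeze(-1).to(hidden.dtype)
embeddings = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
```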
@michaelroyzen were you able to confirm that the issue is fixed, as reported in https://github.com/NVIDIA/TensorRT/issues/3688?
Can't confirm yet, as TensorRT 10.0 has not been released. @IlyasMoutawwakil
Closing as it is a TRT bug.
@michaelroyzen Hi. Is your statically quantized model working normally? I tried quantizing a MiniCPM model with a similar approach, but it just generates garbage.
Hi @michaelroyzen, why do you use ORTQuantizer? Can't you just quantize the model with `trtexec`?
System Info
Who can help?
@fxmarty
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
Reproduction (minimal, reproducible, runnable)
I'm trying to export Gemma/Mistral models from HF to be used in TensorRT in INT8. While the ONNX export completes successfully, the TensorRT `trtexec` conversion fails with errors about symmetric quantization, despite enabling those flags in Optimum.
To reproduce, first export a Gemma model to ONNX with static INT8 quantization:
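A minimal sketch of what such a quantization script can look like with Optimum, assuming the model has already been exported to ONNX; the model directory, calibration dataset, and preprocessing below are illustrative placeholders, not the exact original script:
```python
from functools import partial

from transformers import AutoTokenizer
from optimum.onnxruntime import ORTQuantizer
from optimum.onnxruntime.configuration import AutoCalibrationConfig, AutoQuantizationConfig

# Assumes a prior ONNX export, e.g.:
#   optimum-cli export onnx --model google/gemma-2b gemma_onnx/
model_dir = "gemma_onnx"
tokenizer = AutoTokenizer.from_pretrained(model_dir)
quantizer = ORTQuantizer.from_pretrained(model_dir, file_name="model.onnx")

# TensorRT-oriented QDQ config (symmetric weights and activations).
qconfig = AutoQuantizationConfig.tensorrt(per_channel=False)

def preprocess(examples, tokenizer):
    return tokenizer(examples["text"], padding="max_length", max_length=400, truncation=True)

# Small calibration set; the dataset and column name are placeholders.
calibration_dataset = quantizer.get_calibration_dataset(
    "wikitext",
    dataset_config_name="wikitext-2-raw-v1",
    preprocess_function=partial(preprocess, tokenizer=tokenizer),
    num_samples=128,
    dataset_split="train",
)
calibration_config = AutoCalibrationConfig.minmax(calibration_dataset)

# force_symmetric_range=True is the flag discussed in this issue.
ranges = quantizer.fit(
    dataset=calibration_dataset,
    calibration_config=calibration_config,
    operators_to_quantize=qconfig.operators_to_quantize,
    force_symmetric_range=True,
)

quantizer.quantize(
    quantization_config=qconfig,
    save_dir="gemma_quantized",
    calibration_tensors_range=ranges,
)
```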
Then, inside the `nvcr.io/nvidia/tensorrt:23.10-py3` container, we build the TRT engine:
which yields the following error:
It seems like the weights aren't symmetric after all. Would appreciate your help @fxmarty! Thanks in advance.
Expected behavior
The weights are symmetric and the TensorRT engine successfully builds with INT8.