huggingface / optimum

🚀 Accelerate training and inference of 🤗 Transformers and 🤗 Diffusers with easy to use hardware optimization tools
https://huggingface.co/docs/optimum/main/
Apache License 2.0

[BUG] Cannot export Gemma/Mistral to ONNX/TensorRT using INT8 #1735

Closed: michaelroyzen closed this issue 5 months ago

michaelroyzen commented 6 months ago

System Info

Optimum main branch, commit bb21ae7f7d572805f6ecdea8e0f02dc6014d57e8
Transformers 4.38.1
OnnxRuntime 1.17.1
PyTorch 2.2.1
TensorRT 8.6.1 (nvcr.io/nvidia/tensorrt:23.10-py3) container

Who can help?

@fxmarty

Information

Tasks

Reproduction (minimal, reproducible, runnable)

I'm trying to export Gemma/Mistral models from HF so they can be used in TensorRT with INT8. While the ONNX export completes successfully, the TensorRT trtexec conversion fails with errors about symmetric quantization, despite enabling the symmetric-range options in Optimum.

To reproduce, first export a Gemma model to ONNX with static INT8 quantization:

from optimum.onnxruntime import AutoQuantizationConfig, ORTModelForFeatureExtraction, AutoCalibrationConfig, ORTQuantizer
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b")
onnx_model = ORTModelForFeatureExtraction.from_pretrained("google/gemma-2b", from_transformers=True)
quantizer = ORTQuantizer.from_pretrained(onnx_model)

from functools import partial
import torch

def preprocess_fn(ex, tokenizer):
    encoded_inputs = tokenizer(ex["func_documentation_string"], return_tensors="pt", padding=True)
    # Build position_ids from the attention mask (cumulative sum, clamped at 0 for padding positions)
    position_ids = torch.clamp(torch.cumsum(encoded_inputs["attention_mask"], dim=-1) - 1, min=0)
    encoded_inputs["position_ids"] = position_ids.long()

    return encoded_inputs

calibration_dataset = quantizer.get_calibration_dataset(
    "code_search_net",
    preprocess_function=partial(preprocess_fn, tokenizer=tokenizer),
    num_samples=3,
    dataset_split="train",
)

qconfig = AutoQuantizationConfig.tensorrt(per_channel=False)

calibration_config = AutoCalibrationConfig.minmax(calibration_dataset)
ranges = quantizer.fit(
    dataset=calibration_dataset,
    calibration_config=calibration_config,
    operators_to_quantize=qconfig.operators_to_quantize,
    use_external_data_format=True,
    batch_size=1,
    force_symmetric_range=True,
)

model_quantized_path = quantizer.quantize(
    save_dir="onnx-tensorrt/",
    calibration_tensors_range=ranges,
    quantization_config=qconfig,
    use_external_data_format=True,
)

Then, inside the nvcr.io/nvidia/tensorrt:23.10-py3 container, we build the TRT engine:

trtexec \
    --onnx=onnx-tensorrt/model_quantized.onnx \
    --saveEngine=model.plan \
    --minShapes=input_ids:1x512,attention_mask:1x512,position_ids:1x512 \
    --optShapes=input_ids:16x512,attention_mask:16x512,position_ids:16x512 \
    --maxShapes=input_ids:32x512,attention_mask:32x512,position_ids:32x512 \
    --verbose \
    --workspace=768000 \
    --int8

which yields the following error:

[02/29/2024-08:06:34] [V] [TRT] Parsing node: /layers.0/self_attn/Add_2_output_0_QuantizeLinear [QuantizeLinear]
[02/29/2024-08:06:34] [V] [TRT] Searching for input: /layers.0/self_attn/Add_2_output_0
[02/29/2024-08:06:34] [V] [TRT] Searching for input: /layers.0/self_attn/Add_2_output_0_scale
[02/29/2024-08:06:34] [V] [TRT] Searching for input: /layers.0/self_attn/Add_2_output_0_zero_point
[02/29/2024-08:06:34] [V] [TRT] /layers.0/self_attn/Add_2_output_0_QuantizeLinear [QuantizeLinear] inputs: [/layers.0/self_attn/Add_2_output_0 -> (-1, 8, -1, -1)[FLOAT]], [/layers.0/self_attn/Add_2_output_0_scale -> ()[FLOAT]], [/layers.0/self_attn/Add_2_output_0_zero_point -> ()[INT8]], 
[02/29/2024-08:06:34] [V] [TRT] Registering layer: /layers.0/self_attn/Add_2_output_0_scale for ONNX node: /layers.0/self_attn/Add_2_output_0_scale
[02/29/2024-08:06:34] [E] [TRT] ModelImporter.cpp:771: While parsing node number 2211 [QuantizeLinear -> "/layers.0/self_attn/Add_2_output_0_QuantizeLinear_Output"]:
[02/29/2024-08:06:34] [E] [TRT] ModelImporter.cpp:772: --- Begin node ---
[02/29/2024-08:06:34] [E] [TRT] ModelImporter.cpp:773: input: "/layers.0/self_attn/Add_2_output_0"
input: "/layers.0/self_attn/Add_2_output_0_scale"
input: "/layers.0/self_attn/Add_2_output_0_zero_point"
output: "/layers.0/self_attn/Add_2_output_0_QuantizeLinear_Output"
name: "/layers.0/self_attn/Add_2_output_0_QuantizeLinear"
op_type: "QuantizeLinear"

[02/29/2024-08:06:34] [E] [TRT] ModelImporter.cpp:774: --- End node ---
[02/29/2024-08:06:34] [E] [TRT] ModelImporter.cpp:777: ERROR: builtin_op_importers.cpp:1221 In function QuantDequantLinearHelper:
[6] Assertion failed: shiftIsAllZeros(zeroPoint) && "TensorRT only supports symmetric quantization. The zero point for the QuantizeLinear/DequantizeLinear operator must be all zeros."
[02/29/2024-08:06:34] [E] Failed to parse onnx file
[02/29/2024-08:06:35] [I] Finished parsing network model. Parse time: 15.6036
[02/29/2024-08:06:35] [E] Parsing model failed
[02/29/2024-08:06:35] [E] Failed to create engine from model or file.
[02/29/2024-08:06:35] [E] Engine set up failed
&&&& FAILED TensorRT.trtexec [TensorRT v8601] # trtexec --onnx=onnx-tensorrt/model_quantized.onnx --saveEngine=model.plan --minShapes=input_ids:1x512,attention_mask:1x512,position_ids:1x512 --optShapes=input_ids:16x512,attention_mask:16x512,position_ids:16x512 --maxShapes=input_ids:32x512,attention_mask:32x512,position_ids:32x512 --verbose --workspace=768000 --int8

It seems the quantization isn't symmetric after all. I'd appreciate your help, @fxmarty! Thanks in advance.

Expected behavior

The quantization is symmetric and the TensorRT engine builds successfully with INT8.

michaelroyzen commented 6 months ago

Despite using a TensorRT config and force_symmetric_range=True, it seems the quantization is still asymmetric, @fxmarty.

fxmarty commented 6 months ago

@michaelroyzen I can reproduce the issue. Could you open your model in Netron and inspect the node /layers.0/self_attn/Add_2_output_0_QuantizeLinear?

Trying on a dummy model (https://huggingface.co/fxmarty/tiny-random-GemmaForCausalLM), it appears that at least one node is not quantized properly, while the others are fine (see the attached Netron screenshot).

I'm not sure whether this is due to using a dummy model, a very small calibration dataset, or a bug in ORT quantization. But since we hit the issue at the same point in the network, I suspect a bug.
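
As an alternative to Netron, here is a minimal sketch (assuming the quantized model was saved to onnx-tensorrt/model_quantized.onnx as in the reproduction above) that flags asymmetric Q/DQ nodes programmatically with the onnx package:

import onnx
from onnx import numpy_helper

# List QuantizeLinear/DequantizeLinear nodes whose zero point is not all zeros.
# TensorRT only accepts zero-valued zero points (symmetric quantization).
model = onnx.load("onnx-tensorrt/model_quantized.onnx")
initializers = {init.name: numpy_helper.to_array(init) for init in model.graph.initializer}

for node in model.graph.node:
    if node.op_type in ("QuantizeLinear", "DequantizeLinear") and len(node.input) >= 3:
        zero_point = initializers.get(node.input[2])
        if zero_point is not None and zero_point.any():
            print(f"Asymmetric zero point: {node.name} -> {zero_point}")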

fxmarty commented 6 months ago

Can you try:

qconfig.operators_to_quantize = ['Conv', 'ConvTranspose', 'Gemm', 'Clip', 'Relu', 'Reshape', 'Transpose', 'Squeeze', 'Unsqueeze', 'Resize', 'MaxPool', 'AveragePool', 'MatMul', 'Split', 'Gather', 'Where', 'InstanceNormalization', 'LayerNormalization']

(or a subset of that - simply removing "Softmax")
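
For clarity, a sketch of where this override would slot into the reproduction script above (variable names follow that script; set it before calling quantizer.fit and quantizer.quantize):

# Reuse the TensorRT qconfig from the repro, but restrict which ops get quantized.
# Dropping "Softmax" from the default list avoids the asymmetric QuantizeLinear node reported above.
qconfig = AutoQuantizationConfig.tensorrt(per_channel=False)
qconfig.operators_to_quantize = [
    "Conv", "ConvTranspose", "Gemm", "Clip", "Relu", "Reshape", "Transpose",
    "Squeeze", "Unsqueeze", "Resize", "MaxPool", "AveragePool", "MatMul",
    "Split", "Gather", "Where", "InstanceNormalization", "LayerNormalization",
]
# then, as before:
# ranges = quantizer.fit(..., operators_to_quantize=qconfig.operators_to_quantize, ...)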

Also maybe useful to you (for quantization quality): https://github.com/huggingface/optimum/blob/bb21ae7f7d572805f6ecdea8e0f02dc6014d57e8/examples/onnxruntime/quantization/text-classification/run_glue.py#L467-L476

After that, all quantization nodes are symmetric, but you may get another error:

[02/29/2024-10:03:51] [V] [TRT] After concat removal: 18 layers
[02/29/2024-10:03:51] [V] [TRT] Trying to split Reshape and strided tensor
[02/29/2024-10:03:51] [I] [TRT] Graph optimization time: 1.42899 seconds.
[02/29/2024-10:03:51] [V] [TRT] Building graph using backend strategy 2
[02/29/2024-10:03:51] [I] [TRT] Local timing cache in use. Profiling results in this builder pass will not be stored.
[02/29/2024-10:03:51] [V] [TRT] Constructing optimization profile number 0 [1/1].
[02/29/2024-10:03:51] [V] [TRT] Applying generic optimizations to the graph for inference.
[02/29/2024-10:03:51] [E] Error[2]: Assertion !n->candidateRequirements.empty() failed. No supported formats for /model/layers.0/self_attn/rotary_emb/Unsqueeze_1
[02/29/2024-10:03:51] [E] Error[2]: [optimizer.cpp::getFormatRequirements::3154] Error Code 2: Internal Error (Assertion !n->candidateRequirements.empty() failed. No supported formats for /model/layers.0/self_attn/rotary_emb/Unsqueeze_1)
[02/29/2024-10:03:51] [E] Engine could not be created from network
[02/29/2024-10:03:51] [E] Building engine failed
[02/29/2024-10:03:51] [E] Failed to create engine from model or file.
[02/29/2024-10:03:51] [E] Engine set up failed
&&&& FAILED TensorRT.trtexec [TensorRT v8601] # trtexec --onnx=model_quantized.onnx --saveEngine=model.plan --minShapes=input_ids:1x400,attention_mask:1x400,position_ids:1x400 --optShapes=input_ids:16x400,attention_mask:16x400,position_ids:16x400 --maxShapes=input_ids:32x400,attention_mask:32x400,position_ids:32x400 --verbose --int8

(I tried exporting the model with int32 instead of int64 but that does not help :/)

By the way, if you want to use past key values, you would need to run calibration with dummy ones (of shape [batch_size, num_heads, 0, head_dim]); otherwise, the model you are exporting right now does not use the KV cache. See the sketch below.
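
A minimal sketch of adding such dummy past key values to the calibration inputs; the input names, layer count, and head dimensions here are illustrative and would need to match the exported ONNX signature:

import torch

def add_dummy_past_key_values(encoded_inputs, num_layers, num_kv_heads, head_dim):
    # Empty KV cache: sequence length 0, so the model behaves as in the first decoding step.
    batch_size = encoded_inputs["input_ids"].shape[0]
    for layer in range(num_layers):
        empty = torch.zeros(batch_size, num_kv_heads, 0, head_dim)
        encoded_inputs[f"past_key_values.{layer}.key"] = empty
        encoded_inputs[f"past_key_values.{layer}.value"] = empty.clone()
    return encoded_inputs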

Have you considered using TRT-LLM / https://github.com/huggingface/optimum-nvidia?

fxmarty commented 6 months ago

Same error on nvcr.io/nvidia/tensorrt:24.01-py3 (the Assertion !n->candidateRequirements.empty() failed. No supported formats for /model/layers.0/self_attn/rotary_emb/Unsqueeze_1). Sounds like a bug in TRT.

michaelroyzen commented 6 months ago

Thank you for looking into this. I think the issue could be in onnxruntime, @fxmarty. When setting qconfig.operators_to_quantize = ['MatMul'], I get:

File ~/.local/lib/python3.8/site-packages/optimum/onnxruntime/quantization.py:417, in ORTQuantizer.quantize(self, quantization_config, save_dir, file_suffix, calibration_tensors_range, use_external_data_format, preprocessor)
    389     quantizer = quantizer_factory(
    390         model=onnx_model,
    391         static=quantization_config.is_static,
   (...)
    413         },
    414     )
    416 LOGGER.info("Quantizing model...")
--> 417 quantizer.quantize_model()
    419 suffix = f"_{file_suffix}" if file_suffix else ""
    420 quantized_model_path = save_dir.joinpath(f"{self.onnx_model_path.stem}{suffix}").with_suffix(".onnx")

File ~/.local/lib/python3.8/site-packages/onnxruntime/quantization/qdq_quantizer.py:263, in QDQQuantizer.quantize_model(self)
    260                     self.tensor_to_its_receiving_nodes[tensor_name] = []
    261                 self.tensor_to_its_receiving_nodes[tensor_name].append(node)
--> 263 self._quantize_normal_tensors()
    264 self._quantize_sharing_param_tensors()
    265 if self.quantize_bias:

File ~/.local/lib/python3.8/site-packages/onnxruntime/quantization/qdq_quantizer.py:454, in QDQQuantizer._quantize_normal_tensors(self)
    449     data_found, scale_name, zp_name, _, _ = self._get_quantization_params(
    450         tensor_name, used_scale, used_zp
    451     )
    453     if not data_found:
--> 454         raise ValueError(
    455             f"Quantization parameters are not specified for param {tensor_name}. "
    456             "In static mode quantization params for inputs and outputs of nodes to be quantized are required."
    457         )
    459     self._add_qdq_pair_for_activation(tensor_name, scale_name, zp_name, data_type=tensor_info.data_type)
    461 del self.tensors_to_quantize[tensor_name]

ValueError: Quantization parameters are not specified for param /layers.0/self_attn/Add_2_output_0. In static mode quantization params for inputs and outputs of nodes to be quantized are required.

That node, Add_2_output_0, is the same node that failed the TensorRT conversion in my initial comment.

michaelroyzen commented 6 months ago

We do use TRT-LLM for generation, but our goal here is to use Mistral/Gemma for embeddings. TRT-LLM only returns logits, not the last hidden states pre-projection.

IlyasMoutawwakil commented 6 months ago

@michaelroyzen were you able to confirm that the issue is fixed, as reported in https://github.com/NVIDIA/TensorRT/issues/3688?

michaelroyzen commented 6 months ago

Can't confirm yet, as TensorRT 10.0 has not been released. @IlyasMoutawwakil

fxmarty commented 5 months ago

Closing as it is a TRT bug.

CHNtentes commented 4 months ago

@michaelroyzen Hi. Is your statically quantized model working normally? I tried quantizing a MiniCPM model using a similar approach, but it just generates garbage.

geraldstanje commented 3 months ago

Hi @michaelroyzen, why do you use ORTQuantizer? Can't you just quantize the model using trtexec?