sean830314 opened 1 month ago
When you generated the QDQ model, did you use
# Generate a suitable quantization configuration for this model.
# Note that we're choosing to use uint16 activations and uint8 weights.
qnn_config = get_qnn_qdq_config(model_to_quantize,
                                my_data_reader,
                                activation_type=QuantType.QUInt16,  # uint16 activations
                                weight_type=QuantType.QUInt8)       # uint8 weights

# Quantize the model.
quantize(model_to_quantize, output_model_path, qnn_config)
For more details, refer to https://onnxruntime.ai/docs/execution-providers/QNN-ExecutionProvider.html#running-a-model-with-qnn-eps-htp-backend-python
You can also try the latest nightly build, which has fp16 precision enabled by default:
python -m pip install -i https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/ORT-Nightly/pypi/simple/ onnxruntime==1.21.0.dev20241021001
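If it helps, a minimal sketch of loading the QDQ model on the HTP backend looks like the following (model.qdq.onnx is a placeholder path, and enable_htp_fp16_precision is an assumed provider option that older builds may not support):

# Sketch: create an ONNX Runtime session that targets the QNN HTP (NPU) backend.
# "model.qdq.onnx" is a placeholder path; enable_htp_fp16_precision is assumed
# to be available in your onnxruntime-qnn build (provider options are strings).
import onnxruntime

session = onnxruntime.InferenceSession(
    "model.qdq.onnx",
    providers=["QNNExecutionProvider"],
    provider_options=[{
        "backend_path": "QnnHtp.dll",        # HTP backend library on Windows
        "enable_htp_fp16_precision": "1",    # run float ops in fp16 on the HTP
    }],
)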
Hi @sean830314,
Have you had a chance to follow the steps suggested by @HectorSVC? Could you please provide an update on your progress?
Thank you.
Thanks, @HectorSVC and @ashumish-QCOM.
I referred to the following link to modify the quantization code: https://onnxruntime.ai/docs/execution-providers/QNN-ExecutionProvider.html#running-a-model-with-qnn-eps-htp-backend-python.
Latest Update: I followed the previous suggestions and made adjustments, but the warnings still persist during the quantization process. The same errors regarding tensor type inference continue to appear, with multiple layers being skipped during quantization.
Environment:
Python Version: 3.11.1 (amd64)
ONNX Library Versions:
onnx: 1.17.0
onnxruntime: 1.19.2
onnxruntime-qnn: 1.19.0
numpy: 1.26.4
qnq_quant.py code:
import argparse

import data_reader
from onnxruntime.quantization import QuantType, quantize
from onnxruntime.quantization.execution_providers.qnn import get_qnn_qdq_config, qnn_preprocess_model


def quantize_model(model_input, model_output):
    my_data_reader = data_reader.DataReader(model_input)

    # Pre-process the float32 model for QNN compatibility.
    preproc_model_path = "model.preproc.onnx"
    model_changed = qnn_preprocess_model(model_input, preproc_model_path)
    model_to_quantize = preproc_model_path if model_changed else model_input  # fixed: was the undefined `input_model_path`

    # Generate a QNN-compatible QDQ configuration: uint16 activations, uint8 weights.
    qnn_config = get_qnn_qdq_config(model_to_quantize,
                                    my_data_reader,
                                    activation_type=QuantType.QUInt16,  # uint16 activations
                                    weight_type=QuantType.QUInt8)       # uint8 weights

    # Quantize the model.
    quantize(model_to_quantize, model_output, qnn_config)


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Statically quantize an ONNX model for the QNN EP.")
    parser.add_argument('--model_input', type=str, required=True, help='Path to the input ONNX model.')
    parser.add_argument('--model_output', type=str, required=True, help='Path to save the quantized ONNX model.')
    args = parser.parse_args()

    quantize_model(args.model_input, args.model_output)
data_reader.py code:
import numpy as np
import onnxruntime
from onnxruntime.quantization import CalibrationDataReader
from transformers import DistilBertTokenizer


class DataReader(CalibrationDataReader):
    def __init__(self, model_path: str, tokenizer_name: str = "distilbert-model", max_length: int = 512):
        self.enum_data = None
        self.tokenizer = DistilBertTokenizer.from_pretrained(tokenizer_name)
        self.max_length = max_length

        # Use an inference session to get the model's input names.
        session = onnxruntime.InferenceSession(model_path, providers=['CPUExecutionProvider'])
        inputs = session.get_inputs()
        input_names = [inp.name for inp in inputs]

        # Ten example text inputs (replace with your calibration dataset).
        # TODO: Load valid calibration input text data for your model.
        example_texts = [
            "This is a sample sentence for calibration.",
            "Another example text for the calibration process.",
            "Text input to verify the calibration process.",
            "DistilBERT uses tokenizers to process inputs.",
            "Calibration data for optimizing DistilBERT.",
            "Random text generation for the calibration task.",
            "Using tokenization with the calibration input.",
            "Sentence for testing calibration accuracy.",
            "Random text for creating input data.",
            "Final example of calibration input text."
        ]

        self.data_list = []
        for text in example_texts:
            # Tokenize the text.
            tokens = self.tokenizer(
                text,
                padding="max_length",
                truncation=True,
                max_length=self.max_length,
                return_tensors="np"
            )
            # Build the input feed expected by the model.
            input_data = {name: tokens[name].astype(np.int64) for name in input_names if name in tokens}
            self.data_list.append(input_data)

        self.datasize = len(self.data_list)

    def get_next(self):
        if self.enum_data is None:
            self.enum_data = iter(self.data_list)
        return next(self.enum_data, None)

    def rewind(self):
        self.enum_data = None
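As a quick sanity check (a sketch only; it assumes the same local model path and tokenizer directory used above), the reader's feeds can be verified against the model on the CPU before quantizing:

# Sketch: confirm the calibration feeds match the model's inputs on CPU.
import onnxruntime
import data_reader  # the DataReader defined above

reader = data_reader.DataReader("distilbert-model/model.onnx")
session = onnxruntime.InferenceSession("distilbert-model/model.onnx",
                                       providers=["CPUExecutionProvider"])

feed = reader.get_next()
print({name: arr.shape for name, arr in feed.items()})  # e.g. input_ids, attention_mask
outputs = session.run(None, feed)
print([out.shape for out in outputs])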
Executed the quantization script using the command:
python.exe .\qnq_quant.py --model_input .\distilbert-model\model.onnx --model_output model.qnq.onnx
Error Messages:
WARNING:root:Please consider to run pre-processing before quantization. Refer to example: https://github.com/microsoft/onnxruntime-inference-examples/blob/main/quantization/image_classification/cpu/ReadMe.md
WARNING:root:failed to infer the type of tensor: /distilbert/transformer/layer.0/ffn/activation/Mul_1_output_0. Skip to quantize it. Please check if it is expected.
WARNING:root:failed to infer the type of tensor: /distilbert/transformer/layer.0/ffn/activation/Mul_1_output_0. Skip to quantize it. Please check if it is expected.
WARNING:root:failed to infer the type of tensor: /distilbert/transformer/layer.0/ffn/lin2/MatMul_output_0. Skip to quantize it. Please check if it is expected.
WARNING:root:failed to infer the type of tensor: /distilbert/transformer/layer.0/ffn/lin2/MatMul_output_0. Skip to quantize it. Please check if it is expected.
WARNING:root:failed to infer the type of tensor: /distilbert/transformer/layer.1/ffn/activation/Mul_1_output_0. Skip to quantize it. Please check if it is expected.
WARNING:root:failed to infer the type of tensor: /distilbert/transformer/layer.1/ffn/activation/Mul_1_output_0. Skip to quantize it. Please check if it is expected.
WARNING:root:failed to infer the type of tensor: /distilbert/transformer/layer.1/ffn/lin2/MatMul_output_0. Skip to quantize it. Please check if it is expected.
WARNING:root:failed to infer the type of tensor: /distilbert/transformer/layer.1/ffn/lin2/MatMul_output_0. Skip to quantize it. Please check if it is expected.
WARNING:root:failed to infer the type of tensor: /distilbert/transformer/layer.2/ffn/activation/Mul_1_output_0. Skip to quantize it. Please check if it is expected.
WARNING:root:failed to infer the type of tensor: /distilbert/transformer/layer.2/ffn/activation/Mul_1_output_0. Skip to quantize it. Please check if it is expected.
WARNING:root:failed to infer the type of tensor: /distilbert/transformer/layer.2/ffn/lin2/MatMul_output_0. Skip to quantize it. Please check if it is expected.
WARNING:root:failed to infer the type of tensor: /distilbert/transformer/layer.2/ffn/lin2/MatMul_output_0. Skip to quantize it. Please check if it is expected.
WARNING:root:failed to infer the type of tensor: /distilbert/transformer/layer.3/ffn/activation/Mul_1_output_0. Skip to quantize it. Please check if it is expected.
WARNING:root:failed to infer the type of tensor: /distilbert/transformer/layer.3/ffn/activation/Mul_1_output_0. Skip to quantize it. Please check if it is expected.
WARNING:root:failed to infer the type of tensor: /distilbert/transformer/layer.3/ffn/lin2/MatMul_output_0. Skip to quantize it. Please check if it is expected.
WARNING:root:failed to infer the type of tensor: /distilbert/transformer/layer.3/ffn/lin2/MatMul_output_0. Skip to quantize it. Please check if it is expected.
WARNING:root:failed to infer the type of tensor: /distilbert/transformer/layer.4/ffn/activation/Mul_1_output_0. Skip to quantize it. Please check if it is expected.
WARNING:root:failed to infer the type of tensor: /distilbert/transformer/layer.4/ffn/activation/Mul_1_output_0. Skip to quantize it. Please check if it is expected.
WARNING:root:failed to infer the type of tensor: /distilbert/transformer/layer.4/ffn/lin2/MatMul_output_0. Skip to quantize it. Please check if it is expected.
WARNING:root:failed to infer the type of tensor: /distilbert/transformer/layer.4/ffn/lin2/MatMul_output_0. Skip to quantize it. Please check if it is expected.
WARNING:root:failed to infer the type of tensor: /distilbert/transformer/layer.5/ffn/activation/Mul_1_output_0. Skip to quantize it. Please check if it is expected.
WARNING:root:failed to infer the type of tensor: /distilbert/transformer/layer.5/ffn/activation/Mul_1_output_0. Skip to quantize it. Please check if it is expected.
WARNING:root:failed to infer the type of tensor: /distilbert/transformer/layer.5/ffn/lin2/MatMul_output_0. Skip to quantize it. Please check if it is expected.
WARNING:root:failed to infer the type of tensor: /distilbert/transformer/layer.5/ffn/lin2/MatMul_output_0. Skip to quantize it. Please check if it is expected.
WARNING:root:Please consider pre-processing before quantization. See https://github.com/microsoft/onnxruntime-inference-examples/blob/main/quantization/image_classification/cpu/ReadMe.md
Hi @sean830314,
Thanks for reporting this issue. It seems the errors are related to node configuration validation failures during inference with the QNNExecutionProvider on the Snapdragon® X Elite NPU.
Here are a few suggestions to troubleshoot:
Pre-processing Before Quantization: Run pre-processing before quantization, as suggested in the warning log. Refer to the ONNX Runtime quantization guide for details.
Check Tensor Types: Ensure all tensors have the correct types and shapes before quantization. The warnings indicate that some tensor types couldn't be inferred, so those tensors were skipped during quantization (see the sketch after this list).
Review Quantization Configuration: Verify your quantization configuration aligns with the QNNExecutionProvider requirements. Adjust settings or flags as needed.
Update ONNX Runtime: Ensure you're using the latest version of ONNX Runtime, as updates may resolve compatibility issues.
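For the tensor-type warnings specifically, a minimal sketch of running plain ONNX shape/type inference on the float32 model before quantization looks like this (paths are placeholders):

# Sketch: run ONNX shape/type inference so tensors such as
# .../ffn/lin2/MatMul_output_0 get a declared type before quantization.
import onnx
from onnx import shape_inference

shape_inference.infer_shapes_path("distilbert-model/model.onnx", "model-infer.onnx")

# Optionally verify the result and confirm that value_info is now populated.
inferred = onnx.load("model-infer.onnx")
onnx.checker.check_model(inferred)
print(len(inferred.graph.value_info), "tensors have inferred type/shape info")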
Let us know if it helps.
Thank you.
Hi @ashumish-QCOM
Thank you for your suggestions. I attempted the first step, "Pre-processing Before Quantization," but encountered the following error message:
PS C:\Users\kroos\Desktop\kroos\quantization-distilbert> python -m onnxruntime.quantization.preprocess --input .\distilbert-model\model.onnx --output model-infer.onnx
Traceback (most recent call last):
File "<frozen runpy>", line 198, in _run_module_as_main
File "<frozen runpy>", line 88, in _run_code
File "C:\Users\kroos\AppData\Local\Programs\Python\Python311\Lib\site-packages\onnxruntime\quantization\preprocess.py", line 127, in <module>
quant_pre_process(
File "C:\Users\kroos\AppData\Local\Programs\Python\Python311\Lib\site-packages\onnxruntime\quantization\shape_inference.py", line 81, in quant_pre_process
model = SymbolicShapeInference.infer_shapes(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\kroos\AppData\Local\Programs\Python\Python311\Lib\site-packages\onnxruntime\tools\symbolic_shape_infer.py", line 2932, in infer_shapes
raise Exception("Incomplete symbolic shape inference")
Exception: Incomplete symbolic shape inference
This error appears to be related to incomplete symbolic shape inference. I'm not sure if there are any additional settings or parameters that could prevent this error. Do you have any suggestions for resolving it?
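In case it's useful, here is the direct call I could experiment with (a sketch that assumes quant_pre_process exposes a skip_symbolic_shape option in onnxruntime 1.19; the CLI may expose a matching flag):

# Sketch: call quant_pre_process directly instead of the CLI so the symbolic
# shape inference step can be skipped (option name is an assumption; check
# the installed onnxruntime version).
from onnxruntime.quantization.shape_inference import quant_pre_process

quant_pre_process(
    "distilbert-model/model.onnx",   # input float32 model
    "model-infer.onnx",              # pre-processed output
    skip_symbolic_shape=True,        # fall back to plain ONNX shape inference
)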
Thank you!
Description: When running inference on the distilbert-base-uncased model using the NPU on Snapdragon® X Elite (X1E78100 - Qualcomm®) through ONNX Runtime's QNNExecutionProvider, inference fails. The same model runs successfully with the CPUExecutionProvider. The errors are related to node configuration validation failures within the ONNX model during inference.
Environment:
Device: Snapdragon® X Elite (X1E78100 - Qualcomm®)
ONNX Runtime Version: onnxruntime-qnn 1.19.0
Model: distilbert-base-uncased
Model Format: Optimized and quantized ONNX model (model_optimized_quantized.onnx)
Execution Provider: QNNExecutionProvider
Python Version: Python 3.10.11
OS: Windows 11
Code Snippet:
Error Logs: