microsoft / onnxruntime

ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator
https://onnxruntime.ai
MIT License

Static quantization for transformers models doesn't work #8770

Closed ofirzaf closed 3 years ago

ofirzaf commented 3 years ago

Describe the bug
I tried to perform static quantization (not dynamic) of a transformer model, following your guide for quantizing BERT, and got the following error.

ValueError: Quantization parameters are not specified for param 310.In static mode quantization params for inputs and outputs of nodes to be quantized are required.

System information

To Reproduce
I used the following code. Note that the ONNX model and the optimized model generated by the export and optimizer functions below produce the expected results when run with onnxruntime (see the sanity-check sketch after the code).

import torch
from transformers import AutoModelForQuestionAnswering
import os
from onnxruntime.quantization import quantize_static

model_path = '<Path to model directory>'
model_trans = AutoModelForQuestionAnswering.from_pretrained(model_path)

def export_onnx_model(model, onnx_model_path):
    with torch.no_grad():
        inputs = {'input_ids':      torch.ones(1,512, dtype=torch.int64),
                    'attention_mask': torch.ones(1,512, dtype=torch.int64),
                    'token_type_ids': torch.ones(1,512, dtype=torch.int64)}
        outputs = model(**inputs)

        symbolic_names = {0: 'batch_size', 1: 'max_seq_len'}
        torch.onnx.export(model,                               # model being run
                          (inputs['input_ids'],                # model inputs (a tuple for multiple inputs)
                           inputs['attention_mask'],
                           inputs['token_type_ids']),
                          onnx_model_path,                     # where to save the model (file or file-like object)
                          opset_version=11,                    # the ONNX opset version to export the model to
                          do_constant_folding=True,            # whether to execute constant folding for optimization
                          input_names=['input_ids',            # the model's input names
                                       'attention_mask',
                                       'token_type_ids'],
                          output_names=['output0', 'output1'], # the model's output names
                          dynamic_axes={'input_ids': symbolic_names,       # variable-length axes
                                        'attention_mask': symbolic_names,
                                        'token_type_ids': symbolic_names,
                                        'output0': symbolic_names,
                                        'output1': symbolic_names})
#         logger.info("ONNX Model exported to {0}".format(onnx_model_path))

model_fp32 = os.path.join(model_path, 'onnx', 'bert.onnx')
export_onnx_model(model_trans, model_fp32)

from onnxruntime_tools import optimizer
from onnxruntime_tools.transformers.onnx_model_bert import BertOptimizationOptions

opt_options = BertOptimizationOptions('bert')
opt_options.enable_embed_layer_norm = False
opt_model = optimizer.optimize_model(
    model_fp32,
    'bert', 
    num_heads=12,
    hidden_size=768,
    optimization_options=opt_options)

from onnxruntime.quantization import CalibrationDataReader

class SquadCalibrationDataReader(CalibrationDataReader):
    def __init__(self, batch_size=1):
        self._iter = iter(get_eval_dataloader())  # the function returns a pytorch dataloader for SQuAD evaluation data

    def get_next(self):
        return next(self._iter, None)

model_opt = os.path.join(model_path, 'onnx', 'bert.opt.onnx')
opt_model.save_model_to_file(model_opt)
model_static = os.path.join(model_path, 'onnx', 'bert.opt.static.onnx')
quantize_static(model_opt, model_static, SquadCalibrationDataReader())
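
For reference, a minimal sketch of the sanity check mentioned above (running the exported fp32 model with onnxruntime), assuming the model_fp32 path and the 1x512 dummy inputs from the code above:

import numpy as np
import onnxruntime as ort

# Sanity check: run the exported fp32 model on dummy inputs.
# The input names match those passed to torch.onnx.export above.
session = ort.InferenceSession(model_fp32, providers=['CPUExecutionProvider'])
dummy = {
    'input_ids':      np.ones((1, 512), dtype=np.int64),
    'attention_mask': np.ones((1, 512), dtype=np.int64),
    'token_type_ids': np.ones((1, 512), dtype=np.int64),
}
outputs = session.run(None, dummy)
print([o.shape for o in outputs])  # for a QA head: start/end logits, each (1, 512)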

Running the reproduction code above produced the following traceback:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/tmp/ipykernel_18446/604177838.py in <module>
     46 opt_model.save_model_to_file(model_opt)
     47 model_static = os.path.join(model_path, 'onnx', 'bert.opt.static.onnx')
---> 48 quantize_static(model_opt, model_static, SquadCalibrationDataReader())

/venv/lib/python3.7/site-packages/onnxruntime/quantization/quantize.py in quantize_static(model_input, model_output, calibration_data_reader, quant_format, op_types_to_quantize, per_channel, reduce_range, activation_type, weight_type, nodes_to_quantize, nodes_to_exclude, optimize_model, use_external_data_format, calibrate_method, extra_options)
    228             extra_options)
    229 
--> 230     quantizer.quantize_model()
    231     quantizer.model.save_model_to_file(model_output, use_external_data_format)
    232 

/venv/lib/python3.7/site-packages/onnxruntime/quantization/onnx_quantizer.py in quantize_model(self)
    206                 op_quantizer = CreateDefaultOpQuantizer(self, node)
    207 
--> 208             op_quantizer.quantize()
    209             for i in range(number_of_existing_new_nodes, len(self.new_nodes)):
    210                 for output_name in self.new_nodes[i].output:

/venv/lib/python3.7/site-packages/onnxruntime/quantization/operators/matmul.py in quantize(self)
     67 
     68         (quantized_input_names, zero_point_names, scale_names, nodes) = \
---> 69             self.quantizer.quantize_inputs(node, [0, 1], reduce_range=True, op_level_per_channel=True)
     70 
     71         data_found, output_scale_name, output_zp_name, _, _ = \

/venv/lib/python3.7/site-packages/onnxruntime/quantization/onnx_quantizer.py in quantize_inputs(self, node, indices, initializer_use_weight_qType, reduce_range, op_level_per_channel, axis)
    648                                                             self.model.graph())
    649                 if qlinear_node is None:
--> 650                     quantize_input_nodes = self._get_quantize_input_nodes(node, input_index, self.input_qType)
    651                     nodes.extend(quantize_input_nodes)
    652                     qlinear_node = quantize_input_nodes[-1]

/venv/lib/python3.7/site-packages/onnxruntime/quantization/onnx_quantizer.py in _get_quantize_input_nodes(self, node, input_index, qType, given_scale_name, given_zp_name)
    445                     "Quantization parameters are not specified for param {}."
    446                     "In static mode quantization params for inputs and outputs of nodes to be quantized are required.".
--> 447                     format(input_name))
    448             # dynamic mode
    449             # Scale and Zero Points not available for this input. Add nodes to dynamically compute it

ValueError: Quantization parameters are not specified for param 310.In static mode quantization params for inputs and outputs of nodes to be quantized are required.

Expected behavior
The quantize_static function should save a quantized ONNX model at the given path.

yufenglee commented 3 years ago

optimizer.optimize_model fuses subgraphs into custom operators. ONNX shape inference doesn't work for these custom operators, so the calibration tool cannot generate quantization parameters for them. PR #8788 solves the error.
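
To see what the optimizer produced, one can list the operator types and domains in the optimized model; the fused nodes (e.g. Attention, SkipLayerNormalization) live in the com.microsoft domain rather than the standard ONNX domain. A minimal inspection sketch, assuming the model_opt path from the report above:

import onnx

# List operator types and their domains; fused ops appear under com.microsoft,
# which standard ONNX shape inference (and hence the calibration tool) cannot handle.
opt = onnx.load(model_opt)
for op_type, domain in sorted({(n.op_type, n.domain) for n in opt.graph.node}):
    print(f"{domain or 'ai.onnx'}: {op_type}")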

If you want to try static quantization on transformer models, it is better not to use optimizer.optimize_model to fuse the model first; otherwise only a few ops will be quantized, because we don't have static quantization support for the fused ops.
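
A minimal sketch of that suggestion, reusing os, model_path, model_fp32 and get_eval_dataloader from the report above, with a hypothetical reader that returns {input name: numpy array} dicts, which is what CalibrationDataReader.get_next is expected to yield during calibration:

import numpy as np
from onnxruntime.quantization import CalibrationDataReader, quantize_static

class SquadNumpyDataReader(CalibrationDataReader):
    # Hypothetical reader: yields one dict of numpy arrays per batch, None when exhausted.
    def __init__(self, dataloader):
        self._iter = iter(dataloader)

    def get_next(self):
        batch = next(self._iter, None)
        if batch is None:
            return None
        return {
            'input_ids':      batch['input_ids'].numpy(),
            'attention_mask': batch['attention_mask'].numpy(),
            'token_type_ids': batch['token_type_ids'].numpy(),
        }

# Quantize the plain exported model (bert.onnx) instead of the fused bert.opt.onnx.
model_static = os.path.join(model_path, 'onnx', 'bert.static.onnx')
quantize_static(model_fp32, model_static, SquadNumpyDataReader(get_eval_dataloader()))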

yufenglee commented 3 years ago

Fixed with #8788.