microsoft / onnxruntime

ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator
https://onnxruntime.ai
MIT License

onnxruntime 1.17.0: transformers benchmarking failing for int8 quantized inference. #19409

Open snadampal opened 7 months ago

snadampal commented 7 months ago

Describe the issue

Onnxruntime transformers benchmarking is failing for int8 quantized inference; the same benchmark works fine with onnxruntime 1.16.3. I have added the error details below. I found that the commit below (commit c8399a81fed9c114c43daf2103fee48d6b02bdd7, adding support for float16 weight quantization) introduced the break, but that commit already notes the feature will not work with onnx 1.15. So I tried onnx-weekly (onnx-weekly 1.16.0.dev20240130), but the issue still exists. My question is: which onnx version is required to make this test work again?

Error:

Quantization parameters for tensor:"/embeddings/LayerNorm/Add_1_output_0" not specified
Quantization parameters for tensor:"/encoder/layer.0/attention/self/Reshape_3_output_0" not specified
Exception
Traceback (most recent call last):
  File "/home/ubuntu/sn_onnxruntime/onnxruntime/onnxruntime/python/tools/transformers/benchmark.py", line 902, in main
    results += run_onnxruntime(
  File "/home/ubuntu/sn_onnxruntime/onnxruntime/onnxruntime/python/tools/transformers/benchmark.py", line 160, in run_onnxruntime
    ) = export_onnx_model_from_pt(
  File "/home/ubuntu/sn_onnxruntime/onnxruntime/onnxruntime/python/tools/transformers/onnx_exporter.py", line 550, in export_onnx_model_from_pt
    onnx_model_file, is_valid_onnx_model, vocab_size = validate_and_optimize_onnx(
  File "/home/ubuntu/sn_onnxruntime/onnxruntime/onnxruntime/python/tools/transformers/onnx_exporter.py", line 439, in validate_and_optimize_onnx
    QuantizeHelper.quantize_onnx_model(onnx_model_path, onnx_model_path, use_external_data_format)
  File "/home/ubuntu/sn_onnxruntime/onnxruntime/onnxruntime/python/tools/transformers/quantize_helper.py", line 68, in quantize_onnx_model
    quantize_dynamic(
  File "/home/ubuntu/.local/lib/python3.10/site-packages/onnxruntime/quantization/quantize.py", line 642, in quantize_dynamic
    quantizer.quantize_model()
  File "/home/ubuntu/.local/lib/python3.10/site-packages/onnxruntime/quantization/onnx_quantizer.py", line 403, in quantize_model
    op_quantizer.quantize()
  File "/home/ubuntu/.local/lib/python3.10/site-packages/onnxruntime/quantization/operators/matmul.py", line 78, in quantize
    otype = self.quantizer.get_tensor_type(node.output[0], mandatory=True)
  File "/home/ubuntu/.local/lib/python3.10/site-packages/onnxruntime/quantization/onnx_quantizer.py", line 461, in get_tensor_type
    raise RuntimeError(f"Unable to find data type for weight_name={tensor_name!r}")
RuntimeError: Unable to find data type for weight_name='/encoder/layer.0/attention/output/dense/MatMul_output_0'
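
The quantize_dynamic call that the traceback goes through looks roughly like the sketch below when invoked directly on the intermediate model the benchmark produces (which already contains fused com.microsoft operators). This is only an illustration of the call path with placeholder file names, not a verified standalone reproduction.

# Minimal sketch of the failing call path; "bert_base_cased_opt.onnx" is a
# placeholder for the exported/optimized model the benchmark writes before
# quantization, not a path taken from the issue.
from onnxruntime.quantization import quantize_dynamic, QuantType

quantize_dynamic(
    model_input="bert_base_cased_opt.onnx",    # model containing com.microsoft ops
    model_output="bert_base_cased_int8.onnx",  # destination for the int8 model
    weight_type=QuantType.QInt8,               # int8 weights, matching "-p int8"
)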

Commit that introduced this:

commit c8399a81fed9c114c43daf2103fee48d6b02bdd7
Author: Xavier Dupré <xadupre@users.noreply.github.com>
Date:   Fri Jan 12 17:54:55 2024 +0100

    Quantization tool: support float 8 with MatMul, support float 16 weights (#18043)

    ### Description

    Whenever a QuantizeLinear or DequantizeLinear node is inserted, the type of
    the weights before quantization must be known in order to create the scale
    with the expected type. Another option would be to add many CastLike
    operators, but that would push the burden onto the onnxruntime optimizer.

    The PR tries to avoid changing the signature. To do so, it modifies the
    scale computation to store the result in a numpy array rather than a Python
    float. The numpy array must have the same type as the weights to quantize.

    The PR adds many `assert` statements to check that the scale's type is not
    a Python type or a float64. These were added to make sure all the code
    follows the same logic; the lines were kept for the first review.

    DequantizeLinear and QuantizeLinear cannot be tested with onnx==1.15. PR
    https://github.com/onnx/onnx/pull/5709 is needed to fix shape inference,
    and PR https://github.com/onnx/onnx/pull/5473 is needed to support
    QLinearMatMul with float 16. That explains why some tests are disabled
    with float 16.

    ### Motivation and Context

    The current quantization tool assumes every weight is float 32. For large
    models such as LLAMA, weights are usually float 16, and the quantization
    tool needs to be able to quantize such weights.
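
To illustrate the scale-dtype idea described in the commit message, here is a toy sketch; it is not the quantization tool's actual code, just the general pattern of keeping the scale in a numpy array with the weight's dtype instead of a Python float:

# Toy illustration: the scale inherits the dtype of the weights, so float16
# weights produce a float16 scale rather than a Python float (float64).
import numpy as np

def compute_symmetric_scale(weights: np.ndarray, qmax: int = 127) -> np.ndarray:
    # Symmetric int8 scale, returned as a 0-d numpy array in the weight dtype.
    max_abs = np.max(np.abs(weights))
    return np.array(max_abs / qmax, dtype=weights.dtype)

w16 = np.random.randn(4, 4).astype(np.float16)
scale = compute_symmetric_scale(w16)
assert scale.dtype == np.float16  # the scale keeps the weight dtype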

To reproduce

pip install onnxruntime==1.17.0
pip install onnx  # or: pip install onnx-weekly

git clone https://github.com/microsoft/onnxruntime.git
cd onnxruntime
git submodule sync
git submodule update --init --recursive
cd onnxruntime/python/tools/transformers
python3 benchmark.py -p int8 -m bert-base-cased

Urgency

It is blocking new feature development and testing, because these tests no longer work.

Platform

Linux

OS Version

Ubuntu 22.04

ONNX Runtime Installation

Released Package

ONNX Runtime Version or Commit ID

1.17.0

ONNX Runtime API

Python

Architecture

ARM64

Execution Provider

Default CPU

Execution Provider Library Version

No response

xadupre commented 7 months ago

The script quantizes the model once and produces a quantized model, then it starts quantizing that quantized model again. Because this model contains operators from the com.microsoft domain, shape_inference cannot infer the shape (or type) for any node past the first com.microsoft operator. Before PR #18043 this was not an issue, since the type to quantize was always float; now it can be float16 as well, so this information is needed. I can think of two fixes: use the symbolic shape inference implemented in onnxruntime, assuming it supports the nodes from the com.microsoft domain, or use a default type inferred from whatever information regular shape inference does provide (I would probably take the most frequent float type among the available types).
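
As a rough sketch of the first option (assuming the symbolic shape inference shipped in onnxruntime.tools handles the com.microsoft nodes in this model; file names are placeholders):

# Run ONNX Runtime's symbolic shape inference before quantizing, so tensor
# types are available past the fused com.microsoft operators.
import onnx
from onnxruntime.tools.symbolic_shape_infer import SymbolicShapeInference

model = onnx.load("model_with_msft_ops.onnx")   # placeholder input path
inferred = SymbolicShapeInference.infer_shapes(model, auto_merge=True)
onnx.save(inferred, "model_with_types.onnx")    # placeholder output path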

snadampal commented 7 months ago

Hi @xadupre, thanks for looking into it. I would like to know if there is any chance of targeting the fix for the onnxruntime 1.17.1 milestone.

xadupre commented 7 months ago

PR #19455 would let you define a default type. I modified the code to pick TensorProto.FLOAT since all the code in that subfolder was implemented assuming models were using this type.
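
If the default type ends up being configurable through quantize_dynamic's extra_options, usage might look like the sketch below; the option name "DefaultTensorType" is an assumption here and should be checked against the merged PR.

# Hedged sketch: supply a default tensor type for tensors whose type cannot be
# inferred; the "DefaultTensorType" key is assumed, not confirmed.
import onnx
from onnxruntime.quantization import quantize_dynamic, QuantType

quantize_dynamic(
    model_input="bert_base_cased_opt.onnx",    # placeholder path
    model_output="bert_base_cased_int8.onnx",  # placeholder path
    weight_type=QuantType.QInt8,
    extra_options={"DefaultTensorType": onnx.TensorProto.FLOAT},  # assumed option name
)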

snadampal commented 7 months ago

Thanks @xadupre, the PR fixes the fp32-to-int8 quantize/dequantize issue.