huggingface / optimum

🚀 Accelerate training and inference of 🤗 Transformers and 🤗 Diffusers with easy-to-use hardware optimization tools
https://huggingface.co/docs/optimum/main/
Apache License 2.0

Quantization failed for transformers m2m100 #621

Open bugzyz opened 1 year ago

bugzyz commented 1 year ago

System Info

optimum: 1.5.2
python: Python 3.8.10
docker image: nvcr.io/nvidia/tensorrt:22.07-py3

Who can help?

No response

Information

Tasks

Reproduction

Adding fp16=True in the OptimizationConfig causes the optimization to fail (Exception: Incomplete symbolic shape inference):

from optimum.onnxruntime import ORTOptimizer, ORTModelForSeq2SeqLM
from optimum.onnxruntime.configuration import OptimizationConfig

model_id = "facebook/m2m100_418M"
optimized_onnx_save_dir = "m2m100_optimized"

def export_optimized_onnx():
    # Load a PyTorch model and export it to the ONNX format
    model = ORTModelForSeq2SeqLM.from_pretrained(model_id, from_transformers=True)

    # Create the optimizer
    optimizer = ORTOptimizer.from_pretrained(model)

    # Define the optimization strategy by creating the appropriate configuration
    optimization_config = OptimizationConfig(
        optimization_level=2,
        optimize_with_onnxruntime_only=False,
        optimize_for_gpu=True,
        fp16=True
    )

    # Optimize the model
    optimizer.optimize(save_dir=optimized_onnx_save_dir, optimization_config=optimization_config)

if __name__ == '__main__':
    export_optimized_onnx()

error message:

...
/usr/local/lib/python3.8/dist-packages/transformers/models/m2m_100/modeling_m2m_100.py:82: TracerWarning: torch.tensor results are registered as constants in the trace. You can safely ignore this warning if you use this function to create tensors out of constant variables that would be the same every time you call this function. In any other case, this might cause the trace to be incorrect.
  mask = torch.full((tgt_len, tgt_len), torch.tensor(torch.finfo(dtype).min))
/usr/local/lib/python3.8/dist-packages/transformers/models/m2m_100/modeling_m2m_100.py:87: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  if past_key_values_length > 0:
/usr/local/lib/python3.8/dist-packages/optimum/onnxruntime/configuration.py:702: FutureWarning: optimize_with_onnxruntime_only will be deprecated soon, use enable_transformers_specific_optimizations instead, enable_transformers_specific_optimizations is set to True.
  warnings.warn(
2022-12-20 11:39:22.016166822 [W:onnxruntime:, session_state.cc:1030 VerifyEachNodeIsAssignedToAnEp] Some nodes were not assigned to the preferred execution providers which may or may not have an negative impact on performance. e.g. ORT explicitly assigns shape related ops to CPU to improve perf.
2022-12-20 11:39:22.016285892 [W:onnxruntime:, session_state.cc:1032 VerifyEachNodeIsAssignedToAnEp] Rerunning with verbose output on a non-minimal build will show node assignments.
symbolic shape inference disabled or failed.
Traceback (most recent call last):
  File "quantize_m2m100.py", line 28, in <module>
    export_optimized_onnx()
  File "quantize_m2m100.py", line 24, in export_optimized_onnx
    optimizer.optimize(save_dir=optimized_onnx_save_dir, optimization_config=optimization_config)
  File "/usr/local/lib/python3.8/dist-packages/optimum/onnxruntime/optimization.py", line 151, in optimize
    optimizer.convert_float_to_float16(keep_io_types=True)
  File "/usr/local/lib/python3.8/dist-packages/onnxruntime/transformers/models/gpt2/../../onnx_model.py", line 594, in convert_float_to_float16
    model = shape_infer_helper.infer_shapes(model, auto_merge=True, guess_output_rank=False)
  File "/usr/local/lib/python3.8/dist-packages/onnxruntime/transformers/models/gpt2/../../../tools/symbolic_shape_infer.py", line 2347, in infer_shapes
    raise Exception("Incomplete symbolic shape inference")
Exception: Incomplete symbolic shape inference

Expected behavior

We expected the optimization to complete without exceptions.

michaelbenayoun commented 1 year ago

Hi @bugzyz,

Your issue seems to be the same as #608. I opened a PR on the ONNX Runtime repo that should solve this; in the meantime, I recommend simply setting the optimization_level to 1:

from optimum.onnxruntime import OptimizationConfig

optimization_config = OptimizationConfig(
    optimization_level=1,
    enable_transformers_specific_optimizations=True,
    fp16=True,
    optimize_for_gpu=True,
)
bugzyz commented 1 year ago

Thank you @michaelbenayoun. Just curious, is this the same as the quantization feature (ORTQuantizer)?

michaelbenayoun commented 1 year ago

No, it is not. The code snippet you shared mostly performs graph optimization, meaning that operations are fused together to make inference faster. It also casts the weights to float16, which can be considered a form of quantization, but that is not what is usually meant by quantization.

If you want to quantize your model to run some operations in int8, use the ORTQuantizer.

You can find more information about quantization here, and how to apply it to your ONNX models here.
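If a concrete starting point helps, a dynamic int8 quantization with ORTQuantizer could look roughly like the sketch below. This is only a rough sketch, not an official recipe: the avx512_vnni config choice and the per-ONNX-file loop are assumptions and may need adjusting to your Optimum version and CPU.

from optimum.onnxruntime import ORTModelForSeq2SeqLM, ORTQuantizer
from optimum.onnxruntime.configuration import AutoQuantizationConfig

model_id = "facebook/m2m100_418M"
save_dir = "m2m100_quantized"

# Export the PyTorch model to ONNX first (encoder, decoder, decoder with past)
model = ORTModelForSeq2SeqLM.from_pretrained(model_id, from_transformers=True)
model.save_pretrained(save_dir)

# Dynamic quantization: weights are stored in int8, activations are quantized at runtime
dqconfig = AutoQuantizationConfig.avx512_vnni(is_static=False, per_channel=False)

# The seq2seq export produces several ONNX files, so quantize each one separately
for file_name in ("encoder_model.onnx", "decoder_model.onnx", "decoder_with_past_model.onnx"):
    quantizer = ORTQuantizer.from_pretrained(save_dir, file_name=file_name)
    quantizer.quantize(save_dir=save_dir, quantization_config=dqconfig)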

regisss commented 1 year ago

@bugzyz In the meantime, could you try adding disable_shape_inference=True to your optimization config? You will need to install Optimum from source with:

pip install git+https://github.com/huggingface/optimum.git

This will disable symbolic shape inference, which should make the model optimization succeed (unless the default shape inference fails too).
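For reference, a sketch of what the reproduction config could look like with that flag added (the other arguments are kept from the original snippet; only disable_shape_inference is new):

from optimum.onnxruntime.configuration import OptimizationConfig

optimization_config = OptimizationConfig(
    optimization_level=2,
    optimize_for_gpu=True,
    fp16=True,
    disable_shape_inference=True,  # skip symbolic shape inference during the fp16 conversion
)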

bugzyz commented 1 year ago

@regisss Sure, I will try the configuration!

bugzyz commented 1 year ago

@regisss Thanks! After adding this configuration, the optimization went well.

qunash commented 10 months ago

Hey @bugzyz, did you succeed in optimizing m2m100 418M? Converting it to ONNX and optimizing it somehow results in a model that's bigger and 5x slower at inference (on CPU) than the original...

regisss commented 10 months ago

@qunash Could you share the script/command you used to export and optimize your model?

qunash commented 10 months ago

@qunash Could you share the script/command you used to export and optimize your model?

Thanks for answering!

Here's my code:

from optimum.onnxruntime import ORTOptimizer, ORTModelForSeq2SeqLM
from optimum.onnxruntime.configuration import OptimizationConfig

model_id = "anzorq/m2m100_418M_ft_ru-kbd_44K"
optimized_onnx_save_dir = "m2m100_optimized"

model = ORTModelForSeq2SeqLM.from_pretrained(model_id, from_transformers=True)

optimizer = ORTOptimizer.from_pretrained(model)

optimization_config = OptimizationConfig(
    optimization_level=2,
    optimize_with_onnxruntime_only=False,
    optimize_for_gpu=False,
    fp16=True,
    disable_shape_inference=True,
)

optimizer.optimize(save_dir=optimized_onnx_save_dir, optimization_config=optimization_config)

I also tried with optimum-cli, with the exact same result:

optimum-cli export onnx --model anzorq/m2m100_418M_ft_ru-kbd_44K --optimize O3 output


And for inference:

from transformers import AutoTokenizer, pipeline
from optimum.onnxruntime import ORTModelForSeq2SeqLM

model_id = "anzorq/m2m100_418M_ft_ru-kbd_44K"

model = ORTModelForSeq2SeqLM.from_pretrained(model_id, subfolder="onnx", file_name="encoder_model_optimized.onnx")
tokenizer = AutoTokenizer.from_pretrained(model_id)

def translate(text, num_beams=4, num_return_sequences=4):
  inputs = tokenizer(text, return_tensors="pt")

  num_return_sequences = min(num_return_sequences, num_beams)

  translated_tokens = model.generate(
      **inputs, forced_bos_token_id=tokenizer.lang_code_to_id["zu"], num_beams=num_beams, num_return_sequences=num_return_sequences
  )

  translations = []
  for translation in tokenizer.batch_decode(translated_tokens, skip_special_tokens=True):
      translations.append(translation)

  return text, translations

translate("some text")