bugzyz opened this issue 1 year ago
Hi @bugzyz,
Your issue seems to be the same as #608. I opened a PR on the ONNX Runtime repo that should solve this. In the meantime, I recommend simply setting the optimization_level to 1:
```python
from optimum.onnxruntime import OptimizationConfig

optimization_config = OptimizationConfig(
    optimization_level=1,                             # basic graph optimizations only
    enable_transformers_specific_optimizations=True,  # fuse transformer-specific patterns
    fp16=True,                                        # cast weights to float16
    optimize_for_gpu=True,                            # fp16 models are meant for GPU inference
)
```
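For context, here is roughly how this config gets applied (a sketch assuming the ORTOptimizer API used later in this thread; the checkpoint name is only a placeholder):

```python
from optimum.onnxruntime import ORTModelForSeq2SeqLM, ORTOptimizer

# Placeholder checkpoint; substitute your own model
model = ORTModelForSeq2SeqLM.from_pretrained("facebook/m2m100_418M", from_transformers=True)
optimizer = ORTOptimizer.from_pretrained(model)

# Reuses the optimization_config defined above
optimizer.optimize(save_dir="optimized", optimization_config=optimization_config)
```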
Thank you @michaelbenayoun. Just curious, is this the same as the quantization feature (ORTQuantizer)?
No, it is not. The code snippet you shared mostly performs graph optimization, meaning that operations are fused together to make inference faster. It also casts the weights to float16, which can be considered a form of quantization, but that is not what is usually meant by quantization. If you want to quantize your model to run some operations in int8, use the ORTQuantizer.
You can find more information about quantization here, and how to apply it to your ONNX models here.
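For reference, a minimal dynamic-quantization sketch with ORTQuantizer (the checkpoint is just a placeholder, and the AVX2 target is an assumption; pick the config matching your CPU):

```python
from optimum.onnxruntime import ORTModelForSequenceClassification, ORTQuantizer
from optimum.onnxruntime.configuration import AutoQuantizationConfig

# Placeholder checkpoint; any supported model works the same way
model_id = "distilbert-base-uncased-finetuned-sst-2-english"
onnx_model = ORTModelForSequenceClassification.from_pretrained(model_id, from_transformers=True)

quantizer = ORTQuantizer.from_pretrained(onnx_model)

# Dynamic int8 quantization targeting AVX2 CPUs (assumption; use
# avx512_vnni, arm64, etc. depending on your hardware)
dqconfig = AutoQuantizationConfig.avx2(is_static=False, per_channel=False)
quantizer.quantize(save_dir="distilbert_quantized", quantization_config=dqconfig)
```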
@bugzyz In the meantime, could you try adding disable_shape_inference=True to your optimization config?
You will need to install Optimum from source with:

```bash
pip install git+https://github.com/huggingface/optimum.git
```
This will disable symbolic shape inference, which should make the model optimization succeed (unless the default shape inference fails too).
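Concretely, that would look like this (a sketch building on the config suggested above; all other parameters unchanged):

```python
optimization_config = OptimizationConfig(
    optimization_level=1,
    enable_transformers_specific_optimizations=True,
    fp16=True,
    optimize_for_gpu=True,
    disable_shape_inference=True,  # skip symbolic shape inference
)
```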
@regisss Sure, I will try the configuration!
@regisss Thanks, after adding this configuration the optimization went well!
Hey @bugzyz, did you succeed in optimizing m2m100 418M? Trying to convert it to ONNX and optimizing somehow results in a model that's bigger in size and 5x slower at inference (on CPU) than the original...
@qunash Could you share the script/command you used to export and optimize your model?
Thanks for answering!
Here's my code:
```python
from optimum.onnxruntime import ORTOptimizer, ORTModelForSeq2SeqLM
from optimum.onnxruntime.configuration import OptimizationConfig

model_id = "anzorq/m2m100_418M_ft_ru-kbd_44K"
optimized_onnx_save_dir = "m2m100_optimized"

# Export the model to ONNX and build an optimizer for it
model = ORTModelForSeq2SeqLM.from_pretrained(model_id, from_transformers=True)
optimizer = ORTOptimizer.from_pretrained(model)

optimization_config = OptimizationConfig(
    optimization_level=2,
    optimize_with_onnxruntime_only=False,
    optimize_for_gpu=False,
    fp16=True,
    disable_shape_inference=True,
)

optimizer.optimize(save_dir=optimized_onnx_save_dir, optimization_config=optimization_config)
```
I also tried with optimum-cli with the exact same result:

```bash
optimum-cli export onnx --model anzorq/m2m100_418M_ft_ru-kbd_44K --optimize O3 output
```
And for inference:

```python
from transformers import AutoTokenizer
from optimum.onnxruntime import ORTModelForSeq2SeqLM

model_id = "anzorq/m2m100_418M_ft_ru-kbd_44K"

model = ORTModelForSeq2SeqLM.from_pretrained(model_id, subfolder="onnx", file_name="encoder_model_optimized.onnx")
tokenizer = AutoTokenizer.from_pretrained(model_id)

def translate(text, num_beams=4, num_return_sequences=4):
    inputs = tokenizer(text, return_tensors="pt")
    num_return_sequences = min(num_return_sequences, num_beams)
    translated_tokens = model.generate(
        **inputs,
        forced_bos_token_id=tokenizer.lang_code_to_id["zu"],
        num_beams=num_beams,
        num_return_sequences=num_return_sequences,
    )
    translations = []
    for translation in tokenizer.batch_decode(translated_tokens, skip_special_tokens=True):
        translations.append(translation)
    return text, translations

translate("some text")
```
System Info

Who can help?
No response

Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)

Reproduction
Adding fp16=True to the OptimizationConfig causes the optimization to fail (Exception: Incomplete symbolic shape inference).

error message:

Expected behavior
We expected the optimization to complete without exceptions.
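For completeness, a minimal sketch of the failing setup (the checkpoint is a placeholder; any model whose symbolic shape inference fails should reproduce it):

```python
from optimum.onnxruntime import ORTModelForSeq2SeqLM, ORTOptimizer
from optimum.onnxruntime.configuration import OptimizationConfig

# Placeholder checkpoint for illustration
model = ORTModelForSeq2SeqLM.from_pretrained("facebook/m2m100_418M", from_transformers=True)
optimizer = ORTOptimizer.from_pretrained(model)

# fp16=True triggers "Exception: Incomplete symbolic shape inference"
# unless disable_shape_inference=True is also set (see above)
optimization_config = OptimizationConfig(optimization_level=2, fp16=True)
optimizer.optimize(save_dir="optimized", optimization_config=optimization_config)
```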