microsoft / onnxruntime

ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator
https://onnxruntime.ai
MIT License

Quantized SeaLLM v2 Model Outputs Same as Input #21636

Open sabre-code opened 1 month ago

sabre-code commented 1 month ago

Describe the issue

We encountered an issue while using SeaLLM v2, a 7B model, in ONNX format with int8 quantization for translation purposes. Here are the steps we followed and the problem we're facing:

  1. Model Conversion to ONNX:

    • We used the Optimum CLI to convert SeaLLM v2 into ONNX format.
    • The conversion resulted in a full precision (fp32) ONNX model.
  2. Model Quantization:

    • We applied the quantize_dynamic() function to convert the fp32 model to int8.
    • The quantization process completed without errors.
  3. Issue:

    • When using the quantized model for translation, the output is identical to the input.
    • This issue is not isolated to SeaLLM v2; we have seen the same behavior when quantizing other models, such as TinyLlama.
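One check that may help narrow down the symptom: generation APIs for decoder-only models commonly return the prompt tokens followed by the newly generated tokens, so a decoded output that looks identical to the input can mean that no new tokens were produced (or only an end-of-sequence token). The helper below is a minimal sketch under that assumption; `generated_suffix` and its parameters are hypothetical names, not part of ONNX Runtime or Optimum, and this is not a confirmed cause of the issue.

```python
from typing import List, Optional, Sequence


def generated_suffix(
    prompt_ids: Sequence[int],
    output_ids: Sequence[int],
    eos_id: Optional[int] = None,
) -> List[int]:
    """Return the tokens that follow the prompt in output_ids.

    Assumes (not confirmed for this setup) that the output echoes the
    prompt and then appends new tokens. If the output does not start
    with the prompt, it is returned unchanged.
    """
    prompt = list(prompt_ids)
    output = list(output_ids)

    # Drop the echoed prompt prefix, if present.
    if output[: len(prompt)] == prompt:
        output = output[len(prompt):]

    # Drop trailing end-of-sequence tokens, if an id was supplied.
    if eos_id is not None:
        while output and output[-1] == eos_id:
            output.pop()

    return output
```

An empty result for the quantized model, alongside a non-empty result for the fp32 model on the same prompt, would suggest that the int8 model stops generating immediately rather than copying the input.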

To reproduce

Steps to Reproduce:

  1. Convert SeaLLM v2 to ONNX using the Optimum CLI.
  2. Quantize the ONNX model from fp32 to int8 using quantize_dynamic().
  3. Use the quantized model for a translation task.
  4. Observe that the output is the same as the input.
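The first two steps can be sketched roughly as follows. This is an illustration under stated assumptions, not the exact script we ran: the model ID, directory names, and the `model.onnx` file name are placeholders, and `use_external_data_format=True` is assumed because a 7B fp32 model exceeds the 2 GB protobuf limit.

```python
from typing import List


def export_command(model_id: str, out_dir: str) -> List[str]:
    """Build the Optimum CLI command that exports a model to ONNX."""
    return ["optimum-cli", "export", "onnx", "--model", model_id, out_dir]


def quantize_to_int8(fp32_path: str, int8_path: str) -> None:
    """Dynamically quantize an fp32 ONNX model to int8 weights."""
    # Imported here so the sketch can be loaded without onnxruntime installed.
    from onnxruntime.quantization import QuantType, quantize_dynamic

    quantize_dynamic(
        model_input=fp32_path,
        model_output=int8_path,
        weight_type=QuantType.QInt8,
        # Assumption: needed because the model is larger than 2 GB.
        use_external_data_format=True,
    )


if __name__ == "__main__":
    import subprocess

    # Placeholder model ID and paths; substitute the ones actually used.
    subprocess.run(export_command("SeaLLMs/SeaLLM-7B-v2", "seallm_fp32"), check=True)
    quantize_to_int8("seallm_fp32/model.onnx", "seallm_int8/model.onnx")
```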

Urgency

No response

Platform

Linux

OS Version

Ubuntu 20.04.6

ONNX Runtime Installation

Released Package

ONNX Runtime Version or Commit ID

1.18.1

ONNX Runtime API

Python

Architecture

X64

Execution Provider

Default CPU

Execution Provider Library Version

No response

Model File

No response

Is this a quantized model?

Yes

yufenglee commented 1 month ago

@sabre-code, could you please try running this model with onnxruntime-genai? Here is an example that shows how to create and run a similar model: https://github.com/microsoft/onnxruntime-genai/blob/main/examples/python/README.md#get-the-model
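For reference, a rough sketch of invoking the onnxruntime-genai model builder from Python. The model name and output directory are placeholders, and the `int4`/`cpu` defaults are assumptions chosen for illustration; the linked README is the authoritative source for the supported options.

```python
import sys
from typing import List


def builder_command(
    model_name: str,
    output_dir: str,
    precision: str = "int4",          # assumed default for this sketch
    execution_provider: str = "cpu",  # matches the CPU setup in this issue
) -> List[str]:
    """Build the command line for the onnxruntime-genai model builder."""
    return [
        sys.executable, "-m", "onnxruntime_genai.models.builder",
        "-m", model_name,
        "-o", output_dir,
        "-p", precision,
        "-e", execution_provider,
    ]


if __name__ == "__main__":
    import subprocess

    # Placeholder model name and output folder.
    subprocess.run(builder_command("SeaLLMs/SeaLLM-7B-v2", "seallm_genai"), check=True)
```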