Quantized SeaLLM v2 Model Outputs Same as Input

Describe the issue

We encountered an issue while using SeaLLM v2, a 7B model, in ONNX format with int8 quantization for translation purposes. Here are the steps we followed and the problem we're facing:

Model Conversion to ONNX:
- We used the Optimum CLI to convert SeaLLM v2 into ONNX format.
- The conversion resulted in a full precision (fp32) ONNX model.

Model Quantization:
- We applied the quantize_dynamic() function to convert the fp32 model to int8.
- The quantization process completed without errors.

Issue:
- When using the quantized model for translation, the output is identical to the input.
- This issue is not isolated to SeaLLM v2; we have seen the same behavior when quantizing other models, such as TinyLlama.
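
To make "output is identical to the input" checkable, here is a minimal sketch of an echo test on token ids. It assumes a decoder-only model whose generation output starts with the prompt tokens, so an output with no tokens beyond the prompt (or only an end-of-sequence token) decodes back to the input text. The helper names `split_generated` and `is_echo` and the `eos_id` parameter are hypothetical, not part of ONNX Runtime or Optimum.

```python
from typing import List, Optional, Sequence


def split_generated(prompt_ids: Sequence[int], output_ids: Sequence[int]) -> List[int]:
    """Return the tokens produced beyond the prompt.

    If the output does not start with the prompt, the whole output is
    treated as newly generated.
    """
    n = len(prompt_ids)
    if list(output_ids[:n]) == list(prompt_ids):
        return list(output_ids[n:])
    return list(output_ids)


def is_echo(
    prompt_ids: Sequence[int],
    output_ids: Sequence[int],
    eos_id: Optional[int] = None,
) -> bool:
    """True if the model produced nothing beyond the prompt (ignoring EOS)."""
    new_tokens = split_generated(prompt_ids, output_ids)
    if eos_id is not None:
        # An immediate end-of-sequence token adds no visible text.
        new_tokens = [t for t in new_tokens if t != eos_id]
    return len(new_tokens) == 0
```

Running the same prompt through the fp32 and int8 models and comparing `is_echo` for both would show whether the behavior appears only after quantization.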

To reproduce

Steps to Reproduce:

1. Convert SeaLLM v2 to ONNX using the Optimum CLI.
2. Quantize the ONNX model from fp32 to int8 using quantize_dynamic().
3. Use the quantized model for a translation task.
4. Observe that the output is the same as the input.
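
The first two steps can be sketched roughly as follows. This is an approximation under stated assumptions, not the exact commands we ran: the `text-generation-with-past` task, the `model.onnx` file name, the `fp32`/`int8` directory layout, and the helper names `build_export_command` and `export_and_quantize` are all assumptions. `use_external_data_format=True` is included because a 7B fp32 model exceeds the 2 GB protobuf limit.

```python
import subprocess
from pathlib import Path
from typing import List


def build_export_command(model_id: str, out_dir: Path) -> List[str]:
    """Build the Optimum CLI call that exports a causal LM to fp32 ONNX."""
    return [
        "optimum-cli", "export", "onnx",
        "--model", model_id,
        "--task", "text-generation-with-past",  # assumed task for a decoder-only LM
        str(out_dir),
    ]


def export_and_quantize(model_id: str, work_dir: Path) -> Path:
    """Export to fp32 ONNX, then apply dynamic int8 quantization."""
    fp32_dir = work_dir / "fp32"
    int8_dir = work_dir / "int8"
    int8_dir.mkdir(parents=True, exist_ok=True)

    # Step 1: fp32 export via the Optimum CLI.
    subprocess.run(build_export_command(model_id, fp32_dir), check=True)

    # Step 2: dynamic quantization of the weights to int8.
    # Imported lazily so the command builder works without onnxruntime.
    from onnxruntime.quantization import QuantType, quantize_dynamic

    int8_path = int8_dir / "model.onnx"  # assumed exported file name
    quantize_dynamic(
        model_input=str(fp32_dir / "model.onnx"),
        model_output=str(int8_path),
        weight_type=QuantType.QInt8,
        use_external_data_format=True,  # needed for models larger than 2 GB
    )
    return int8_path
```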

Urgency

No response

Platform

Linux

OS Version

Ubuntu 20.04.6

ONNX Runtime Installation

Released Package

ONNX Runtime Version or Commit ID

1.18.1

ONNX Runtime API

Python

Architecture

X64

Execution Provider

Default CPU

Execution Provider Library Version

No response

Model File

No response

Is this a quantized model?

Yes

microsoft / onnxruntime

Quantized SeaLLM v2 Model Outputs Same as Input #21636
