Open zjc664656505 opened 11 months ago
As you mentioned, quantization can result in performance degradation. For models with more layers, the degradation can be larger.
For LLMs, we are adding support for 4-bit blockwise quantization. You can try it and test accuracy with this tool: https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/python/tools/quantization/matmul_4bits_quantizer.py.
It is still in progress. Currently only a small set of kernels is optimized: only batch size 1 on GPU has good latency. We are adding kernels for CPU and for batch size > 1 on GPU, targeting release 1.17.
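For intuition, here is a minimal pure-Python sketch of the idea behind blockwise 4-bit quantization. This is an illustration only, not the actual `MatMul4BitsQuantizer` implementation: each block of weights gets its own scale, so a single outlier only reduces the precision of its own block rather than the whole tensor.

```python
def quantize_4bit_blockwise(weights, block_size=4):
    """Symmetric 4-bit blockwise quantization sketch: one scale per block."""
    blocks = []
    for i in range(0, len(weights), block_size):
        block = weights[i:i + block_size]
        # per-block scale maps the largest magnitude in the block to the int4 limit (7)
        scale = max(abs(v) for v in block) / 7 or 1.0
        q = [max(-8, min(7, round(v / scale))) for v in block]
        blocks.append((q, scale))
    return blocks

def dequantize(blocks):
    """Reconstruct approximate float weights from (int4 values, scale) blocks."""
    return [v * s for q, s in blocks for v in q]

# one outlier (4.0) sits in the second block; it only coarsens that block's scale
weights = [0.02, -0.15, 0.31, 0.07, 4.0, -0.29, 0.11, -0.08]
restored = dequantize(quantize_4bit_blockwise(weights))
```

With a single per-tensor scale, the 4.0 outlier would dominate the scale for every weight; with block_size=4, the first block keeps a fine scale of about 0.31/7.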
May I ask whether the int4 model supports mobile ARM CPU?
This issue has been automatically marked as stale due to inactivity and will be closed in 7 days if no further activity occurs. If further support is needed, please provide an update and/or more details.
Describe the issue
Recently, I quantized the Bloom series of LLMs (3B and 1.7B) to int8 using the `quantize_dynamic` API. After quantization, accuracy drops from 0.656 to 0.323 for bloom3b and from 0.975 to 0.792 for bloom1b7. I'm very confused about why this is happening. I understand that quantization can degrade performance, but this drop seems too large. Could I get some help on how to resolve this issue?
To reproduce
I quantize the model using the default `quantize_dynamic` settings:

```python
quantize_dynamic(model_path, quantized_model_path, weight_type=QuantType.QInt8, use_external_data_format=True)
```
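One plausible cause of drops this large (worth checking, not a confirmed diagnosis for these models) is outlier weight values combined with per-tensor scales: a single large value inflates the quantization scale, and the many small values collapse to zero. A pure-Python illustration of the effect, not the onnxruntime implementation:

```python
def quantize_int8_symmetric(values):
    """Simplified symmetric per-tensor int8 quantization (illustration only)."""
    # one scale for the whole tensor, chosen so the largest magnitude maps to 127
    scale = max(abs(v) for v in values) / 127 or 1.0
    return [round(v / scale) for v in values], scale

# a tensor whose values are all small except for one outlier
values = [0.01, -0.02, 0.015, 5.0]
q, scale = quantize_int8_symmetric(values)
# the outlier forces scale ~= 0.039, so the small values quantize to 0 or -1
```

If this matches your weight distributions, options such as the `per_channel` argument of `quantize_dynamic` (available in recent onnxruntime versions) can give each channel its own scale and may recover some accuracy.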
Urgency
This is urgent and I hope it can be resolved as soon as possible.
Platform
Linux
OS Version
22.01
ONNX Runtime Installation
Released Package
ONNX Runtime Version or Commit ID
1.15.1
ONNX Runtime API
Python
Architecture
X86
Execution Provider
Default CPU
Execution Provider Library Version
No response
Model File
No response
Is this a quantized model?
Yes