Open zjc664656505 opened 11 months ago
As you mentioned, quantization can result in performance degradation. For models with more layers, the degradation can be larger.
For LLMs, we are adding support for 4-bit blockwise quantization. You can try it and test accuracy with this tool: https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/python/tools/quantization/matmul_4bits_quantizer.py.
It is still in progress. Currently only a small set of kernels is optimized: only batch size 1 on GPU has good latency. We are adding kernels for CPU and for batch size > 1 on GPU, targeting release 1.17.
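For intuition, here is a minimal pure-Python sketch of the idea behind blockwise 4-bit quantization. This is an illustration only, not the actual `MatMul4BitsQuantizer` implementation: each block of weights gets its own scale, so a single outlier only reduces the precision of its own block rather than the whole tensor.

```python
def quantize_4bit_blockwise(weights, block_size=4):
    """Symmetric 4-bit blockwise quantization sketch: one scale per block."""
    blocks = []
    for i in range(0, len(weights), block_size):
        block = weights[i:i + block_size]
        # per-block scale maps the largest magnitude in the block to the int4 limit (7)
        scale = max(abs(v) for v in block) / 7 or 1.0
        q = [max(-8, min(7, round(v / scale))) for v in block]
        blocks.append((q, scale))
    return blocks

def dequantize(blocks):
    """Reconstruct approximate float weights from (int4 values, scale) blocks."""
    return [v * s for q, s in blocks for v in q]

# one outlier (4.0) sits in the second block; it only coarsens that block's scale
weights = [0.02, -0.15, 0.31, 0.07, 4.0, -0.29, 0.11, -0.08]
restored = dequantize(quantize_4bit_blockwise(weights))
```

With a single per-tensor scale, the 4.0 outlier would dominate the scale for every weight; with block_size=4, the first block keeps a fine scale of about 0.31/7.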
May I ask whether the int4 model supports mobile ARM CPU?
This issue has been automatically marked as stale due to inactivity and will be closed in 7 days if no further activity occurs. If further support is needed, please provide an update and/or more details.
Describe the issue
Recently, I quantized the Bloom series of LLMs (3B and 1.7B) to int8 using the `quantize_dynamic` API. After quantization, accuracy drops from 0.656 to 0.323 for bloom3b and from 0.975 to 0.792 for bloom1b7. I'm very confused about why this is happening. I understand that quantization can degrade performance, but this drop seems too large. Could I get some help on how to resolve this issue?
To reproduce
I quantize the model using the default `quantize_dynamic` settings:

```python
quantize_dynamic(model_path, quantized_model_path, weight_type=QuantType.QInt8, use_external_data_format=True)
```
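One plausible cause of drops this large (worth checking, not a confirmed diagnosis for these models) is outlier weight values combined with per-tensor scales: a single large value inflates the quantization scale, and the many small values collapse to zero. A pure-Python illustration of the effect, not the onnxruntime implementation:

```python
def quantize_int8_symmetric(values):
    """Simplified symmetric per-tensor int8 quantization (illustration only)."""
    # one scale for the whole tensor, chosen so the largest magnitude maps to 127
    scale = max(abs(v) for v in values) / 127 or 1.0
    return [round(v / scale) for v in values], scale

# a tensor whose values are all small except for one outlier
values = [0.01, -0.02, 0.015, 5.0]
q, scale = quantize_int8_symmetric(values)
# the outlier forces scale ~= 0.039, so the small values quantize to 0 or -1
```

If this matches your weight distributions, options such as the `per_channel` argument of `quantize_dynamic` (available in recent onnxruntime versions) can give each channel its own scale and may recover some accuracy.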
Urgency
This is urgent and I hope it can be resolved as soon as possible.
Platform
Linux
OS Version
22.01
ONNX Runtime Installation
Released Package
ONNX Runtime Version or Commit ID
1.15.1
ONNX Runtime API
Python
Architecture
X86
Execution Provider
Default CPU
Execution Provider Library Version
No response
Model File
No response
Is this a quantized model?
Yes