Open · zhangyu68 opened 4 months ago
Describe the issue
I'm trying to quantize a model to int4, but this file only provides weight-only quantization. Is there a way to quantize both the weights and the activations to int4? https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/python/tools/quantization/matmul_4bits_quantizer.py
Thanks for your help!
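For reference, this is roughly how I'm using the weight-only path from that file today. This is a minimal sketch; the file names and block_size are just illustrative:

```python
import onnx
from onnxruntime.quantization.matmul_4bits_quantizer import MatMul4BitsQuantizer

# Load the fp32/fp16 model and quantize MatMul weights to int4, blockwise.
model = onnx.load("llama3-8b.onnx")  # placeholder path
quant = MatMul4BitsQuantizer(model, block_size=32, is_symmetric=True)
quant.process()

# Save with external data, since LLM weights exceed the 2 GB protobuf limit.
quant.model.save_model_to_file("llama3-8b-int4.onnx", use_external_data_format=True)
```

This only rewrites the MatMul weight initializers; the activations stay in floating point.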
Urgency
No response
Target platform
onnx
Build script
python -m onnxruntime.transformers.models.llama.convert_to_onnx -m /publicdata/huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct/ --output llama3-8b-int4-gpu --precision int4 --execution_provider cuda --quantization_method blockwise --use_gqa
Error / output
Expected: the ability to quantize both the weights and the activations to int4, not only the weights.
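For comparison, the static (QDQ) quantization path does quantize both weights and activations, but as far as I can tell it only goes down to int8. A minimal sketch, where the model path, input name, and shapes are placeholders:

```python
import numpy as np
from onnxruntime.quantization import (
    CalibrationDataReader,
    QuantFormat,
    QuantType,
    quantize_static,
)

class DummyReader(CalibrationDataReader):
    """Feeds a few calibration batches; replace with real data."""
    def __init__(self, samples):
        self._iter = iter(samples)
    def get_next(self):
        return next(self._iter, None)

reader = DummyReader(
    [{"input_ids": np.ones((1, 8), dtype=np.int64)} for _ in range(4)]
)
quantize_static(
    "model.onnx",       # placeholder input model
    "model_int8.onnx",  # placeholder output model
    reader,
    quant_format=QuantFormat.QDQ,
    activation_type=QuantType.QInt8,  # int8 is the lowest activation type exposed here
    weight_type=QuantType.QInt8,
)
```

An int4 equivalent of activation_type is what I'm asking about.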
Visual Studio Version
No response
GCC / Compiler Version
No response