microsoft / onnxruntime

ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator
https://onnxruntime.ai
MIT License

[Build] How can I quantize the llama3 model activations to int4? #21334

Open zhangyu68 opened 3 months ago

zhangyu68 commented 3 months ago

Describe the issue

I'm trying to quantize a model to int4, but the file below only provides weight-only quantization. Can I quantize both the weights and the activations to int4? https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/python/tools/quantization/matmul_4bits_quantizer.py
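
For reference, the weight-only usage that file exposes looks roughly like this (a sketch; the file paths and the block_size value are placeholders, not taken from this thread):

```python
import onnx
from onnxruntime.quantization.matmul_4bits_quantizer import MatMul4BitsQuantizer

# Placeholder input path.
model = onnx.load("llama3-8b.onnx")

# Quantizes MatMul weights to 4-bit blockwise; activations remain float.
quant = MatMul4BitsQuantizer(model, block_size=32, is_symmetric=True)
quant.process()

# Placeholder output path; external data format is needed for large models.
quant.model.save_model_to_file("llama3-8b-int4.onnx", use_external_data_format=True)
```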

Thanks for your help!

Urgency

No response

Target platform

onnx

Build script

python -m onnxruntime.transformers.models.llama.convert_to_onnx -m /publicdata/huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct/ --output llama3-8b-int4-gpu --precision int4 --execution_provider cuda --quantization_method blockwise --use_gqa

Error / output

Expected: to be able to quantize both the weights and the activations to int4, not the weights only.
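
For contrast, the generic weight-and-activation entry point I found, onnxruntime.quantization.quantize_static, targets int8 rather than int4. A rough sketch, with a hypothetical calibration reader and placeholder file names:

```python
from onnxruntime.quantization import CalibrationDataReader, QuantType, quantize_static

class LlamaCalibrationReader(CalibrationDataReader):
    """Hypothetical reader that yields {input_name: np.ndarray} dicts."""

    def __init__(self, samples):
        self._iter = iter(samples)

    def get_next(self):
        return next(self._iter, None)  # None signals end of calibration data

# Supply representative model inputs here; an empty list is only for illustration.
reader = LlamaCalibrationReader([])

quantize_static(
    "llama3-8b.onnx",       # placeholder input model
    "llama3-8b-int8.onnx",  # placeholder output model
    reader,
    activation_type=QuantType.QInt8,
    weight_type=QuantType.QInt8,
)
```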

Visual Studio Version

No response

GCC / Compiler Version

No response

github-actions[bot] commented 2 months ago

This issue has been automatically marked as stale due to inactivity and will be closed in 30 days if no further activity occurs. If further support is needed, please provide an update and/or more details.