microsoft / Llama-2-Onnx


Does llama2 support int8 quantization? #16

Open shaonianyr opened 1 year ago

shaonianyr commented 1 year ago

I used this script to build an int8 model, but it failed: https://github.com/microsoft/onnxruntime-inference-examples/tree/main/quantization/language_model/llama
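For context, the simplest int8 route in ONNX Runtime is dynamic quantization. Below is a minimal sketch of that path; the input/output paths are hypothetical placeholders, and the linked example may use a different, more involved recipe:

```python
# Minimal sketch of dynamic int8 quantization with ONNX Runtime.
# The model paths here are hypothetical placeholders.
from onnxruntime.quantization import QuantType, quantize_dynamic

quantize_dynamic(
    model_input="decoder_model.onnx",        # exported FP32/FP16 model (hypothetical path)
    model_output="decoder_model_int8.onnx",  # quantized output (hypothetical path)
    weight_type=QuantType.QInt8,             # store weights as signed int8
    use_external_data_format=True,           # required for >2GB models like Llama 2 7B
)
```

Note that with `use_external_data_format=True` the weights land in a separate `.data` file next to the `.onnx` file; the two must stay together, which is also a common cause of the protobuf error reported below.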

JoshuaElsdon commented 1 year ago

Hello there, we are still exploring which quantization option is most robust for this model. Out of personal interest, I'd like to know the specific error you ran into. Could you copy/paste it here?

shaonianyr commented 1 year ago

Thanks for the reply. The error occurs when loading the int8 ONNX model: "onnxruntime.capi.onnxruntime_pybind11_state.InvalidProtobuf: [ONNXRuntimeError] : 7 : INVALID_PROTOBUF : Load model from /data/int8-onnx/decoder_model_quantized.onnx failed:Protobuf parsing failed."
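A common cause of INVALID_PROTOBUF with large Llama checkpoints is that the model's external weight files were moved or not copied alongside the `.onnx` file; another is a truncated export or download. A quick way to narrow it down, sketched here using the path from the error message:

```python
# Diagnostic sketch for an INVALID_PROTOBUF failure, reusing the path
# from the error above.
import onnx
import onnxruntime as ort

path = "/data/int8-onnx/decoder_model_quantized.onnx"

# If this load also fails, the .onnx file itself is truncated or corrupt
# rather than this being an ONNX Runtime issue.
model = onnx.load(path)  # resolves external .data files next to the model

# Pass the path (not the loaded proto) so >2GB external data can be checked.
onnx.checker.check_model(path)

# If both checks pass, a plain session load should also succeed.
sess = ort.InferenceSession(path, providers=["CPUExecutionProvider"])
print([i.name for i in sess.get_inputs()])
```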

loretoparisi commented 1 year ago

In theory, int8 and int4 should work properly with Llama 2; you can find Q4, Q8, and even Q2 quantizations on the HF Model Hub, though not in the ONNX format (the Q2 variants are GGUF/GGML).
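For anyone who just needs the pre-quantized weights rather than ONNX, a minimal sketch of consuming one of those GGUF quantizations with llama-cpp-python follows; the model filename is hypothetical and assumes a Q8 GGUF file was already downloaded from the Hub:

```python
# Minimal sketch, assuming llama-cpp-python is installed and a Q4/Q8 GGUF
# file has already been downloaded; the path below is hypothetical.
from llama_cpp import Llama

llm = Llama(model_path="llama-2-7b.Q8_0.gguf")  # pre-quantized GGUF weights
out = llm("Q: What is ONNX? A:", max_tokens=32)
print(out["choices"][0]["text"])
```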