Open · shaonianyr opened 1 year ago
Hello there, we are still exploring the most robust quantization option for this model. Out of personal interest, I would like to know the specific error you ran into. Could you copy/paste it here?
Thanks for the reply. The error occurs when loading the int8 ONNX model: "onnxruntime.capi.onnxruntime_pybind11_state.InvalidProtobuf: [ONNXRuntimeError] : 7 : INVALID_PROTOBUF : Load model from /data/int8-onnx/decoder_model_quantized.onnx failed:Protobuf parsing failed."
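In case it helps, here is a minimal sketch of how one might check whether the file itself parses, assuming the path from the error above. Passing `load_external_data=False` parses only the graph protobuf and skips any multi-gigabyte weight files, which helps isolate where the failure comes from:

```python
import onnx

# Path taken from the error message above; adjust for your machine.
model_path = "/data/int8-onnx/decoder_model_quantized.onnx"

try:
    # Parse only the graph protobuf, without loading external weight files.
    model = onnx.load(model_path, load_external_data=False)
    onnx.checker.check_model(model)
    print("Graph protobuf parses; the problem is likely missing or "
          "mismatched external data files.")
except Exception as exc:
    print(f"The .onnx file itself is broken or truncated: {exc}")
```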
In theory, int8 and int4 should work for Llama 2; at least you can find Q4, Q8, and even Q2 quantizations on the HF Model Hub, though not in the ONNX format (it is GGUF / GGML that has the Q2 variants).
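For the ONNX side, ONNX Runtime's generic dynamic-quantization API can produce an int8 model directly. Below is a hedged sketch, not the flow from the linked repo script; the file names are illustrative placeholders:

```python
from onnxruntime.quantization import QuantType, quantize_dynamic

quantize_dynamic(
    model_input="decoder_model.onnx",          # placeholder input path
    model_output="decoder_model_quantized.onnx",
    weight_type=QuantType.QInt8,
    # Llama-sized weights exceed protobuf's 2 GB hard limit, so they must
    # be written as external data files next to the .onnx graph; saving
    # them inline can produce a file that later fails to parse with
    # exactly the INVALID_PROTOBUF error reported above.
    use_external_data_format=True,
)
```

The `use_external_data_format=True` flag is the key design choice here: without it, a model over 2 GB cannot be serialized into a single valid protobuf.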
I used this script to build the int8 model, but it failed: https://github.com/microsoft/onnxruntime-inference-examples/tree/main/quantization/language_model/llama
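One thing worth checking if quantization does produce an output file: when external data is used, the companion weight files must stay in the same directory as the `.onnx` graph, or the session will fail to load. A short sketch of the load step, with the path reused from the error message and an illustrative execution provider:

```python
import onnxruntime as ort

# The external-data weight files emitted during quantization must sit
# next to this .onnx file for the session to load successfully.
session = ort.InferenceSession(
    "/data/int8-onnx/decoder_model_quantized.onnx",
    providers=["CPUExecutionProvider"],
)
print([inp.name for inp in session.get_inputs()])
```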