neuralmagic / AutoFP8

When I use autofp8 to quantize the qwen32b model and test it, the accuracy drops significantly. #24

Closed · zhangfzR closed this issue 3 months ago

zhangfzR commented 3 months ago

```python
from transformers import AutoTokenizer

from auto_fp8 import AutoFP8ForCausalLM, BaseQuantizeConfig

pretrained_model_dir = "/mnt/public/modelist/Qwen1.5-32B"
quantized_model_dir = "/mnt/public/autofp8_qwen_32b_fp8_static_new"

tokenizer = AutoTokenizer.from_pretrained(pretrained_model_dir, use_fast=True)
examples = ["auto_fp8 is an easy-to-use model quantization library"]
examples = tokenizer(examples, return_tensors="pt").to("cuda")

quantize_config = BaseQuantizeConfig(
    quant_method="fp8",
    activation_scheme="static",  # or "dynamic"
    ignore_patterns=["re:.*lm_head"],
)

model = AutoFP8ForCausalLM.from_pretrained(
    pretrained_model_dir, quantize_config=quantize_config
)
model.quantize(examples)
model.save_quantized(quantized_model_dir)
```

I used the code above to quantize the Qwen 32B model and then evaluated the quantized model with lm-evaluation-harness as follows:

```bash
CUDA_VISIBLE_DEVICES=0,1 lm-eval --model vllm \
  --model_args pretrained=/mnt/public/autofp8_qwen_32b_fp8_static_new,max_model_len=8192,tensor_parallel_size=2,max_num_seqs=8,gpu_memory_utilization=0.8 \
  --task mmlu --trust_remote_code
```

"To my surprise, the accuracy dropped significantly."

[screenshot: MMLU results showing the accuracy drop]

Do you have any suggestions on how to address this issue?

zhangfzR commented 3 months ago
[screenshot: warning messages emitted during quantization]

Hi @mgoin, after I quantize the model, these warning messages appear. I'm not sure whether they affect the quantization results. Do you have any suggestions?

mgoin commented 3 months ago

Hi @zhangfzR, you need to use more samples to properly calibrate the static activation scales. Please rerun the flow you described above, based on this example that uses a chat dataset: https://github.com/neuralmagic/AutoFP8/blob/main/example_dataset.py
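For concreteness, here is a minimal sketch of that calibration flow, modeled on the linked example_dataset.py. The dataset (HuggingFaceH4/ultrachat_200k), the 512-sample count, and the truncation length are illustrative assumptions, not requirements:

```python
# Minimal calibration sketch modeled on AutoFP8's example_dataset.py.
# Dataset, sample count, and max_length below are illustrative assumptions.
from datasets import load_dataset
from transformers import AutoTokenizer

from auto_fp8 import AutoFP8ForCausalLM, BaseQuantizeConfig

pretrained_model_dir = "/mnt/public/modelist/Qwen1.5-32B"
quantized_model_dir = "/mnt/public/autofp8_qwen_32b_fp8_static_new"

tokenizer = AutoTokenizer.from_pretrained(pretrained_model_dir, use_fast=True)
tokenizer.pad_token = tokenizer.eos_token

# A few hundred real chat samples give the static activation scales a
# representative range; a single prompt (as in the original script) does not.
ds = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft").select(range(512))
texts = [tokenizer.apply_chat_template(row["messages"], tokenize=False) for row in ds]
examples = tokenizer(
    texts, padding=True, truncation=True, max_length=2048, return_tensors="pt"
).to("cuda")

quantize_config = BaseQuantizeConfig(
    quant_method="fp8",
    activation_scheme="static",
    ignore_patterns=["re:.*lm_head"],
)

model = AutoFP8ForCausalLM.from_pretrained(
    pretrained_model_dir, quantize_config=quantize_config
)
model.quantize(examples)
model.save_quantized(quantized_model_dir)
```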

Please try again with proper calibration data. For reference here is a Qwen2 72B that achieved essentially lossless recovery: https://huggingface.co/neuralmagic/Qwen2-72B-Instruct-FP8
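As a quick sanity check before rerunning the full MMLU evaluation, you can load the new checkpoint directly in vLLM and generate a few tokens. A short sketch, reusing the paths and parallelism settings from the commands above:

```python
# Smoke test: load the FP8 checkpoint in vLLM and generate once.
# The path and settings mirror the lm-eval command above.
from vllm import LLM, SamplingParams

llm = LLM(
    model="/mnt/public/autofp8_qwen_32b_fp8_static_new",
    tensor_parallel_size=2,
    max_model_len=8192,
)
outputs = llm.generate(
    ["auto_fp8 is an easy-to-use model quantization library"],
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```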

I will update the documentation to be more clear about this.

zhangfzR commented 3 months ago

Thank you. When I cloned a fresh copy of the code and reran it, it worked. I might have made some local changes that caused the poor results. This issue can be closed.