intel / auto-round

Advanced Quantization Algorithm for LLMs. This is the official implementation of "Optimize Weight Rounding via Signed Gradient Descent for the Quantization of LLMs"
https://arxiv.org/abs/2309.05516
Apache License 2.0

Unexpected ppl diff #116

Closed YihengBrianWu closed 2 months ago

YihengBrianWu commented 3 months ago

I'm now trying to quantize llama2-7b under the w4a16g128 setting. The script is:

python3 main.py \
  --model_name /mnt/bn/wyh-train/4bit/models/llama2-7b/model \
  --device 0 \
  --group_size 128 \
  --bits 4 \
  --iters 1000 \
  --deployment_device 'fake,cpu,gpu' \
  --output_dir "/mnt/bn/wyh-train/4bit/models/llama2-7b-auto-round"

The result is:

| model | wikitext2 | c4 |
| --- | --- | --- |
| llama2-7b-fp16 | 5.4721 | 6.9727 |
| llama2-7b-w4a16g128 (auto_round) | 10.4401 | 7.4204 |

Any insight here?

wenhuach21 commented 3 months ago

This issue is documented in our paper (https://arxiv.org/pdf/2309.05516v3) in Table 14, with a detailed explanation in Section 4.1. We hypothesize that the perplexity is highly sensitive to outliers. However, our limited tests did not show a significant impact in real deployment. To avoid this issue, setting the minmax lr to 2.0/iterations could be a solution based on my experiments with this model.
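
As a concrete sketch: with --iters 1000, the suggestion works out to a minmax learning rate of 2.0/1000 = 0.002. Assuming main.py exposes this hyperparameter as --minmax_lr (as in the project's example scripts), the original command would become:

python3 main.py \
  --model_name /mnt/bn/wyh-train/4bit/models/llama2-7b/model \
  --device 0 \
  --group_size 128 \
  --bits 4 \
  --iters 1000 \
  --minmax_lr 0.002 \
  --deployment_device 'fake,cpu,gpu' \
  --output_dir "/mnt/bn/wyh-train/4bit/models/llama2-7b-auto-round"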

wenhuach21 commented 3 months ago

Besides, if your GPU memory is enough, you could set --disable_gpu_memory_usage; it typically gives a 1.5x-2x speedup based on my experiments.
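
For reference, the same recipe through the Python API might look like the minimal sketch below. It assumes the AutoRound constructor accepts bits, group_size, iters, minmax_lr, and low_gpu_mem_usage as documented in the project README, with low_gpu_mem_usage=False playing the role of --disable_gpu_memory_usage; treat the exact parameter names as assumptions, since they may differ across versions.

# Minimal sketch of the Python API route; parameter names follow the project
# README and are assumptions, not a verified reproduction script.
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound

model_path = "/mnt/bn/wyh-train/4bit/models/llama2-7b/model"
model = AutoModelForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

autoround = AutoRound(
    model,
    tokenizer,
    bits=4,
    group_size=128,
    iters=1000,
    minmax_lr=2.0 / 1000,      # the suggested 2.0/iterations
    low_gpu_mem_usage=False,   # keep tuning on GPU when memory allows (faster)
)
autoround.quantize()
autoround.save_quantized("/mnt/bn/wyh-train/4bit/models/llama2-7b-auto-round")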

YihengBrianWu commented 3 months ago

> Besides, if your GPU memory is enough, you could set --disable_gpu_memory_usage; it typically gives a 1.5x-2x speedup based on my experiments.

Cool! Thanks for your help!

wenhuach21 commented 2 months ago

Feel free to reopen it if there are more questions.