Closed YihengBrianWu closed 2 months ago
This issue is documented in our paper (https://arxiv.org/pdf/2309.05516v3) in Table 14, with a detailed explanation in Section 4.1. We hypothesize that perplexity is highly sensitive to outliers. However, our limited tests did not show a significant impact in real deployment. Based on my experiments with this model, setting the minmax lr to 2.0/iterations may avoid the issue.
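As a rough illustration of the 2.0/iterations suggestion, here is a minimal Python sketch that computes the value for a given iteration count. The helper name `minmax_lr_for` is hypothetical and not part of auto-round; how the value is passed to the script is not covered here, so check `python3 main.py --help` for the actual option.

```python
def minmax_lr_for(iters: int, scale: float = 2.0) -> float:
    """Return scale / iters, the minmax learning rate suggested in this thread.

    Hypothetical helper for illustration only; it is not part of auto-round.
    """
    if iters <= 0:
        raise ValueError("iters must be a positive integer")
    return scale / iters


if __name__ == "__main__":
    # The command in this thread uses --iters 1000.
    print(minmax_lr_for(1000))
```

With the 1000 iterations used in the command in this thread, this works out to a minmax lr of 0.002.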
Besides, if you have enough GPU memory, you could set --disable_gpu_memory_usage, which typically gives a 1.5x-2x speedup based on my experiments.
Cool! Thanks for your help!
Feel free to reopen it if you have more questions.
I'm now trying to quantize llama2-7b under the w4a16g128 setting. The script is:

    python3 main.py \
        --model_name /mnt/bn/wyh-train/4bit/models/llama2-7b/model \
        --device 0 \
        --group_size 128 \
        --bits 4 \
        --iters 1000 \
        --deployment_device 'fake,cpu,gpu' \
        --output_dir "/mnt/bn/wyh-train/4bit/models/llama2-7b-auto-round"
The result (perplexity) is:

    model                              wikitext2    c4
    llama2-7b-fp16                     5.4721       6.9727
    llama2-7b-w4a16g128(auto_round)    10.4401      7.4204
Any insight here?