microsoft / VPTQ

VPTQ, A Flexible and Extreme low-bit quantization algorithm
MIT License

How to Generate a 2-bit Quantized Meta-Llama-3.1-8B-Instruct Model? #126

Open ForAxel opened 2 days ago

ForAxel commented 2 days ago

I found a similar closed issue related to this topic. Following your reply in that issue, I successfully configured the vptq-algo environment based on the tutorial in the algorithm branch. The Quantization on Meta-Llama-3.1-8B-Instruct section provides an example of using VPTQ to generate a 3-bit quantized Meta-Llama-3.1-8B-Instruct model. However, if I want to generate the 2.3-bit quantized Meta-Llama-3.1-8B-Instruct model provided by VPTQ-community, how should I configure the parameters for run_vptq.py? Specifically, which arguments should I adjust to achieve 2.3-bit quantization? Looking forward to your reply.

YangWang92 commented 1 day ago

Try this one; you can download the Hessian matrices from https://huggingface.co/collections/VPTQ-community/hessian-and-invhessian-checkpoints-66fd249a104850d17b23fd8b .

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python run_vptq.py \
        --model_name Qwen/Qwen2.5-7B-Instruct \
        --output_dir outputs/Qwen2.5-7B-Instruct/ \
        --vector_lens -1 12 \
        --group_num 1 \
        --num_centroids -1 65536 \
        --num_res_centroids -1 4096 \
        --npercent 0 \
        --blocksize 128 \
        --new_eval \
        --seq_len 8192 \
        --kmeans_mode hessian \
        --num_gpus 8 \
        --enable_perm \
        --enable_norm \
        --save_model \
        --save_packed_model \
        --hessian_path Hessians-Qwen2.5-7B-Instruct-6144-8k \
        --inv_hessian_path InvHessians-Qwen2.5-7B-Instruct-6144-8k \
        --ktol 1e-5 --kiter 100
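
For reference, a rough way to see why this configuration lands near 2.3 bits per weight is to divide the index bits by the vector length (overheads such as scales, norms, and permutations are not counted here). A minimal sketch of that arithmetic, assuming only the centroid counts from the command above:

    import math

    vector_len = 12           # --vector_lens -1 12
    num_centroids = 65536     # --num_centroids -1 65536 -> 16-bit index
    num_res_centroids = 4096  # --num_res_centroids -1 4096 -> 12-bit residual index

    index_bits = math.log2(num_centroids) / vector_len          # 16 / 12
    res_index_bits = math.log2(num_res_centroids) / vector_len  # 12 / 12

    print(f"estimated bits per weight: {index_bits + res_index_bits:.2f}")  # ~2.33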
ForAxel commented 1 day ago

Thanks for your response. I noticed that the command you provided is designed for quantizing the Qwen2.5-7B model. Is it possible to directly apply the parameter settings in this command to the 2.3-bit quantization of the LLaMA3.1-8B model? The command I am using is as follows:

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5 python run_vptq.py \
        --model_name meta-llama/Meta-Llama-3.1-8B-Instruct \
        --output_dir outputs/Meta-Llama-3.1-8B-Instruct-2.3bit/ \
        --vector_lens -1 12 \
        --group_num 1 \
        --num_centroids -1 65536 \
        --num_res_centroids -1 4096 \
        --npercent 0 \
        --blocksize 128 \
        --new_eval \
        --seq_len 8192 \
        --kmeans_mode hessian \
        --num_gpus 6 \
        --enable_perm \
        --enable_norm \
        --save_model \
        --save_packed_model \
        --hessian_path Hessians-Llama-31-8B-Instruct-6144-8k \
        --inv_hessian_path InvHessians-Llama-31-8B-Instruct-6144-8k \
        --ktol 1e-5 --kiter 100

Additionally, I would like to better understand the relationship between the run_vptq.py parameters (vector_lens, num_centroids, num_res_centroids, etc.) and the resulting quantized models, particularly the bit-widths of the quantized models published by VPTQ-community. Will this information be released in the future?

I also noticed in Table 10 of the VPTQ paper's appendix that fine-tuning was applied to the 2.3-bit quantized version of the LLaMA-2-13B model. Could you provide more details about this fine-tuning step? Is it available in the source code?

Thank you for your excellent work! I am looking forward to the release of the detailed quantization tutorial.

YangWang92 commented 1 day ago

Yes, you can directly replace the model name and the corresponding Hessian matrices to quantize different models. Additionally, here is a quick guide on setting the quantization parameters: https://github.com/microsoft/VPTQ?tab=readme-ov-file#models-from-open-source-community.
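
Roughly speaking (and ignoring per-layer overheads such as scales and permutations), the estimated bit-width follows from the vector length and the two centroid counts, which is also how VPTQ-community model names like v12-k65536-4096 can be read. A small sketch of that estimate (the listed configurations are illustrative, not an official mapping):

    import math

    def estimated_bits(vector_len, num_centroids, num_res_centroids=0):
        # Index bits plus residual index bits, each amortized over the
        # vector length; quantization overheads are not counted.
        bits = math.log2(num_centroids) / vector_len
        if num_res_centroids > 0:
            bits += math.log2(num_res_centroids) / vector_len
        return bits

    print(estimated_bits(8, 65536))          # ~2.0 bits  (v8-k65536-0)
    print(estimated_bits(8, 65536, 256))     # ~3.0 bits  (v8-k65536-256)
    print(estimated_bits(12, 65536, 4096))   # ~2.33 bits (v12-k65536-4096)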

The fine-tuning code has not been open-sourced yet. It is based on a simple modification of LlamaFactory, and I will release this part soon. Please stay tuned.