ForAxel opened 2 days ago
Try this one; you can download the Hessian matrices from here: https://huggingface.co/collections/VPTQ-community/hessian-and-invhessian-checkpoints-66fd249a104850d17b23fd8b
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python run_vptq.py \
--model_name Qwen/Qwen2.5-7B-Instruct \
--output_dir outputs/Qwen2.5-7B-Instruct/ \
--vector_lens -1 12 \
--group_num 1 \
--num_centroids -1 65536 \
--num_res_centroids -1 4096 \
--npercent 0 \
--blocksize 128 \
--new_eval \
--seq_len 8192 \
--kmeans_mode hessian \
--num_gpus 8 \
--enable_perm \
--enable_norm \
--save_model \
--save_packed_model \
--hessian_path Hessians-Qwen2.5-7B-Instruct-6144-8k \
--inv_hessian_path InvHessians-Qwen2.5-7B-Instruct-6144-8k \
--ktol 1e-5 --kiter 100
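In case it helps, the Hessian and inverse-Hessian checkpoints can also be fetched programmatically. A minimal sketch using huggingface_hub (the repo ids below are my guess from the collection's naming scheme, so please verify them on the collection page):

# pip install huggingface_hub
from huggingface_hub import snapshot_download

# Repo ids assumed from the collection naming; verify on the collection page.
for repo in (
    "VPTQ-community/Hessians-Qwen2.5-7B-Instruct-6144-8k",
    "VPTQ-community/InvHessians-Qwen2.5-7B-Instruct-6144-8k",
):
    # Download into a local directory matching the --hessian_path /
    # --inv_hessian_path arguments used in the command above.
    snapshot_download(repo_id=repo, local_dir=repo.split("/")[-1])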
Thanks for your response. I noticed that the command you provided is designed for quantizing the Qwen2.5-7B model. Is it possible to directly apply the parameter settings in this command to the 2.3-bit quantization of the LLaMA3.1-8B model? The command I am using is as follows:
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5 python run_vptq.py \
--model_name meta-llama/Meta-Llama-3.1-8B-Instruct \
--output_dir outputs/Meta-Llama-3.1-8B-Instruct-2.3bit/ \
--vector_lens -1 12 \
--group_num 1 \
--num_centroids -1 65536 \
--num_res_centroids -1 4096 \
--npercent 0 \
--blocksize 128 \
--new_eval \
--seq_len 8192 \
--kmeans_mode hessian \
--num_gpus 6 \
--enable_perm \
--enable_norm \
--save_model \
--save_packed_model \
--hessian_path Hessians-Llama-31-8B-Instruct-6144-8k \
--inv_hessian_path InvHessians-Llama-31-8B-Instruct-6144-8k \
--ktol 1e-5 --kiter 100
Additionally, I would like to better understand the relationship between the run_vptq.py parameters (such as vector_lens, num_centroids, num_res_centroids, etc.) and the resulting quantized models, particularly in terms of the bit-width of the quantized models in the VPTQ-community. Will this information be released in the future?
I also noticed in Table 10 of the VPTQ paper's appendix that fine-tuning was applied to the 2.3-bit quantized version of the LLaMA2-13B model. Could you provide more details about this fine-tuning operation? Is it available in the source code?
Thank you for your excellent work! I am looking forward to the release of the detailed quantization tutorial.
Yes, you can directly replace the model name and Hessian matrix to quantize different models. Additionally, here is a quick guide on setting the quantization parameters: https://github.com/microsoft/VPTQ?tab=readme-ov-file#models-from-open-source-community.
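To make the parameter-to-bit-width mapping a bit more concrete: my understanding is that each length-vector_lens slice of weights stores one index into the main codebook (num_centroids entries) and one into the residual codebook (num_res_centroids entries), so the index bits amortize over the vector length. A rough back-of-envelope, ignoring codebook storage and outlier overhead:

import math

def effective_bits(vector_len: int, num_centroids: int, num_res_centroids: int) -> float:
    # Index bits per vector (main + residual), amortized over the
    # vector_len weights the vector encodes. Codebook storage and
    # outlier overhead are ignored in this estimate.
    return (math.log2(num_centroids) + math.log2(num_res_centroids)) / vector_len

# Settings from the commands above:
#   --vector_lens -1 12  --num_centroids -1 65536  --num_res_centroids -1 4096
print(effective_bits(12, 65536, 4096))  # (16 + 12) / 12 ≈ 2.33, i.e. the "2.3-bit" configuration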
The fine-tuning code has not been open-sourced yet. It is based on a simple modification of LlamaFactory, and I will release this part soon. Please stay tuned.
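Also, once a run finishes with --save_packed_model, the packed checkpoint can be smoke-tested with the vptq inference package. This is only a sketch based on my reading of the main-branch README; treat the exact API and the output-directory layout as assumptions and double-check them against the README:

# pip install vptq transformers
import transformers
import vptq

# Assumed: the packed model was written to the --output_dir from the command above.
path = "outputs/Meta-Llama-3.1-8B-Instruct-2.3bit/"
tokenizer = transformers.AutoTokenizer.from_pretrained(path)
model = vptq.AutoModelForCausalLM.from_pretrained(path, device_map="auto")

# Generate a short completion to confirm the quantized weights load and run.
inputs = tokenizer("Explain vector quantization in one sentence.", return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(out[0], skip_special_tokens=True))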
I found a similar closed issue related to this topic. Following your reply in that issue, I successfully configured the vptq-algo environment based on the tutorial in the algorithm branch. The Quantization on Meta-Llama-3.1-8B-Instruct section provides an example of using VPTQ to generate a 3-bit quantized Meta-Llama-3.1-8B-Instruct model. However, if I want to generate the 2.3-bit quantized Meta-Llama-3.1-8B-Instruct model provided in VPTQ-community, how should I configure the parameters for run_vptq.py? Specifically, which arguments should I adjust to achieve 2.3-bit quantization? Looking forward to your reply.