Log output from the export run:

[WARNING|tokenization_utils_base.py:3921] 2024-07-25 01:19:29,107 >> Token indices sequence length is longer than the specified maximum sequence length for this model (8860 > 8192). Running this sequence through the model will result in indexing errors
[INFO|quantization_config.py:690] 2024-07-25 01:19:30,025 >> You have activated exllama backend. Note that you can get better inference speed using exllamav2 kernel by setting `exllama_config`.
07/25/2024 01:19:30 - INFO - llamafactory.model.model_utils.quantization - Quantizing model to 4 bit with AutoGPTQ.
07/25/2024 01:19:30 - INFO - llamafactory.model.patcher - Using KV cache for faster generation.
CUDA extension not installed.
CUDA extension not installed.
[INFO|modeling_utils.py:3471] 2024-07-25 01:19:30,886 >> loading weights file export_output/chat_and_sft/sft_qwen72b_300_chat_v48/model.safetensors.index.json
[INFO|modeling_utils.py:1519] 2024-07-25 01:19:30,887 >> Instantiating QWenLMHeadModel model under default dtype torch.float16.
[INFO|configuration_utils.py:962] 2024-07-25 01:19:31,352 >> Generate config GenerationConfig {}
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████| 15/15 [01:40<00:00, 6.73s/it]
[INFO|modeling_utils.py:4280] 2024-07-25 01:21:13,776 >> All model checkpoint weights were used when initializing QWenLMHeadModel.
[INFO|modeling_utils.py:4288] 2024-07-25 01:21:13,776 >> All the weights of QWenLMHeadModel were initialized from the model checkpoint at export_output/chat_and_sft/sft_qwen72b_300_chat_v48/.
If your task is similar to the task the model of the checkpoint was trained on, you can already use QWenLMHeadModel for predictions without further training.
[INFO|configuration_utils.py:915] 2024-07-25 01:21:13,783 >> loading configuration file export_output/chat_and_sft/sft_qwen72b_300_chat_v48/generation_config.json
[INFO|configuration_utils.py:962] 2024-07-25 01:21:13,783 >> Generate config GenerationConfig {
"chat_format": "chatml",
"do_sample": true,
"eos_token_id": 151643,
"max_new_tokens": 512,
"max_window_size": 6144,
"pad_token_id": 151643,
"repetition_penalty": 1.1,
"top_k": 0,
"top_p": 0.8,
"trust_remote_code": true
}
Quantizing transformer.h blocks : 100%|███████████████████████████████████████████████████████████████████████████████████████| 80/80 [58:44<00:00, 44.05s/it]
/usr/local/lib/python3.8/dist-packages/transformers/modeling_utils.py:4481: FutureWarning: `_is_quantized_training_enabled` is going to be deprecated in transformers 4.39.0. Please use `model.hf_quantizer.is_trainable` instead
warnings.warn(
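The two "CUDA extension not installed." lines above are worth flagging: they usually mean auto-gptq could not import its compiled kernels and fell back to much slower ones, which by itself could stretch 4-bit quantization of a 72B model to many hours. A minimal check sketch (assuming auto-gptq is the backend in use here; autogptq_cuda_64/autogptq_cuda_256 are its compiled extension modules):

# Sketch: check whether auto-gptq's compiled CUDA kernels are importable.
# Assumption: a failed import here is what produces the
# "CUDA extension not installed." lines in the log above.
import torch

print("torch sees CUDA:", torch.cuda.is_available())
try:
    import autogptq_cuda_64   # noqa: F401  compiled extension shipped by auto-gptq
    import autogptq_cuda_256  # noqa: F401
    print("auto-gptq CUDA extensions are installed")
except ImportError as exc:
    print("auto-gptq CUDA extensions missing:", exc)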
I have tried many times and always get the same result; the longest run went on for a whole day without producing a quantized model.
My .yaml file is as follows:
### model
model_name_or_path: export_output/chat_and_sft/sft_qwen72b_300_chat_v48/
template: qwen
### export
export_dir: export_output/chat_and_sft/sft_qwen72b_300_chat_v48_gptq_v4/
export_quantization_bit: 4
export_quantization_dataset: src/c4_demo.json
export_size: 2
export_device: cpu
export_legacy_format: false
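One thing I cannot rule out from the log (an assumption on my part) is the calibration data: export_quantization_dataset points at src/c4_demo.json, and a malformed or unexpectedly large file would stall the calibration pass. A quick sanity-check sketch, assuming the file is a JSON array as json.load expects:

# Sketch: confirm the calibration file parses and count its samples.
# Path taken from the yaml above; the JSON-array layout is an assumption
# (a JSON-lines file would make json.load raise instead).
import json

with open("src/c4_demo.json", encoding="utf-8") as f:
    samples = json.load(f)
print("calibration samples:", len(samples))
print("first sample preview:", str(samples[0])[:200])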
System Info

llamafactory version: 0.8.3.dev0

Reproduction
Quantizing qwen72b always gets stuck at a certain point:

CUDA_VISIBLE_DEVICES=1,3,7 llamafactory-cli export my_examples/gptq.yaml
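Note that with export_device: cpu in the yaml, the GPUs selected by CUDA_VISIBLE_DEVICES may sit mostly idle during quantization; this sketch, run in the same environment, prints what the export process actually sees:

# Sketch: list the devices visible under CUDA_VISIBLE_DEVICES=1,3,7
# (inside the process they are remapped to cuda:0..cuda:2).
import torch

print("visible CUDA devices:", torch.cuda.device_count())
for i in range(torch.cuda.device_count()):
    print(f"  cuda:{i} ->", torch.cuda.get_device_name(i))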
Expected behavior
No response
Others
No response