hiyouga / LLaMA-Factory

Unified Efficient Fine-Tuning of 100+ LLMs (ACL 2024)
https://arxiv.org/abs/2403.13372
Apache License 2.0

Quantization gets stuck; many people in the Issues have hit the same problem, but none have a solution #4963

Closed · ConniePK closed this 11 hours ago

ConniePK commented 3 months ago

Reminder

System Info

Reproduction

When quantizing qwen72b, the process always gets stuck at a certain point. Command: CUDA_VISIBLE_DEVICES=1,3,7 llamafactory-cli export my_examples/gptq.yaml

[WARNING|tokenization_utils_base.py:3921] 2024-07-25 01:19:29,107 >> Token indices sequence length is longer than the specified maximum sequence length for this model (8860 > 8192). Running this sequence through the model will result in indexing errors
[INFO|quantization_config.py:690] 2024-07-25 01:19:30,025 >> You have activated exllama backend. Note that you can get better inference speed using exllamav2 kernel by setting `exllama_config`.
07/25/2024 01:19:30 - INFO - llamafactory.model.model_utils.quantization - Quantizing model to 4 bit with AutoGPTQ.
07/25/2024 01:19:30 - INFO - llamafactory.model.patcher - Using KV cache for faster generation.
CUDA extension not installed.
CUDA extension not installed.
[INFO|modeling_utils.py:3471] 2024-07-25 01:19:30,886 >> loading weights file export_output/chat_and_sft/sft_qwen72b_300_chat_v48/model.safetensors.index.json
[INFO|modeling_utils.py:1519] 2024-07-25 01:19:30,887 >> Instantiating QWenLMHeadModel model under default dtype torch.float16.
[INFO|configuration_utils.py:962] 2024-07-25 01:19:31,352 >> Generate config GenerationConfig {}

Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████| 15/15 [01:40<00:00,  6.73s/it]
[INFO|modeling_utils.py:4280] 2024-07-25 01:21:13,776 >> All model checkpoint weights were used when initializing QWenLMHeadModel.

[INFO|modeling_utils.py:4288] 2024-07-25 01:21:13,776 >> All the weights of QWenLMHeadModel were initialized from the model checkpoint at export_output/chat_and_sft/sft_qwen72b_300_chat_v48/.
If your task is similar to the task the model of the checkpoint was trained on, you can already use QWenLMHeadModel for predictions without further training.
[INFO|configuration_utils.py:915] 2024-07-25 01:21:13,783 >> loading configuration file export_output/chat_and_sft/sft_qwen72b_300_chat_v48/generation_config.json
[INFO|configuration_utils.py:962] 2024-07-25 01:21:13,783 >> Generate config GenerationConfig {
  "chat_format": "chatml",
  "do_sample": true,
  "eos_token_id": 151643,
  "max_new_tokens": 512,
  "max_window_size": 6144,
  "pad_token_id": 151643,
  "repetition_penalty": 1.1,
  "top_k": 0,
  "top_p": 0.8,
  "trust_remote_code": true
}

Quantizing transformer.h blocks : 100%|███████████████████████████████████████████████████████████████████████████████████████| 80/80 [58:44<00:00, 44.05s/it]
/usr/local/lib/python3.8/dist-packages/transformers/modeling_utils.py:4481: FutureWarning: `_is_quantized_training_enabled` is going to be deprecated in transformers 4.39.0. Please use `model.hf_quantizer.is_trainable` instead
  warnings.warn(

I have tried many times with the same result; the longest run lasted a full day without producing a quantized model. (screenshot attached)

My .yaml file is as follows:

### model
model_name_or_path: export_output/chat_and_sft/sft_qwen72b_300_chat_v48/
template: qwen

### export
export_dir: export_output/chat_and_sft/sft_qwen72b_300_chat_v48_gptq_v4/
export_quantization_bit: 4
export_quantization_dataset: src/c4_demo.json
export_size: 2
export_device: cpu
export_legacy_format: false
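
For reference, a minimal sanity check for when the export does finish, assuming transformers and auto-gptq are installed (this snippet is illustrative only and not part of LLaMA-Factory; export_dir is the path from the yaml above):

# Minimal sketch: load the exported GPTQ checkpoint and run one generation
# to confirm the quantized model is usable.
from transformers import AutoModelForCausalLM, AutoTokenizer

export_dir = "export_output/chat_and_sft/sft_qwen72b_300_chat_v48_gptq_v4/"

tokenizer = AutoTokenizer.from_pretrained(export_dir, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    export_dir,
    device_map="auto",        # spread the quantized 72B weights across available GPUs
    trust_remote_code=True,   # the Qwen checkpoint here uses custom modeling code
)

inputs = tokenizer("你好,请介绍一下你自己。", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))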

Expected behavior

No response

Others

No response

yugecode commented 2 months ago

I have the same problem. Has anyone found a solution?

jym-coder commented 2 months ago

I ran into the same problem too. Has it been solved? (screenshot attached)

tghfly commented 1 month ago

You could try a smaller model first. Quantizing a Qwen2-1.5B model to GPTQ took me close to 10 minutes, and after the progress bar reached 100% it waited roughly another 3 minutes; during that time it should be writing the model files to disk.
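
Building on that observation, one way to tell a slow disk write apart from a genuine hang is to watch whether the export directory keeps growing while the exporter runs. A minimal sketch (the polling script below is illustrative only and not part of LLaMA-Factory; export_dir is the path from the yaml above):

# Minimal sketch: poll the export directory while llamafactory-cli export runs.
# If the total size keeps growing, the shards are still being written to disk;
# if it stays flat for a long time after the progress bar reaches 100%, the
# process may really be stuck.
import os
import time

export_dir = "export_output/chat_and_sft/sft_qwen72b_300_chat_v48_gptq_v4/"

def dir_size_bytes(path: str) -> int:
    total = 0
    for root, _, files in os.walk(path):
        for name in files:
            try:
                total += os.path.getsize(os.path.join(root, name))
            except OSError:
                pass  # a file may be mid-write or already removed
    return total

previous = -1
while True:
    current = dir_size_bytes(export_dir)
    print(f"{time.strftime('%H:%M:%S')}  export dir size: {current / 1e9:.2f} GB")
    if current == previous:
        print("No growth since the last check; possibly stuck (or still quantizing in memory).")
    previous = current
    time.sleep(60)  # check once per minute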