Log output from the export run:

[WARNING|tokenization_utils_base.py:3921] 2024-07-25 01:19:29,107 >> Token indices sequence length is longer than the specified maximum sequence length for this model (8860 > 8192). Running this sequence through the model will result in indexing errors
[INFO|quantization_config.py:690] 2024-07-25 01:19:30,025 >> You have activated exllama backend. Note that you can get better inference speed using exllamav2 kernel by setting `exllama_config`.
07/25/2024 01:19:30 - INFO - llamafactory.model.model_utils.quantization - Quantizing model to 4 bit with AutoGPTQ.
07/25/2024 01:19:30 - INFO - llamafactory.model.patcher - Using KV cache for faster generation.
CUDA extension not installed.
CUDA extension not installed.
[INFO|modeling_utils.py:3471] 2024-07-25 01:19:30,886 >> loading weights file export_output/chat_and_sft/sft_qwen72b_300_chat_v48/model.safetensors.index.json
[INFO|modeling_utils.py:1519] 2024-07-25 01:19:30,887 >> Instantiating QWenLMHeadModel model under default dtype torch.float16.
[INFO|configuration_utils.py:962] 2024-07-25 01:19:31,352 >> Generate config GenerationConfig {}
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████| 15/15 [01:40<00:00, 6.73s/it]
[INFO|modeling_utils.py:4280] 2024-07-25 01:21:13,776 >> All model checkpoint weights were used when initializing QWenLMHeadModel.
[INFO|modeling_utils.py:4288] 2024-07-25 01:21:13,776 >> All the weights of QWenLMHeadModel were initialized from the model checkpoint at export_output/chat_and_sft/sft_qwen72b_300_chat_v48/.
If your task is similar to the task the model of the checkpoint was trained on, you can already use QWenLMHeadModel for predictions without further training.
[INFO|configuration_utils.py:915] 2024-07-25 01:21:13,783 >> loading configuration file export_output/chat_and_sft/sft_qwen72b_300_chat_v48/generation_config.json
[INFO|configuration_utils.py:962] 2024-07-25 01:21:13,783 >> Generate config GenerationConfig {
"chat_format": "chatml",
"do_sample": true,
"eos_token_id": 151643,
"max_new_tokens": 512,
"max_window_size": 6144,
"pad_token_id": 151643,
"repetition_penalty": 1.1,
"top_k": 0,
"top_p": 0.8,
"trust_remote_code": true
}
Quantizing transformer.h blocks : 100%|███████████████████████████████████████████████████████████████████████████████████████| 80/80 [58:44<00:00, 44.05s/it]
/usr/local/lib/python3.8/dist-packages/transformers/modeling_utils.py:4481: FutureWarning: `_is_quantized_training_enabled` is going to be deprecated in transformers 4.39.0. Please use `model.hf_quantizer.is_trainable` instead
warnings.warn(
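The two "CUDA extension not installed." lines above are worth flagging: they usually mean auto-gptq could not import its compiled kernels and fell back to much slower ones, which by itself could stretch 4-bit quantization of a 72B model to many hours. A minimal check sketch (assuming auto-gptq is the backend in use here; autogptq_cuda_64/autogptq_cuda_256 are its compiled extension modules):

# Sketch: check whether auto-gptq's compiled CUDA kernels are importable.
# Assumption: a failed import here is what produces the
# "CUDA extension not installed." lines in the log above.
import torch

print("torch sees CUDA:", torch.cuda.is_available())
try:
    import autogptq_cuda_64   # noqa: F401  compiled extension shipped by auto-gptq
    import autogptq_cuda_256  # noqa: F401
    print("auto-gptq CUDA extensions are installed")
except ImportError as exc:
    print("auto-gptq CUDA extensions missing:", exc)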
I have tried many times and always get the same result; the longest run went on for a whole day without producing a quantized model.
My .yaml file is as follows:
### model
model_name_or_path: export_output/chat_and_sft/sft_qwen72b_300_chat_v48/
template: qwen
### export
export_dir: export_output/chat_and_sft/sft_qwen72b_300_chat_v48_gptq_v4/
export_quantization_bit: 4
export_quantization_dataset: src/c4_demo.json
export_size: 2
export_device: cpu
export_legacy_format: false
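One thing I cannot rule out from the log (an assumption on my part) is the calibration data: export_quantization_dataset points at src/c4_demo.json, and a malformed or unexpectedly large file would stall the calibration pass. A quick sanity-check sketch, assuming the file is a JSON array as json.load expects:

# Sketch: confirm the calibration file parses and count its samples.
# Path taken from the yaml above; the JSON-array layout is an assumption
# (a JSON-lines file would make json.load raise instead).
import json

with open("src/c4_demo.json", encoding="utf-8") as f:
    samples = json.load(f)
print("calibration samples:", len(samples))
print("first sample preview:", str(samples[0])[:200])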
System Info

llamafactory version: 0.8.3.dev0

Reproduction
Quantizing qwen72b always gets stuck at a certain point:

CUDA_VISIBLE_DEVICES=1,3,7 llamafactory-cli export my_examples/gptq.yaml
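Note that with export_device: cpu in the yaml, the GPUs selected by CUDA_VISIBLE_DEVICES may sit mostly idle during quantization; this sketch, run in the same environment, prints what the export process actually sees:

# Sketch: list the devices visible under CUDA_VISIBLE_DEVICES=1,3,7
# (inside the process they are remapped to cuda:0..cuda:2).
import torch

print("visible CUDA devices:", torch.cuda.device_count())
for i in range(torch.cuda.device_count()):
    print(f"  cuda:{i} ->", torch.cuda.get_device_name(i))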
Expected behavior
No response
Others
No response