hiyouga / LLaMA-Factory

Unified Efficient Fine-Tuning of 100+ LLMs (ACL 2024)
https://arxiv.org/abs/2403.13372
Apache License 2.0

Training with THUDM/chatglm3-6b and the default datasets keeps reporting "Failed" #3569

Closed: cfanbo closed this issue 5 months ago

cfanbo commented 5 months ago

Reproduction

This is my first time working with this subject.

This is a PC with an RTX 3070 GPU (16 GB). Training keeps ending with a "Failed." result, so "Loss" contains no data at all.

Below is the final command generated by the web UI:

CUDA_VISIBLE_DEVICES=0 llamafactory-cli train \
    --stage sft \
    --do_train True \
    --model_name_or_path THUDM/chatglm3-6b \
    --finetuning_type lora \
    --template chatglm3 \
    --flash_attn auto \
    --dataset_dir data \
    --dataset alpaca_gpt4_zh,identity \
    --cutoff_len 1024 \
    --learning_rate 5e-05 \
    --num_train_epochs 3.0 \
    --max_samples 100000 \
    --per_device_train_batch_size 2 \
    --gradient_accumulation_steps 8 \
    --lr_scheduler_type cosine \
    --max_grad_norm 1.0 \
    --logging_steps 5 \
    --save_steps 100 \
    --warmup_steps 0 \
    --optim adamw_torch \
    --packing False \
    --report_to none \
    --output_dir saves\ChatGLM3-6B-Chat\lora\train_2024-05-04-22-08-55 \
    --fp16 True \
    --lora_rank 8 \
    --lora_alpha 16 \
    --lora_dropout 0 \
    --use_dora True \
    --lora_target all \
    --plot_loss True

Output shown in the terminal:

(llama_factory) C:\Users\Administrator\sxf_workspace\LLaMA-Factory>llamafactory-cli webui
bin C:\ProgramData\Anaconda3\envs\llama_factory\lib\site-packages\bitsandbytes\libbitsandbytes_cuda121.dll
Running on local URL:  http://127.0.0.1:7860

To create a public link, set `share=True` in `launch()`.
bin C:\ProgramData\Anaconda3\envs\llama_factory\lib\site-packages\bitsandbytes\libbitsandbytes_cuda121.dll
05/04/2024 22:19:35 - INFO - llmtuner.hparams.parser - Process rank: 0, device: cuda:0, n_gpu: 1, distributed training: False, compute dtype: torch.float16
C:\ProgramData\Anaconda3\envs\llama_factory\lib\site-packages\huggingface_hub\file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
[INFO|tokenization_utils_base.py:2087] 2024-05-04 22:19:55,214 >> loading file tokenizer.model from cache at C:\Users\Administrator\.cache\huggingface\hub\models--THUDM--chatglm3-6b\snapshots\103caa40027ebfd8450289ca2f278eac4ff26405\tokenizer.model
[INFO|tokenization_utils_base.py:2087] 2024-05-04 22:19:55,215 >> loading file added_tokens.json from cache at None
[INFO|tokenization_utils_base.py:2087] 2024-05-04 22:19:55,215 >> loading file special_tokens_map.json from cache at C:\Users\Administrator\.cache\huggingface\hub\models--THUDM--chatglm3-6b\snapshots\103caa40027ebfd8450289ca2f278eac4ff26405\special_tokens_map.json
[INFO|tokenization_utils_base.py:2087] 2024-05-04 22:19:55,216 >> loading file tokenizer_config.json from cache at C:\Users\Administrator\.cache\huggingface\hub\models--THUDM--chatglm3-6b\snapshots\103caa40027ebfd8450289ca2f278eac4ff26405\tokenizer_config.json
[INFO|tokenization_utils_base.py:2087] 2024-05-04 22:19:55,216 >> loading file tokenizer.json from cache at None
05/04/2024 22:19:55 - INFO - llmtuner.data.template - Add <|user|>,<|observation|> to stop words.
05/04/2024 22:19:55 - INFO - llmtuner.data.template - Cannot add this chat template to tokenizer.
05/04/2024 22:19:55 - INFO - llmtuner.data.loader - Loading dataset alpaca_gpt4_data_zh.json...
05/04/2024 22:19:56 - INFO - llmtuner.data.loader - Loading dataset identity.json...
input_ids:
[64790, 64792, 64795, 30910, 13, 30910, 31983, 35959, 32474, 34128, 31155, 64796, 30910, 13, 30910, 49141, 31983, 35959, 32474, 34128, 31211, 13, 13, 30939, 30930, 30910, 31983, 31902, 31651, 31155, 32096, 54725, 40215, 31902, 31903, 31123, 54627, 40657, 31201, 38187, 54746, 35384, 31123, 54558, 32079, 38771, 31740, 31123, 32316, 34779, 31996, 31123, 54724, 35434, 32382, 36490, 31155, 13, 13, 30943, 30930, 30910, 37167, 33296, 31155, 32096, 33777, 47049, 33908, 31201, 34396, 31201, 54580, 55801, 54679, 54542, 34166, 34446, 41635, 35471, 32445, 31123, 32317, 54589, 55611, 31201, 54589, 34166, 54542, 33185, 32357, 31123, 54548, 31983, 35959, 49339, 31155, 13, 13, 30966, 30930, 30910, 34192, 35285, 31155, 34192, 48191, 31740, 44323, 31123, 35315, 32096, 54720, 32444, 30910, 30981, 30941, 30973, 30910, 44442, 34192, 31155, 32775, 34192, 35434, 35763, 32507, 31123, 32079, 31902, 32683, 31123, 54724, 31803, 31937, 34757, 49510, 31155, 2]
inputs:
[gMASK] sop <|user|>
 保持健康的三个提示。 <|assistant|>
 以下是保持健康的三个提示:

1. 保持身体活动。每天做适当的身体运动,如散步、跑步或游泳,能促进心血管健康,增强肌肉力量,并有助于减少体重。

2. 均衡饮食。每天食用新鲜的蔬菜、水果、全谷物和脂肪含量低的蛋白质食物,避免高糖、高脂肪和加工食品,以保持健康的饮食习惯。

3. 睡眠充足。睡眠对人体健康至关重要,成年人每天应保证 7-8 小时的睡眠。良好的睡眠有助于减轻压力,促进身体恢复,并提高注意力和记忆力。
label_ids:
[-100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, 30910, 13, 30910, 49141, 31983, 35959, 32474, 34128, 31211, 13, 13, 30939, 30930, 30910, 31983, 31902, 31651, 31155, 32096, 54725, 40215, 31902, 31903, 31123, 54627, 40657, 31201, 38187, 54746, 35384, 31123, 54558, 32079, 38771, 31740, 31123, 32316, 34779, 31996, 31123, 54724, 35434, 32382, 36490, 31155, 13, 13, 30943, 30930, 30910, 37167, 33296, 31155, 32096, 33777, 47049, 33908, 31201, 34396, 31201, 54580, 55801, 54679, 54542, 34166, 34446, 41635, 35471, 32445, 31123, 32317, 54589, 55611, 31201, 54589, 34166, 54542, 33185, 32357, 31123, 54548, 31983, 35959, 49339, 31155, 13, 13, 30966, 30930, 30910, 34192, 35285, 31155, 34192, 48191, 31740, 44323, 31123, 35315, 32096, 54720, 32444, 30910, 30981, 30941, 30973, 30910, 44442, 34192, 31155, 32775, 34192, 35434, 35763, 32507, 31123, 32079, 31902, 32683, 31123, 54724, 31803, 31937, 34757, 49510, 31155, 2]
labels:

 以下是保持健康的三个提示:

1. 保持身体活动。每天做适当的身体运动,如散步、跑步或游泳,能促进心血管健康,增强肌肉力量,并有助于减少体重。

2. 均衡饮食。每天食用新鲜的蔬菜、水果、全谷物和脂肪含量低的蛋白质食物,避免高糖、高脂肪和加工食品,以保持健康的饮食习惯。

3. 睡眠充足。睡眠对人体健康至关重要,成年人每天应保证 7-8 小时的睡眠。良好的睡眠有助于减轻压力,促进身体恢复,并提高注意力和记忆力。
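(A note on the dump above: the runs of -100 in label_ids are PyTorch's default ignore_index for cross-entropy, which is how the prompt tokens are masked out so that only the response tokens contribute to the training loss. A minimal sketch of that convention with toy tensors, not the actual trainer code:)

import torch
import torch.nn.functional as F

# Five "token" positions over a toy vocabulary of 10.
logits = torch.randn(5, 10)

# The first two positions (the prompt) are labeled -100, PyTorch's
# default ignore_index, so they are excluded from the loss entirely.
labels = torch.tensor([-100, -100, 3, 7, 2])

loss = F.cross_entropy(logits, labels, ignore_index=-100)
print(loss)  # averaged over the three unmasked (response) positions only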
[INFO|configuration_utils.py:726] 2024-05-04 22:20:07,409 >> loading configuration file config.json from cache at C:\Users\Administrator\.cache\huggingface\hub\models--THUDM--chatglm3-6b\snapshots\103caa40027ebfd8450289ca2f278eac4ff26405\config.json
[INFO|configuration_utils.py:726] 2024-05-04 22:20:27,429 >> loading configuration file config.json from cache at C:\Users\Administrator\.cache\huggingface\hub\models--THUDM--chatglm3-6b\snapshots\103caa40027ebfd8450289ca2f278eac4ff26405\config.json
[INFO|configuration_utils.py:789] 2024-05-04 22:20:27,433 >> Model config ChatGLMConfig {
  "_name_or_path": "THUDM/chatglm3-6b",
  "add_bias_linear": false,
  "add_qkv_bias": true,
  "apply_query_key_layer_scaling": true,
  "apply_residual_connection_post_layernorm": false,
  "architectures": [
    "ChatGLMModel"
  ],
  "attention_dropout": 0.0,
  "attention_softmax_in_fp32": true,
  "auto_map": {
    "AutoConfig": "THUDM/chatglm3-6b--configuration_chatglm.ChatGLMConfig",
    "AutoModel": "THUDM/chatglm3-6b--modeling_chatglm.ChatGLMForConditionalGeneration",
    "AutoModelForCausalLM": "THUDM/chatglm3-6b--modeling_chatglm.ChatGLMForConditionalGeneration",
    "AutoModelForSeq2SeqLM": "THUDM/chatglm3-6b--modeling_chatglm.ChatGLMForConditionalGeneration",
    "AutoModelForSequenceClassification": "THUDM/chatglm3-6b--modeling_chatglm.ChatGLMForSequenceClassification"
  },
  "bias_dropout_fusion": true,
  "classifier_dropout": null,
  "eos_token_id": 2,
  "ffn_hidden_size": 13696,
  "fp32_residual_connection": false,
  "hidden_dropout": 0.0,
  "hidden_size": 4096,
  "kv_channels": 128,
  "layernorm_epsilon": 1e-05,
  "model_type": "chatglm",
  "multi_query_attention": true,
  "multi_query_group_num": 2,
  "num_attention_heads": 32,
  "num_layers": 28,
  "original_rope": true,
  "pad_token_id": 0,
  "padded_vocab_size": 65024,
  "post_layer_norm": true,
  "pre_seq_len": null,
  "prefix_projection": false,
  "quantization_bit": 0,
  "rmsnorm": true,
  "seq_length": 8192,
  "tie_word_embeddings": false,
  "torch_dtype": "float16",
  "transformers_version": "4.40.1",
  "use_cache": true,
  "vocab_size": 65024
}

[INFO|modeling_utils.py:3429] 2024-05-04 22:20:37,556 >> loading weights file model.safetensors from cache at C:\Users\Administrator\.cache\huggingface\hub\models--THUDM--chatglm3-6b\snapshots\103caa40027ebfd8450289ca2f278eac4ff26405\model.safetensors.index.json
[INFO|modeling_utils.py:1494] 2024-05-04 22:20:37,560 >> Instantiating ChatGLMForConditionalGeneration model under default dtype torch.float16.
[INFO|configuration_utils.py:928] 2024-05-04 22:20:37,560 >> Generate config GenerationConfig {
  "eos_token_id": 2,
  "pad_token_id": 0
}
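(As an aside: the ChatGLMConfig above is enough to ballpark why this run is tight on memory. A rough parameter count, assuming the standard GLM block layout of multi-query attention, a gated MLP, and RMSNorm, so the totals are estimates rather than values read from the actual modeling code:)

# Rough parameter count for ChatGLM3-6B, using the ChatGLMConfig values above.
# Assumes the standard GLM block layout (MQA, gated/SwiGLU MLP, RMSNorm);
# the real modeling code may differ slightly, so treat this as an estimate.
hidden, layers, ffn = 4096, 28, 13696
vocab, kv_groups, kv_channels = 65024, 2, 128

embed = vocab * hidden                                  # input embeddings
head = vocab * hidden                                   # untied LM head
qkv = hidden * (hidden + 2 * kv_groups * kv_channels)   # multi-query QKV
attn_out = hidden * hidden                              # attention output proj
mlp = hidden * (2 * ffn) + ffn * hidden                 # gated up + down proj
norms = 2 * hidden                                      # two RMSNorms per layer

total = embed + head + layers * (qkv + attn_out + mlp + norms)
print(f"~{total / 1e9:.2f}B parameters")                # ~6.24B
print(f"~{total * 2 / 2**30:.1f} GiB of fp16 weights")  # ~11.6 GiB

That is roughly 11.6 GiB of fp16 weights alone, before activations, the DoRA adapter's gradients and optimizer states, and the CUDA context, which leaves very little headroom on the card described above.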

Expected behavior

The Loss chart renders data.

System Info


(llama_factory) C:\Users\Administrator\sxf_workspace\LLaMA-Factory>transformers-cli env

Copy-and-paste the text below in your GitHub issue and FILL OUT the two last points.
- `transformers` version: 4.40.1
- Platform: Windows-10-10.0.23560-SP0
- Python version: 3.10.14
- Huggingface_hub version: 0.23.0
- Safetensors version: 0.4.3
- Accelerate version: 0.30.0
- Accelerate config:    not found
- PyTorch version (GPU?): 2.3.0+cu121 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: <fill in>
- Using distributed or parallel set-up in script?: <fill in>

Others

_No response_
hiyouga commented 5 months ago

The error message is incomplete.

cfanbo commented 5 months ago

> The error message is incomplete.

The DOS command line shows only this information.

hiyouga commented 5 months ago

This alone is not enough for us to help.

cfanbo commented 5 months ago

Apart from here, where else can I view the log output? Also, I found the following in the logs above:

05/04/2024 22:19:55 - INFO - llmtuner.data.template - Cannot add this chat template to tokenizer.

Could this be what's causing it?

hiyouga commented 5 months ago

Try training from the command line.

cfanbo commented 5 months ago

The result is the same. This time only the alpaca_gpt4_zh dataset was used. Could it be related to the cache?

(llama_factory) C:\Users\Administrator\sxf_workspace\LLaMA-Factory>llamafactory-cli train --stage sft --do_train True --model_name_or_path THUDM/chatglm3-6b --finetuning_type lora --template chatglm3 --flash_attn auto --dataset_dir data --dataset alpaca_gpt4_zh --cutoff_len 1024 --learning_rate 5e-05 --num_train_epochs 3.0 --max_samples 100000 --per_device_train_batch_size 2 --gradient_accumulation_steps 8 --lr_scheduler_type cosine --max_grad_norm 1.0 --logging_steps 5 --save_steps 100 --warmup_steps 0 --optim adamw_torch --packing False --report_to none --output_dir savesLM3-6B-Chat_2024-05-04-22-51-59 --fp16 True --lora_rank 8 --lora_alpha 16 --lora_dropout 0 --use_dora True --lora_target all --plot_loss True
bin C:\ProgramData\Anaconda3\envs\llama_factory\lib\site-packages\bitsandbytes\libbitsandbytes_cuda121.dll
05/04/2024 22:52:02 - INFO - llmtuner.hparams.parser - Process rank: 0, device: cuda:0, n_gpu: 1, distributed training: False, compute dtype: torch.float16
C:\ProgramData\Anaconda3\envs\llama_factory\lib\site-packages\huggingface_hub\file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
[INFO|tokenization_utils_base.py:2087] 2024-05-04 22:52:22,488 >> loading file tokenizer.model from cache at C:\Users\Administrator\.cache\huggingface\hub\models--THUDM--chatglm3-6b\snapshots\103caa40027ebfd8450289ca2f278eac4ff26405\tokenizer.model
[INFO|tokenization_utils_base.py:2087] 2024-05-04 22:52:22,488 >> loading file added_tokens.json from cache at None
[INFO|tokenization_utils_base.py:2087] 2024-05-04 22:52:22,488 >> loading file special_tokens_map.json from cache at C:\Users\Administrator\.cache\huggingface\hub\models--THUDM--chatglm3-6b\snapshots\103caa40027ebfd8450289ca2f278eac4ff26405\special_tokens_map.json
[INFO|tokenization_utils_base.py:2087] 2024-05-04 22:52:22,488 >> loading file tokenizer_config.json from cache at C:\Users\Administrator\.cache\huggingface\hub\models--THUDM--chatglm3-6b\snapshots\103caa40027ebfd8450289ca2f278eac4ff26405\tokenizer_config.json
[INFO|tokenization_utils_base.py:2087] 2024-05-04 22:52:22,488 >> loading file tokenizer.json from cache at None
Setting eos_token is not supported, use the default one.
Setting pad_token is not supported, use the default one.
Setting unk_token is not supported, use the default one.
05/04/2024 22:52:22 - INFO - llmtuner.data.template - Add <|user|>,<|observation|> to stop words.
05/04/2024 22:52:22 - INFO - llmtuner.data.template - Cannot add this chat template to tokenizer.
05/04/2024 22:52:22 - INFO - llmtuner.data.loader - Loading dataset alpaca_gpt4_data_zh.json...
input_ids:
[64790, 64792, 64795, 30910, 13, 30910, 31983, 35959, 32474, 34128, 31155, 64796, 30910, 13, 30910, 49141, 31983, 35959, 32474, 34128, 31211, 13, 13, 30939, 30930, 30910, 31983, 31902, 31651, 31155, 32096, 54725, 40215, 31902, 31903, 31123, 54627, 40657, 31201, 38187, 54746, 35384, 31123, 54558, 32079, 38771, 31740, 31123, 32316, 34779, 31996, 31123, 54724, 35434, 32382, 36490, 31155, 13, 13, 30943, 30930, 30910, 37167, 33296, 31155, 32096, 33777, 47049, 33908, 31201, 34396, 31201, 54580, 55801, 54679, 54542, 34166, 34446, 41635, 35471, 32445, 31123, 32317, 54589, 55611, 31201, 54589, 34166, 54542, 33185, 32357, 31123, 54548, 31983, 35959, 49339, 31155, 13, 13, 30966, 30930, 30910, 34192, 35285, 31155, 34192, 48191, 31740, 44323, 31123, 35315, 32096, 54720, 32444, 30910, 30981, 30941, 30973, 30910, 44442, 34192, 31155, 32775, 34192, 35434, 35763, 32507, 31123, 32079, 31902, 32683, 31123, 54724, 31803, 31937, 34757, 49510, 31155, 2]
inputs:
[gMASK] sop <|user|>
 保持健康的三个提示。 <|assistant|>
 以下是保持健康的三个提示:

1. 保持身体活动。每天做适当的身体运动,如散步、跑步或游泳,能促进心血管健康,增强肌肉力量,并有助于减少体重。

2. 均衡饮食。每天食用新鲜的蔬菜、水果、全谷物和脂肪含量低的蛋白质食物,避免高糖、高脂肪和加工食品,以保持健康的饮食习惯。

3. 睡眠充足。睡眠对人体健康至关重要,成年人每天应保证 7-8 小时的睡眠。良好的睡眠有助于减轻压力,促进身体恢复,并提高注意力和记忆力。
label_ids:
[-100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, 30910, 13, 30910, 49141, 31983, 35959, 32474, 34128, 31211, 13, 13, 30939, 30930, 30910, 31983, 31902, 31651, 31155, 32096, 54725, 40215, 31902, 31903, 31123, 54627, 40657, 31201, 38187, 54746, 35384, 31123, 54558, 32079, 38771, 31740, 31123, 32316, 34779, 31996, 31123, 54724, 35434, 32382, 36490, 31155, 13, 13, 30943, 30930, 30910, 37167, 33296, 31155, 32096, 33777, 47049, 33908, 31201, 34396, 31201, 54580, 55801, 54679, 54542, 34166, 34446, 41635, 35471, 32445, 31123, 32317, 54589, 55611, 31201, 54589, 34166, 54542, 33185, 32357, 31123, 54548, 31983, 35959, 49339, 31155, 13, 13, 30966, 30930, 30910, 34192, 35285, 31155, 34192, 48191, 31740, 44323, 31123, 35315, 32096, 54720, 32444, 30910, 30981, 30941, 30973, 30910, 44442, 34192, 31155, 32775, 34192, 35434, 35763, 32507, 31123, 32079, 31902, 32683, 31123, 54724, 31803, 31937, 34757, 49510, 31155, 2]
labels:

 以下是保持健康的三个提示:

1. 保持身体活动。每天做适当的身体运动,如散步、跑步或游泳,能促进心血管健康,增强肌肉力量,并有助于减少体重。

2. 均衡饮食。每天食用新鲜的蔬菜、水果、全谷物和脂肪含量低的蛋白质食物,避免高糖、高脂肪和加工食品,以保持健康的饮食习惯。

3. 睡眠充足。睡眠对人体健康至关重要,成年人每天应保证 7-8 小时的睡眠。良好的睡眠有助于减轻压力,促进身体恢复,并提高注意力和记忆力。
[INFO|configuration_utils.py:726] 2024-05-04 22:52:33,744 >> loading configuration file config.json from cache at C:\Users\Administrator\.cache\huggingface\hub\models--THUDM--chatglm3-6b\snapshots\103caa40027ebfd8450289ca2f278eac4ff26405\config.json
[INFO|configuration_utils.py:726] 2024-05-04 22:52:53,764 >> loading configuration file config.json from cache at C:\Users\Administrator\.cache\huggingface\hub\models--THUDM--chatglm3-6b\snapshots\103caa40027ebfd8450289ca2f278eac4ff26405\config.json
[INFO|configuration_utils.py:789] 2024-05-04 22:52:53,765 >> Model config ChatGLMConfig {
  "_name_or_path": "THUDM/chatglm3-6b",
  "add_bias_linear": false,
  "add_qkv_bias": true,
  "apply_query_key_layer_scaling": true,
  "apply_residual_connection_post_layernorm": false,
  "architectures": [
    "ChatGLMModel"
  ],
  "attention_dropout": 0.0,
  "attention_softmax_in_fp32": true,
  "auto_map": {
    "AutoConfig": "THUDM/chatglm3-6b--configuration_chatglm.ChatGLMConfig",
    "AutoModel": "THUDM/chatglm3-6b--modeling_chatglm.ChatGLMForConditionalGeneration",
    "AutoModelForCausalLM": "THUDM/chatglm3-6b--modeling_chatglm.ChatGLMForConditionalGeneration",
    "AutoModelForSeq2SeqLM": "THUDM/chatglm3-6b--modeling_chatglm.ChatGLMForConditionalGeneration",
    "AutoModelForSequenceClassification": "THUDM/chatglm3-6b--modeling_chatglm.ChatGLMForSequenceClassification"
  },
  "bias_dropout_fusion": true,
  "classifier_dropout": null,
  "eos_token_id": 2,
  "ffn_hidden_size": 13696,
  "fp32_residual_connection": false,
  "hidden_dropout": 0.0,
  "hidden_size": 4096,
  "kv_channels": 128,
  "layernorm_epsilon": 1e-05,
  "model_type": "chatglm",
  "multi_query_attention": true,
  "multi_query_group_num": 2,
  "num_attention_heads": 32,
  "num_layers": 28,
  "original_rope": true,
  "pad_token_id": 0,
  "padded_vocab_size": 65024,
  "post_layer_norm": true,
  "pre_seq_len": null,
  "prefix_projection": false,
  "quantization_bit": 0,
  "rmsnorm": true,
  "seq_length": 8192,
  "tie_word_embeddings": false,
  "torch_dtype": "float16",
  "transformers_version": "4.40.1",
  "use_cache": true,
  "vocab_size": 65024
}

[INFO|modeling_utils.py:3429] 2024-05-04 22:53:03,848 >> loading weights file model.safetensors from cache at C:\Users\Administrator\.cache\huggingface\hub\models--THUDM--chatglm3-6b\snapshots\103caa40027ebfd8450289ca2f278eac4ff26405\model.safetensors.index.json
[INFO|modeling_utils.py:1494] 2024-05-04 22:53:03,852 >> Instantiating ChatGLMForConditionalGeneration model under default dtype torch.float16.
[INFO|configuration_utils.py:928] 2024-05-04 22:53:03,853 >> Generate config GenerationConfig {
  "eos_token_id": 2,
  "pad_token_id": 0
}

(llama_factory) C:\Users\Administrator\sxf_workspace\LLaMA-Factory>
cfanbo commented 5 months ago

I suspect it was caused by insufficient VRAM, yet there was no error message at all. Later I switched to a machine with 22 GB of VRAM and training worked fine.
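(For anyone who lands here later: a silent "Failed." with no traceback is consistent with the process being killed by memory exhaustion. A quick way to check free VRAM before launching, using PyTorch's built-in query:)

import torch

# torch.cuda.mem_get_info() returns (free_bytes, total_bytes) for the
# current device; it is available in recent PyTorch releases such as
# the 2.3.0 used in this issue.
free, total = torch.cuda.mem_get_info()
gib = 2**30
print(f"free: {free / gib:.1f} GiB / total: {total / gib:.1f} GiB")

If free memory is well under the ~12 GiB the fp16 weights alone need, a run like the one above has little chance of surviving; lowering cutoff_len and the batch size, or using LLaMA-Factory's quantized (QLoRA) training if your version supports it, can reduce the footprint.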