Training with THUDM/chatglm3-6b and the default datasets always reports "Failed" #3569

Closed cfanbo closed 5 months ago

cfanbo commented 5 months ago




This is a PC with an RTX 3070 GPU (16G). Training ends with a "Failed." result, so the "Loss" plot shows no data.


CUDA_VISIBLE_DEVICES=0 llamafactory-cli train \
    --stage sft \
    --do_train True \
    --model_name_or_path THUDM/chatglm3-6b \
    --finetuning_type lora \
    --template chatglm3 \
    --flash_attn auto \
    --dataset_dir data \
    --dataset alpaca_gpt4_zh,identity \
    --cutoff_len 1024 \
    --learning_rate 5e-05 \
    --num_train_epochs 3.0 \
    --max_samples 100000 \
    --per_device_train_batch_size 2 \
    --gradient_accumulation_steps 8 \
    --lr_scheduler_type cosine \
    --max_grad_norm 1.0 \
    --logging_steps 5 \
    --save_steps 100 \
    --warmup_steps 0 \
    --optim adamw_torch \
    --packing False \
    --report_to none \
    --output_dir saves\ChatGLM3-6B-Chat\lora\train_2024-05-04-22-08-55 \
    --fp16 True \
    --lora_rank 8 \
    --lora_alpha 16 \
    --lora_dropout 0 \
    --use_dora True \
    --lora_target all \
    --plot_loss True


(llama_factory) C:\Users\Administrator\sxf_workspace\LLaMA-Factory>llamafactory-cli webui
bin C:\ProgramData\Anaconda3\envs\llama_factory\lib\site-packages\bitsandbytes\libbitsandbytes_cuda121.dll
Running on local URL:

To create a public link, set `share=True` in `launch()`.
bin C:\ProgramData\Anaconda3\envs\llama_factory\lib\site-packages\bitsandbytes\libbitsandbytes_cuda121.dll
05/04/2024 22:19:35 - INFO - llmtuner.hparams.parser - Process rank: 0, device: cuda:0, n_gpu: 1, distributed training: False, compute dtype: torch.float16
C:\ProgramData\Anaconda3\envs\llama_factory\lib\site-packages\huggingface_hub\file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
[INFO|tokenization_utils_base.py:2087] 2024-05-04 22:19:55,214 >> loading file tokenizer.model from cache at C:\Users\Administrator\.cache\huggingface\hub\models--THUDM--chatglm3-6b\snapshots\103caa40027ebfd8450289ca2f278eac4ff26405\tokenizer.model
[INFO|tokenization_utils_base.py:2087] 2024-05-04 22:19:55,215 >> loading file added_tokens.json from cache at None
[INFO|tokenization_utils_base.py:2087] 2024-05-04 22:19:55,215 >> loading file special_tokens_map.json from cache at C:\Users\Administrator\.cache\huggingface\hub\models--THUDM--chatglm3-6b\snapshots\103caa40027ebfd8450289ca2f278eac4ff26405\special_tokens_map.json
[INFO|tokenization_utils_base.py:2087] 2024-05-04 22:19:55,216 >> loading file tokenizer_config.json from cache at C:\Users\Administrator\.cache\huggingface\hub\models--THUDM--chatglm3-6b\snapshots\103caa40027ebfd8450289ca2f278eac4ff26405\tokenizer_config.json
[INFO|tokenization_utils_base.py:2087] 2024-05-04 22:19:55,216 >> loading file tokenizer.json from cache at None
05/04/2024 22:19:55 - INFO - llmtuner.data.template - Add <|user|>,<|observation|> to stop words.
05/04/2024 22:19:55 - INFO - llmtuner.data.template - Cannot add this chat template to tokenizer.
05/04/2024 22:19:55 - INFO - llmtuner.data.loader - Loading dataset alpaca_gpt4_data_zh.json...
05/04/2024 22:19:56 - INFO - llmtuner.data.loader - Loading dataset identity.json...
[INFO|modeling_utils.py:3429] 2024-05-04 22:20:37,556 >> loading weights file model.safetensors from cache at C:\Users\Administrator\.cache\huggingface\hub\models--THUDM--chatglm3-6b\snapshots\103caa40027ebfd8450289ca2f278eac4ff26405\model.safetensors.index.json
[INFO|modeling_utils.py:1494] 2024-05-04 22:20:37,560 >> Instantiating ChatGLMForConditionalGeneration model under default dtype torch.float16.
[INFO|configuration_utils.py:928] 2024-05-04 22:20:37,560 >> Generate config GenerationConfig {
  "eos_token_id": 2,
  "pad_token_id": 0

Expected behavior

The Loss plot should render with data.

System Info

(llama_factory) C:\Users\Administrator\sxf_workspace\LLaMA-Factory>transformers-cli env

- `transformers` version: 4.40.1
- Platform: Windows-10-10.0.23560-SP0
- Python version: 3.10.14
- Huggingface_hub version: 0.23.0
- Safetensors version: 0.4.3
- Accelerate version: 0.30.0
- Accelerate config:    not found
- PyTorch version (GPU?): 2.3.0+cu121 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: <fill in>
- Using distributed or parallel set-up in script?: <fill in>
### Others

hiyouga commented 5 months ago


cfanbo commented 5 months ago



hiyouga commented 5 months ago


cfanbo commented 5 months ago

Besides this place, where else can I see log output? Also, I found this line in the log above:

05/04/2024 22:19:55 - INFO - llmtuner.data.template - Cannot add this chat template to tokenizer.


hiyouga commented 5 months ago


cfanbo commented 5 months ago

Same result. This time I used only the alpaca_gpt4_zh dataset. Could it be related to the cache?

(llama_factory) C:\Users\Administrator\sxf_workspace\LLaMA-Factory>llamafactory-cli train --stage sft --do_train True --model_name_or_path THUDM/chatglm3-6b --finetuning_type lora --template chatglm3 --flash_attn auto --dataset_dir data --dataset alpaca_gpt4_zh --cutoff_len 1024 --learning_rate 5e-05 --num_train_epochs 3.0 --max_samples 100000 --per_device_train_batch_size 2 --gradient_accumulation_steps 8 --lr_scheduler_type cosine --max_grad_norm 1.0 --logging_steps 5 --save_steps 100 --warmup_steps 0 --optim adamw_torch --packing False --report_to none --output_dir savesLM3-6B-Chat_2024-05-04-22-51-59 --fp16 True --lora_rank 8 --lora_alpha 16 --lora_dropout 0 --use_dora True --lora_target all --plot_loss
bin C:\ProgramData\Anaconda3\envs\llama_factory\lib\site-packages\bitsandbytes\libbitsandbytes_cuda121.dll
05/04/2024 22:52:02 - INFO - llmtuner.hparams.parser - Process rank: 0, device: cuda:0, n_gpu: 1, distributed training: False, compute dtype: torch.float16
C:\ProgramData\Anaconda3\envs\llama_factory\lib\site-packages\huggingface_hub\file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
[INFO|tokenization_utils_base.py:2087] 2024-05-04 22:52:22,488 >> loading file tokenizer.model from cache at C:\Users\Administrator\.cache\huggingface\hub\models--THUDM--chatglm3-6b\snapshots\103caa40027ebfd8450289ca2f278eac4ff26405\tokenizer.model
[INFO|tokenization_utils_base.py:2087] 2024-05-04 22:52:22,488 >> loading file added_tokens.json from cache at None
[INFO|tokenization_utils_base.py:2087] 2024-05-04 22:52:22,488 >> loading file special_tokens_map.json from cache at C:\Users\Administrator\.cache\huggingface\hub\models--THUDM--chatglm3-6b\snapshots\103caa40027ebfd8450289ca2f278eac4ff26405\special_tokens_map.json
[INFO|tokenization_utils_base.py:2087] 2024-05-04 22:52:22,488 >> loading file tokenizer_config.json from cache at C:\Users\Administrator\.cache\huggingface\hub\models--THUDM--chatglm3-6b\snapshots\103caa40027ebfd8450289ca2f278eac4ff26405\tokenizer_config.json
[INFO|tokenization_utils_base.py:2087] 2024-05-04 22:52:22,488 >> loading file tokenizer.json from cache at None
Setting eos_token is not supported, use the default one.
Setting pad_token is not supported, use the default one.
Setting unk_token is not supported, use the default one.
05/04/2024 22:52:22 - INFO - llmtuner.data.template - Add <|user|>,<|observation|> to stop words.
05/04/2024 22:52:22 - INFO - llmtuner.data.template - Cannot add this chat template to tokenizer.
05/04/2024 22:52:22 - INFO - llmtuner.data.loader - Loading dataset alpaca_gpt4_data_zh.json...
[INFO|modeling_utils.py:3429] 2024-05-04 22:53:03,848 >> loading weights file model.safetensors from cache at C:\Users\Administrator\.cache\huggingface\hub\models--THUDM--chatglm3-6b\snapshots\103caa40027ebfd8450289ca2f278eac4ff26405\model.safetensors.index.json
[INFO|modeling_utils.py:1494] 2024-05-04 22:53:03,852 >> Instantiating ChatGLMForConditionalGeneration model under default dtype torch.float16.
[INFO|configuration_utils.py:928] 2024-05-04 22:53:03,853 >> Generate config GenerationConfig {
  "eos_token_id": 2,
  "pad_token_id": 0

(llama_factory) C:\Users\Administrator\sxf_workspace\LLaMA-Factory>
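
The run above returns to the prompt without printing any error. One way to collect more evidence in that situation is to launch the command from a small wrapper that captures stdout and stderr and reports the exit code. The following is only a minimal sketch using Python's standard `subprocess` module; the helper name `run_and_report` is hypothetical, and the demo command is a stand-in that exits with code 3, not the real training command.

```python
import subprocess
import sys


def run_and_report(cmd: list[str]) -> int:
    """Run cmd, echo its captured output, and return the process exit code."""
    proc = subprocess.run(cmd, capture_output=True, text=True)
    if proc.stdout:
        print("--- stdout ---")
        print(proc.stdout)
    if proc.stderr:
        print("--- stderr ---")
        print(proc.stderr)
    # A non-zero exit code with empty stderr suggests the process was
    # terminated abnormally rather than raising a Python exception.
    print(f"exit code: {proc.returncode}")
    return proc.returncode


# Stand-in command for illustration only: a child Python process that exits with 3.
exit_code = run_and_report([sys.executable, "-c", "import sys; sys.exit(3)"])
```

To use it on the real run, the training command from the log would be passed as a list of arguments in place of the stand-in.
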
cfanbo commented 5 months ago

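I suspect the cause was insufficient GPU memory, although no error message was shown at all. I later switched to a machine with 22G of GPU memory and training ran normally.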
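
A rough back-of-the-envelope check of the memory hypothesis above. It assumes about 6 billion parameters (read off the "6b" in the model name, not measured) and 2 bytes per parameter for fp16. It counts the weights only; activations, LoRA/DoRA parameters, optimizer state, and CUDA overhead all come on top, so treat the result as a lower bound and not as a measurement.

```python
def weight_memory_gib(num_params: float, bytes_per_param: int = 2) -> float:
    """Lower bound on memory needed for the model weights alone, in GiB."""
    return num_params * bytes_per_param / 1024**3


# Assumption: roughly 6e9 parameters, inferred from the "6b" in the model name.
ASSUMED_PARAMS = 6e9

# fp16 stores each parameter in 2 bytes.
weights_gib = weight_memory_gib(ASSUMED_PARAMS, bytes_per_param=2)
print(f"fp16 weights alone: ~{weights_gib:.1f} GiB")
```

Under these assumptions the weights alone take roughly 11 GiB, which leaves little headroom on a 16G card once training state is added and would be consistent with the run succeeding on the 22G machine.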