hiyouga / LLaMA-Factory

Unified Efficient Fine-Tuning of 100+ LLMs (ACL 2024)
https://arxiv.org/abs/2403.13372
Apache License 2.0

Error: INFO - llmtuner.model.utils - Failed to load pytorch_model.bin #2177

Closed · Eugene-Zh closed this issue 10 months ago

Eugene-Zh commented 10 months ago


Reproduction

OUTPUT=OUTPUT_PATH
LR=1e-6
mkdir -p $OUTPUT

CUDA_VISIBLE_DEVICES='3' python src/train_bash.py \
    --stage ppo \
    --do_train \
    --model_name_or_path BASE_MODEL_PATH \
    --adapter_name_or_path LORA_CHECKPOINT_PATH \
    --create_new_adapter \
    --dataset step3_train \
    --template baichuan2 \
    --finetuning_type lora \
    --lora_target W_pack \
    --reward_model_type full \
    --reward_model RM_PATH_LORA_EXPORTED \
    --output_dir $OUTPUT \
    --overwrite_output_dir True \
    --per_device_train_batch_size 2 \
    --gradient_accumulation_steps 4 \
    --lr_scheduler_type cosine \
    --top_k 0 \
    --top_p 0.9 \
    --logging_steps 10 \
    --save_steps 1000 \
    --learning_rate $LR \
    --num_train_epochs 1.0 \
    --plot_loss \
    --bf16 \
    2>&1 | tee $OUTPUT/training.log

Expected behavior

I want to run PPO training, but during the process I noticed some abnormal INFO messages. I saw a similar problem in an existing issue, where the fix was to update to the latest code; however, after updating to LLaMA-Factory 0.4.0 the problem above still occurs.

System Info


File information

LORA_CHECKPOINT_PATH
├── adapter_config.json
├── adapter_model.bin
├── optimizer.pt
├── README.md
├── rng_state.pth
├── scheduler.pt
├── special_tokens_map.json
├── tokenization_baichuan.py
├── tokenizer_config.json
├── tokenizer.model
├── trainer_state.json
└── training_args.bin

RM_PATH_LORA_EXPORTED
├── config.json
├── configuration_baichuan.py
├── generation_config.json
├── generation_utils.py
├── modeling_baichuan.py
├── pytorch_model-00001-of-00002.bin
├── pytorch_model-00002-of-00002.bin
├── pytorch_model.bin.index.json
├── quantizer.py
├── special_tokens_map.json
├── tokenization_baichuan.py
├── tokenizer_config.json
└── tokenizer.model

Others

Error messages:

01/13/2024 15:19:09 - INFO - llmtuner.model.utils - Failed to load model.safetensors: /LORA_CHECKPOINT_PATH does not appear to have a file named model.safetensors. Checkout 'https://huggingface.co//LORA_CHECKPOINT_PATH/None' for available files.
01/13/2024 15:19:09 - INFO - llmtuner.model.utils - Failed to load pytorch_model.bin: /LORA_CHECKPOINT_PATH does not appear to have a file named pytorch_model.bin. Checkout 'https://huggingface.co//LORA_CHECKPOINT_PATH' for available files.
01/13/2024 15:19:09 - WARNING - llmtuner.model.utils - Provided path (LORA_CHECKPOINT_PATH) does not contain valuehead weights.
01/13/2024 15:19:09 - INFO - llmtuner.model.loader - trainable params: 6558721 || all params: 13903226881 || trainable%: 0.0472
input_ids:

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]
Loading checkpoint shards:  50%|█████     | 1/2 [00:05<00:05, 5.44s/it]
Loading checkpoint shards: 100%|██████████| 2/2 [00:08<00:00, 3.98s/it]
Loading checkpoint shards: 100%|██████████| 2/2 [00:08<00:00, 4.20s/it]
01/13/2024 15:19:18 - INFO - llmtuner.model.adapter - Adapter is not found at evaluation, load the base model.
01/13/2024 15:19:18 - INFO - llmtuner.model.utils - Failed to load model.safetensors: RM_PATH_LORA_EXPORTED does not appear to have a file named model.safetensors. Checkout 'https://huggingface.co//RM_PATH_LORA_EXPORTED/None' for available files.
01/13/2024 15:19:18 - INFO - llmtuner.model.utils - Failed to load pytorch_model.bin: RM_PATH_LORA_EXPORTED does not appear to have a file named pytorch_model.bin. Checkout 'https://huggingface.co//RM_PATH_LORA_EXPORTED/None' for available files.

Is this because the weight .bin file was too large and was split into two shards, so it can no longer be read?

I also have another question.

You are using an old version of the checkpointing format that is deprecated (We will also silently ignore gradient_checkpointing_kwargs in case you passed it). Please update to the new format on your modeling file. To use the new format, you need to completely remove the definition of the method _set_gradient_checkpointing in your model.

How should this message be handled? Do I need to modify one of the model files, and where is the _set_gradient_checkpointing method defined? I have tried retraining SFT, RM, etc., but this message still appears in new training runs with the new code. I would appreciate further clarification, thank you very much.
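For reference, the method named in the warning most likely lives in the custom modeling code bundled with the checkpoint, i.e. the modeling_baichuan.py visible in the RM_PATH_LORA_EXPORTED listing above; that location is an assumption based on the listing, not something confirmed in this thread. A quick way to check:

```bash
# Hypothetical check: locate the deprecated hook in the bundled modeling file
# (RM_PATH_LORA_EXPORTED is the placeholder path from the listing above).
grep -n "_set_gradient_checkpointing" RM_PATH_LORA_EXPORTED/modeling_baichuan.py

# Per the warning text, updating to the new format means deleting that method
# definition entirely from the modeling file, so that the built-in gradient
# checkpointing logic of transformers' PreTrainedModel is used instead.
```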

hiyouga commented 10 months ago

We recently updated the saving logic; you need to retrain the reward model.
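A minimal sketch of such a retraining run, reusing the model, template, and LoRA settings from the reproduction above; the dataset name, output path, and learning rate are placeholders I introduced here, not values from this issue:

```bash
# Hypothetical sketch: retrain the reward model on the updated code so that
# the value head weights are saved in the new format expected by PPO.
CUDA_VISIBLE_DEVICES='3' python src/train_bash.py \
    --stage rm \
    --do_train \
    --model_name_or_path BASE_MODEL_PATH \
    --dataset rm_pairwise_dataset \
    --template baichuan2 \
    --finetuning_type lora \
    --lora_target W_pack \
    --output_dir NEW_RM_PATH \
    --per_device_train_batch_size 2 \
    --gradient_accumulation_steps 4 \
    --lr_scheduler_type cosine \
    --learning_rate 1e-5 \
    --num_train_epochs 1.0 \
    --bf16
```

The retrained model would then be exported as before and passed via --reward_model in place of RM_PATH_LORA_EXPORTED in the PPO command.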

hiyouga commented 10 months ago

The gradient checkpointing warning mentioned later can simply be ignored.