hiyouga / LLaMA-Factory

Unified Efficient Fine-Tuning of 100+ LLMs (ACL 2024)
https://arxiv.org/abs/2403.13372
Apache License 2.0

During PPO training the reward suddenly turns negative and the loss spikes #2813

Closed koking0 closed 6 months ago

koking0 commented 7 months ago

Reminder

  • [x] I have read the README and searched the existing issues.

Reproduction

 cd LLaMA-Factory && HF_ENDPOINT=https://hf-mirror.com accelerate launch src/train_bash.py \
     --stage sft \
     --do_train \
     --model_name_or_path codellama/CodeLlama-7b-Python-hf \
     --dataset codealpaca,codeforces_python_submissions_sft \
     --template default \
     --finetuning_type lora \
     --lora_target q_proj,v_proj \
     --output_dir output/sft/test_train \
     --overwrite_cache \
     --per_device_train_batch_size 4 \
     --gradient_accumulation_steps 4 \
     --lr_scheduler_type cosine \
     --logging_steps 10 \
     --save_steps 500 \
     --learning_rate 5e-5 \
     --num_train_epochs 3.0 \
     --plot_loss \
     --fp16

cd LLaMA-Factory && HF_DATASETS_OFFLINE=1 TRANSFORMERS_OFFLINE=1 accelerate launch src/train_bash.py \
   --stage rm \
   --do_train \
   --model_name_or_path codellama/CodeLlama-7b-Python-hf \
   --adapter_name_or_path output/sft/test_train \
   --create_new_adapter \
   --dataset codeforces_python_submissions_rl \
   --template default \
   --finetuning_type lora \
   --lora_target q_proj,v_proj \
   --output_dir output/rm/test_train \
   --per_device_train_batch_size 1 \
   --gradient_accumulation_steps 16 \
   --lr_scheduler_type cosine \
   --logging_steps 10 \
   --save_steps 500 \
   --learning_rate 1e-4 \
   --num_train_epochs 1.0 \
   --plot_loss \
   --fp16

cd LLaMA-Factory && HF_DATASETS_OFFLINE=1 TRANSFORMERS_OFFLINE=1 accelerate launch src/train_bash.py \
    --stage ppo \
    --do_train \
    --model_name_or_path codellama/CodeLlama-7b-Python-hf \
    --adapter_name_or_path output/sft/test_train \
    --create_new_adapter \
    --dataset codealpaca,codeforces_python_submissions_sft \
    --template default \
    --finetuning_type lora \
    --lora_target q_proj,v_proj \
    --reward_model output/rm/test_train \
    --output_dir output/ppo/test_train \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 16 \
    --lr_scheduler_type cosine \
    --top_k 0 \
    --top_p 0.8 \
    --logging_steps 10 \
    --save_steps 100 \
    --learning_rate 1e-5 \
    --num_train_epochs 1.0 \
    --plot_loss \
    --fp16

Expected behavior

During the PPO training stage, the loss suddenly spikes and the reward turns negative. What could be causing this?

[Screenshot: PPO training curves showing the loss spike and the reward dropping below zero]

System Info

No response

Others

No response

lzw-lzw commented 7 months ago

I ran into this problem as well. Have you solved it?


koking0 commented 7 months ago

Not yet. I'm still checking the SFT and RM stages as well.

cabisarri commented 5 months ago

You could do some error analysis: for example, compare the region where the reward starts to drop with the region where it turns negative. One possibility is that the policy model is not learning, so its outputs stay very close to the SFT model's; if the two are nearly identical in the objective, the reward can also drop. You could print the ratio \pi(y|x) / \pi_{sft}(y|x) (see the first sketch below), or look at the advantage values; normally the advantages should be clipped, and the defaults should already include that, but it is worth double-checking.

Another possibility is that your sampling is not producing good enough samples to learn from. I see you are using top-p and top-k; you could try best-of-n instead: generate several candidates and keep only the best one for the model to learn from (see the second sketch below).
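
A minimal sketch of the "print the ratio" suggestion, written against transformers directly rather than LLaMA-Factory's internals. The model path comes from the reproduction commands above; the two from_pretrained calls are placeholders for "policy with the PPO adapter" and "frozen SFT reference" (loading the adapters via peft is omitted):

    # Minimal sketch (not LLaMA-Factory code): compare the log-probability the
    # current PPO policy assigns to a sampled response against the frozen SFT
    # reference model. A log-ratio near 0 means the policy has barely moved
    # away from the SFT model.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    MODEL = "codellama/CodeLlama-7b-Python-hf"  # base model from the commands above
    tok = AutoTokenizer.from_pretrained(MODEL)

    # Placeholders: in practice load the SFT adapter into `ref` and the
    # SFT + PPO adapters into `policy` (e.g. with peft.PeftModel.from_pretrained).
    policy = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.float16, device_map="auto")
    ref = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.float16, device_map="auto")

    def sequence_logprob(model, prompt_ids, response_ids):
        """Sum of token log-probs that `model` assigns to the response given the prompt."""
        input_ids = torch.cat([prompt_ids, response_ids], dim=-1)
        with torch.no_grad():
            logits = model(input_ids).logits[:, :-1, :]          # position i predicts token i+1
        logprobs = torch.log_softmax(logits.float(), dim=-1)
        targets = input_ids[:, 1:]
        token_lp = logprobs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
        return token_lp[:, prompt_ids.shape[1] - 1:].sum(dim=-1)  # keep only the response tokens

    prompt = tok("Write a Python function that reverses a string.", return_tensors="pt").input_ids.to(policy.device)
    out = policy.generate(prompt, max_new_tokens=64, do_sample=True, top_p=0.8, top_k=0)
    response = out[:, prompt.shape[1]:]

    log_ratio = sequence_logprob(policy, prompt, response) - sequence_logprob(ref, prompt, response)
    print("log pi/pi_sft =", log_ratio.item())  # ~0 means the policy output is still essentially the SFT output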
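
And a rough sketch of the best-of-n suggestion: sample several candidates per prompt, score each with the reward model, and keep only the best one. Here score_fn is a hypothetical stand-in for however your reward model is exposed (in LLaMA-Factory the reward model is a LoRA adapter with a value head rather than a standalone classifier), so only the selection loop is shown:

    # Rough best-of-n sketch: generate n candidates, score them, keep the best.
    # `score_fn(prompt, response) -> float` is a placeholder for your reward model.
    def best_of_n(policy, tok, prompt, score_fn, n=4, max_new_tokens=128):
        prompt_ids = tok(prompt, return_tensors="pt").input_ids.to(policy.device)
        candidates = []
        for _ in range(n):
            out = policy.generate(prompt_ids, max_new_tokens=max_new_tokens,
                                  do_sample=True, top_p=0.8, top_k=0)
            text = tok.decode(out[0, prompt_ids.shape[1]:], skip_special_tokens=True)
            candidates.append((score_fn(prompt, text), text))
        return max(candidates, key=lambda pair: pair[0])  # (best_score, best_response)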