hiyouga / LLaMA-Factory

Unified Efficient Fine-Tuning of 100+ LLMs (ACL 2024)
https://arxiv.org/abs/2403.13372
Apache License 2.0

Negative loss during PPO training #3185

Closed BIT-Xu closed 7 months ago

BIT-Xu commented 7 months ago

Reminder

Reproduction

export USE_MODELSCOPE_HUB=1

nohup sh ppo_qwen.sh > ppo_qwen.log 2>&1 &

CUDA_VISIBLE_DEVICES=0 python src/train_bash.py \
    --seed 38 \
    --stage ppo \
    --do_train \
    --model_name_or_path Qwen/Qwen1.5-72B-Chat \
    --adapter_name_or_path NNVV_classify_reason/sft_qwen15_72B_1_1/checkpoint-1000 \
    --dataset NNVV_train,NNVV_reason_train \
    --template qwen \
    --finetuning_type lora \
    --lora_target c_attn \
    --cutoff_len 3000 \
    --max_new_tokens 2000 \
    --reward_model_type api \
    --reward_model http://0.0.0.0:8000/v1/score/evaluation \
    --output_dir PPO \
    --overwrite_output_dir \
    --overwrite_cache \
    --per_device_train_batch_size 2 \
    --gradient_accumulation_steps 4 \
    --lr_scheduler_type cosine \
    --logging_steps 1 \
    --save_steps 10 \
    --learning_rate 1e-5 \
    --num_train_epochs 5 \
    --plot_loss \
    --top_k 0 \
    --top_p 0.9 \
    --fp16 \
    --quantization_bit 4 \
    --do_sample True
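
For context, --reward_model_type api makes the trainer fetch scalar rewards from the given HTTP endpoint instead of loading a local reward model. A minimal client-side illustration of such a call follows; the request and response fields shown are assumptions for illustration, not the documented schema of this endpoint:

    import requests

    # hypothetical payload; the schema expected by the actual endpoint may differ
    payload = {"model": "reward", "messages": ["response text to score"]}
    resp = requests.post("http://0.0.0.0:8000/v1/score/evaluation", json=payload, timeout=30)
    # assumed to return scalar scores, e.g. {"scores": [1.37]}
    print(resp.json())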

Expected behavior

[screenshot: training log showing negative loss values during PPO]

System Info

No response

Others

What is the likely reason for the loss being negative? Does it affect training quality? The model's performance drops sharply after PPO, and I am not sure whether this is the cause.

BIT-Xu commented 7 months ago

I applied a sigmoid to the scores output by the reward model; I am not sure whether that has any impact.
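
For illustration, a minimal sketch of such a sigmoid wrapper (the raw_scores values are made up; only the squashing step is what the comment describes):

    import math

    def sigmoid(x: float) -> float:
        # squash a raw reward-model score into (0, 1)
        return 1.0 / (1.0 + math.exp(-x))

    # hypothetical raw scores returned by the reward endpoint
    raw_scores = [2.3, -0.7, 0.1]
    rewards = [sigmoid(s) for s in raw_scores]
    print(rewards)  # all values now lie in (0, 1)

Note that the sigmoid maps every score into (0, 1) and compresses differences between large raw scores.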

hiyouga commented 7 months ago

Negative values do not affect training.
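
For context, a negative value is expected from the PPO objective itself: the policy loss is the negated clipped surrogate, so it is negative whenever the surrogate is positive (for example, with positive advantages). A minimal sketch using generic PPO math, not the exact TRL/LLaMA-Factory code:

    import torch

    def ppo_policy_loss(ratio, advantages, clip_eps=0.2):
        # clipped surrogate objective: maximize min(r*A, clip(r)*A);
        # the reported loss is its negation, so it can be negative
        unclipped = ratio * advantages
        clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
        return -torch.min(unclipped, clipped).mean()

    ratio = torch.tensor([1.05, 0.98, 1.10])
    advantages = torch.tensor([0.8, 1.2, 0.5])
    print(ppo_policy_loss(ratio, advantages))  # negative, since the surrogate is positive here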