princeton-nlp / SimPO

SimPO: Simple Preference Optimization with a Reference-Free Reward

reward/chosen is decreasing #42

Open zhangguoxin1 opened 1 month ago

zhangguoxin1 commented 1 month ago

[screenshot: training curves showing reward/chosen decreasing]

Hi! I am fine-tuning LLaMA-3 on the hh-rlhf dataset using SimPO and noticed that reward/chosen is decreasing. Is this reasonable?

```yaml
# SimPOTrainer arguments
bf16: true
beta: 2.5
gamma: 1.4
per_device_train_batch_size: 2
per_device_eval_batch_size: 4
do_eval: true
eval_strategy: steps
eval_steps: 500
gradient_accumulation_steps: 8
gradient_checkpointing: true
gradient_checkpointing_kwargs:
  use_reentrant: False
learning_rate: 5.0e-5
num_train_epochs: 1
log_level: info
logging_steps: 5
lr_scheduler_type: cosine
max_length: 2048
max_prompt_length: 1800
optim: adamw_torch
output_dir: outputs/llama-3-8b-instruct-simpo-hh
run_name: llama-3-8b-instruct-simpo-hh
force_use_ref_model: True
push_to_hub: false
save_strategy: "steps"
save_steps: 500
remove_unused_columns: False
save_total_limit: 20
seed: 42
warmup_ratio: 0.1
```

zhangguoxin1 commented 1 month ago

I expected reward/chosen to increase, but since the goal of SimPO is to maximize the margin between reward/chosen and reward/rejected, it is acceptable for reward/chosen to decrease to some extent. However, the drop in reward/chosen looks rather large compared to the margin (reward/chosen - reward/rejected).

yumeng5 commented 1 month ago

Hi,

Yes, this is reasonable. The reward margin should increase, but the reward on chosen responses may decrease slightly (while the reward on rejected responses decreases more rapidly). In general, we don't want the reward on chosen responses to decrease too much, as that implies the likelihood of the chosen responses is decreasing; you may use a larger beta or a lower learning rate to mitigate the decrease.
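For intuition, here is a minimal sketch of the SimPO objective (variable names are illustrative, not the trainer's exact internals): the reward is the beta-scaled, length-normalized log-likelihood of a response, and the loss only pushes the chosen-vs-rejected margin above gamma, so reward/chosen is allowed to drift down as long as reward/rejected falls faster.

```python
import torch.nn.functional as F

def simpo_loss(chosen_logps, rejected_logps, chosen_lens, rejected_lens,
               beta=2.5, gamma=1.4):
    """Sketch of the SimPO loss; inputs are summed token log-probs and response lengths."""
    # Reference-free rewards: beta-scaled average per-token log-probability.
    reward_chosen = beta * chosen_logps / chosen_lens
    reward_rejected = beta * rejected_logps / rejected_lens

    # Only the margin (relative to the target margin gamma) enters the loss,
    # so reward/chosen may decrease as long as reward/rejected decreases faster.
    loss = -F.logsigmoid(reward_chosen - reward_rejected - gamma)
    return loss.mean(), reward_chosen.detach(), reward_rejected.detach()
```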

Best, Yu

zhangguoxin1 commented 1 month ago

Got it!

Thanks for the quick reply.

zhangguoxin1 commented 3 weeks ago

Hi, I used SimPO on my own task with Qwen2-7B (approximately 40,000 training examples), but the model generates repeated sentences and regurgitates pre-training data. The parameters are as follows:

```yaml
pref_beta: 2.5
simpo_gamma: 1.0
learning_rate: 1.0e-6
num_train_epochs: 3.0
```


I'm also trying a larger beta = 8.0.
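As a side note, one rough way to quantify the repetition (a simple heuristic sketch, not something from the SimPO codebase) is the fraction of duplicated n-grams in the generations, tracked across checkpoints to see when the degeneration starts:

```python
def repeated_ngram_ratio(text: str, n: int = 4) -> float:
    """Fraction of duplicated n-grams in a generation (0 = no repetition, ~1 = heavy looping)."""
    tokens = text.split()
    if len(tokens) < n:
        return 0.0
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return 1.0 - len(set(ngrams)) / len(ngrams)
```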

xiamengzhou commented 2 weeks ago

@zhangguoxin1 I think you should be using Qwen2-7B-Instruct rather than Qwen2-7B if you are only running preference optimization. I'd also suggest using on-policy data generated by the model you are training rather than offline data generated by other models.
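To make the on-policy suggestion concrete, here is a hedged sketch of one way to build such pairs: sample several responses from the current policy and rank them with an external reward model. `score_fn` is a placeholder (not part of the SimPO codebase), and the model name is only an assumption.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2-7B-Instruct"  # assumption: the instruct variant
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto"
)

def sample_responses(prompt, k=5, max_new_tokens=512):
    """Sample k candidate responses from the current policy."""
    messages = [{"role": "user", "content": prompt}]
    input_ids = tok.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    outputs = model.generate(
        input_ids, do_sample=True, temperature=0.8, top_p=0.95,
        num_return_sequences=k, max_new_tokens=max_new_tokens,
    )
    return [tok.decode(o[input_ids.shape[1]:], skip_special_tokens=True) for o in outputs]

def build_pair(prompt, score_fn, k=5):
    """Rank on-policy samples with an external reward model (score_fn is a placeholder)."""
    candidates = sorted(sample_responses(prompt, k), key=score_fn, reverse=True)
    return {"prompt": prompt, "chosen": candidates[0], "rejected": candidates[-1]}
```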

yumeng5 commented 2 weeks ago

Hi @zhangguoxin1

In addition to the suggestions by Mengzhou, you may try the following as well:

Best, Yu