zhangguoxin1 opened this issue 4 months ago
I expected reward/chosen to increase, but since the goal of SimPO is to maximize the margin between reward/chosen and reward/rejected, some decrease in reward/chosen is acceptable. However, the decrease in reward/chosen seems quite large relative to the margin reward/chosen - reward/rejected.
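For context, my understanding of the objective (roughly following the paper, not the exact code in this repo, and with made-up variable names) is that the logged rewards are the beta-scaled, length-normalized log-probabilities, and the loss only constrains their difference:

```python
import torch.nn.functional as F

# Rough sketch of the SimPO objective as I understand it (not the repo's exact code).
def simpo_loss(chosen_logps, rejected_logps, chosen_len, rejected_len,
               beta=2.5, gamma=1.0):
    # reward/chosen and reward/rejected: beta-scaled, length-normalized
    # log-probabilities of the responses under the current policy
    reward_chosen = beta * chosen_logps / chosen_len
    reward_rejected = beta * rejected_logps / rejected_len
    # the loss only pushes the margin above gamma, so reward/chosen is free
    # to drift down as long as reward/rejected drops faster
    loss = -F.logsigmoid(reward_chosen - reward_rejected - gamma)
    return loss.mean(), reward_chosen, reward_rejected
```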
Hi,
Yes, this is reasonable. The reward margin should increase, but the reward on chosen responses may slightly decrease (and the reward on rejected responses decreases more rapidly). In general, we don't want the reward on chosen responses to decrease too much, as that implies the likelihood of the chosen responses is decreasing. You may use a larger beta or a lower learning rate to mitigate the decrease in the reward on chosen responses.
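For reference (this is just the formulation from the paper), the reward here is the length-normalized log-likelihood scaled by $\beta$, and the loss only penalizes margins below $\gamma$:

$$r(x, y) = \frac{\beta}{|y|}\,\log \pi_\theta(y \mid x), \qquad \mathcal{L}_{\mathrm{SimPO}} = -\log \sigma\big(r(x, y_w) - r(x, y_l) - \gamma\big)$$

Nothing in the loss anchors $r(x, y_w)$ directly, which is why it can drift down as long as the margin keeps growing.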
Best, Yu
Got it!
Thanks for the quick reply.
Hi, I used SimPO on my task with Qwen2-7B (approximately 40,000 data entries), but the model generates repeated sentences and pre-training data. The parameters are as follows:
pref_beta: 2.5
simpo_gamma: 1.0
learning_rate: 1.0e-6
num_train_epochs: 3.0
and I'm now trying a larger beta = 8.0.
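Concretely, the next run keeps everything else the same and only raises the beta:

```yaml
pref_beta: 8.0        # raised from 2.5
simpo_gamma: 1.0
learning_rate: 1.0e-6
num_train_epochs: 3.0
```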
@zhangguoxin1 I think you should be using Qwen2-7B-Instruct rather than Qwen2-7B if you are only running preference optimization. Also, I'd suggest using online data rather than offline data generated by other models.
Hi @zhangguoxin1
In addition to the suggestions by Mengzhou, you may try the following as well:
Best, Yu
Hi! I am fine-tuning LLaMA3 on the hh-rlhf dataset using SimPO and noticed that reward/chosen is decreasing. Is this reasonable?

```yaml
# SimPOTrainer arguments
bf16: true
beta: 2.5
gamma: 1.4
per_device_train_batch_size: 2
per_device_eval_batch_size: 4
do_eval: true
eval_strategy: steps
eval_steps: 500
gradient_accumulation_steps: 8
gradient_checkpointing: true
gradient_checkpointing_kwargs:
  use_reentrant: False
learning_rate: 5.0e-5
num_train_epochs: 1
log_level: info
logging_steps: 5
lr_scheduler_type: cosine
max_length: 2048
max_prompt_length: 1800
optim: adamw_torch
output_dir: outputs/llama-3-8b-instruct-simpo-hh
run_name: llama-3-8b-instruct-simpo-hh
force_use_ref_model: True
push_to_hub: false
save_strategy: "steps"
save_steps: 500
remove_unused_columns: False
save_total_limit: 20
seed: 42
warmup_ratio: 0.1
```