lucasliunju opened 5 days ago
Hi Lucas @lucasliunju,
Thank you for your interest!
May I know which preference oracle you used? All results in the paper are based on the setting where Skywork-Reward-Llama-3.1-8B is used as the preference oracle, while in the README I used llm-blender/PairRM to show a more lightweight example for quick experimentation.
Switching from llm-blender/PairRM to Skywork-Reward-Llama-3.1-8B is necessary when we apply strong online alignment algorithms, because they are more likely to exploit the preference oracle (which may lead to "oracle hacking"). Note that an ideal preference oracle should be a population, but we use strong reward models (or GPT-as-a-judge in Section 6.4) to simulate it for the sake of accessibility. Here are some examples of using Skywork-Reward-Llama-3.1-8B or gpt-4o-mini as the preference oracle.
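For reference, here is a minimal sketch of querying Skywork-Reward-Llama-3.1-8B as a pairwise preference oracle through Hugging Face transformers. The `score` and `prefer` helpers are hypothetical names for illustration only (they are not part of oat's API):

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Skywork-Reward-Llama-3.1-8B is a sequence-classification reward model that
# returns a scalar score for a (prompt, response) conversation.
model_name = "Skywork/Skywork-Reward-Llama-3.1-8B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto", num_labels=1
)

def score(prompt: str, response: str) -> float:
    conv = [{"role": "user", "content": prompt},
            {"role": "assistant", "content": response}]
    input_ids = tokenizer.apply_chat_template(
        conv, tokenize=True, return_tensors="pt"
    ).to(model.device)
    with torch.no_grad():
        return model(input_ids).logits[0][0].item()

def prefer(prompt: str, response_a: str, response_b: str) -> str:
    # Simulate the preference oracle: the higher-scoring response is preferred.
    if score(prompt, response_a) >= score(prompt, response_b):
        return response_a
    return response_b
```

And a similarly hedged sketch of using gpt-4o-mini as a pairwise judge via the OpenAI SDK; the judge prompt below is a hypothetical minimal template, not necessarily the one used in Section 6.4:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical minimal judge template for illustration.
JUDGE_TEMPLATE = (
    "Given a user prompt and two candidate responses, reply with a single letter:\n"
    "'A' if Response A is better, 'B' if Response B is better.\n\n"
    "Prompt: {prompt}\n\nResponse A: {a}\n\nResponse B: {b}"
)

def judge(prompt: str, a: str, b: str) -> str:
    completion = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user",
                   "content": JUDGE_TEMPLATE.format(prompt=prompt, a=a, b=b)}],
        max_tokens=1,
        temperature=0.0,
    )
    verdict = completion.choices[0].message.content.strip().upper()
    return a if verdict.startswith("A") else b
```

In practice you would also randomize the A/B order across calls to mitigate position bias in the judge.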
Please also find the reproduced curves on wandb: https://wandb.ai/lkevinzc/oat-llm (under the filter "reward_oracle:remote"), where you can see that the learning curves closely follow those in the paper.
Best regards, Zichen
Hi,
Thanks for your great work.
I tried to reproduce the results of offline DPO and offline SimPO, and I found that the reproduced results are better than those in the paper. For example, for the DPO results in Figure 5, my reproduced result is 76, which is better than the result in the paper (about 70). The reproduced offline SimPO results show a similar phenomenon: my result is 79, which is higher than the result in this repo's README. I would like to ask whether this is normal.
Thank you in advance!
Lucas