sail-sg / oat

🌾 OAT: Online AlignmenT for LLMs
https://arxiv.org/pdf/2411.01493
Apache License 2.0

Reproducing the results of SimPO and DPO #10

Open lucasliunju opened 5 days ago

lucasliunju commented 5 days ago

Hi,

Thanks for your great work.

I tried to reproduce the results of offline DPO and offline SimPO, and I found that the reproduced results are better than those in the paper. For example, for the DPO results in Figure 5, my reproduced result is 76, which is better than the paper's (about 70). The reproduced offline SimPO results show a similar phenomenon: my result is 79, which is higher than the number reported in this repo's README. I would like to ask whether this is normal.

Thank you in advance!

Lucas

lkevinzc commented 3 days ago

Hi Lucas @lucasliunju,

Thank you for your interest!

May I know which preference oracle you used? All results in the paper are based on the setting where Skywork-Reward-Llama-3.1-8B is used as the preference oracle, while in the README I used llm-blender/PairRM to give a more lightweight example for quick experimentation.

Switching from llm-blender/PairRM to Skywork-Reward-Llama-3.1-8B is necessary when we apply strong online alignment algorithms, because they are more likely to exploit the preference oracle (which may lead to "oracle hacking"). Note that an ideal preference oracle should be a population, but we use strong reward models (or GPT-as-a-judge in Section 6.4) to simulate it for the sake of accessibility. Here are some examples of using Skywork-Reward-Llama-3.1-8B or gpt-4o-mini as preference oracles.
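For intuition, a reward model can simulate a pairwise preference oracle roughly as sketched below. This is a minimal sketch using the standard Hugging Face sequence-classification interface, not oat's actual oracle implementation; the helper names (`reward`, `prefer_first`) are illustrative only.

```python
# Minimal sketch (not oat's oracle code): use a reward model such as
# Skywork/Skywork-Reward-Llama-3.1-8B as a pairwise preference oracle
# by scoring both candidate responses and preferring the higher-scored one.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_NAME = "Skywork/Skywork-Reward-Llama-3.1-8B"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME, torch_dtype=torch.bfloat16, device_map="auto", num_labels=1
)


@torch.no_grad()
def reward(prompt: str, response: str) -> float:
    """Score a single (prompt, response) pair with the reward model."""
    conv = [
        {"role": "user", "content": prompt},
        {"role": "assistant", "content": response},
    ]
    input_ids = tokenizer.apply_chat_template(
        conv, tokenize=True, return_tensors="pt"
    ).to(model.device)
    return model(input_ids).logits[0][0].item()


def prefer_first(prompt: str, response_a: str, response_b: str) -> bool:
    """Simulated preference oracle: True if response_a is preferred over response_b."""
    return reward(prompt, response_a) > reward(prompt, response_b)
```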

Please also find the reproduced curves on wandb: https://wandb.ai/lkevinzc/oat-llm (under the filter "reward_oracle:remote"), where you can see that the learning curves closely follow those in the paper.

Best regards, Zichen