lucasliunju opened 5 days ago
Hi Lucas @lucasliunju,
Thank you for your interest!
May I know which preference oracle you used? All results in the paper are based on the setting where Skywork-Reward-Llama-3.1-8B is used as the preference oracle, while in the README I used llm-blender/PairRM to show a more lightweight example for quick experimentation.
Switching from llm-blender/PairRM to Skywork-Reward-Llama-3.1-8B is necessary when we apply strong online alignment algorithms, because they are more likely to exploit the preference oracle (which may lead to "oracle hacking"). Note that an ideal preference oracle should be a population, but we use strong reward models (or GPT-as-a-judge in Section 6.4) to simulate it for the sake of accessibility. Here are some examples of using Skywork-Reward-Llama-3.1-8B or gpt-4o-mini as the preference oracle.
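For reference, here is a minimal sketch of querying Skywork-Reward-Llama-3.1-8B as a pairwise preference oracle through Hugging Face transformers. The `score` and `prefer` helpers are hypothetical names for illustration only (they are not part of oat's API):

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Skywork-Reward-Llama-3.1-8B is a sequence-classification reward model that
# returns a scalar score for a (prompt, response) conversation.
model_name = "Skywork/Skywork-Reward-Llama-3.1-8B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto", num_labels=1
)

def score(prompt: str, response: str) -> float:
    conv = [{"role": "user", "content": prompt},
            {"role": "assistant", "content": response}]
    input_ids = tokenizer.apply_chat_template(
        conv, tokenize=True, return_tensors="pt"
    ).to(model.device)
    with torch.no_grad():
        return model(input_ids).logits[0][0].item()

def prefer(prompt: str, response_a: str, response_b: str) -> str:
    # Simulate the preference oracle: the higher-scoring response is preferred.
    if score(prompt, response_a) >= score(prompt, response_b):
        return response_a
    return response_b
```

And a similarly hedged sketch of using gpt-4o-mini as a pairwise judge via the OpenAI SDK; the judge prompt below is a hypothetical minimal template, not necessarily the one used in Section 6.4:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical minimal judge template for illustration.
JUDGE_TEMPLATE = (
    "Given a user prompt and two candidate responses, reply with a single letter:\n"
    "'A' if Response A is better, 'B' if Response B is better.\n\n"
    "Prompt: {prompt}\n\nResponse A: {a}\n\nResponse B: {b}"
)

def judge(prompt: str, a: str, b: str) -> str:
    completion = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user",
                   "content": JUDGE_TEMPLATE.format(prompt=prompt, a=a, b=b)}],
        max_tokens=1,
        temperature=0.0,
    )
    verdict = completion.choices[0].message.content.strip().upper()
    return a if verdict.startswith("A") else b
```

In practice you would also randomize the A/B order across calls to mitigate position bias in the judge.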
Please also find the reproduced curves on wandb: https://wandb.ai/lkevinzc/oat-llm (under the filter "reward_oracle:remote"), where you can see that the learning curves closely follow those in the paper.
Best regards, Zichen
Hi,
Thanks for your great work.
I tried to reproduce the results of offline DPO and offline SimPO, and I found that the reproduced results are better than those in the paper. For example, for the DPO results in Figure 5, my reproduced result is 76, which is better than the result in the paper (about 70). The reproduced offline SimPO results show a similar phenomenon: my result is 79, which is higher than the result in this repo's README. I would like to ask whether this is normal.
Thank you in advance!
Lucas