Hi,
Thanks for the question. SimPO and RRHF both use average log-likelihood as the reward. The main differences are:
- RRHF has an SFT loss whereas SimPO doesn't
- RRHF uses a hinge-like ranking loss (without a margin) whereas SimPO uses the Bradley-Terry ranking loss with a margin
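For concreteness, here is a minimal PyTorch sketch of the two objectives (the helper names `avg_logp`, `pw`, `pl`, and `sft_nll` are illustrative, and the `beta`/`gamma` defaults are placeholders rather than our tuned values):

```python
import torch
import torch.nn.functional as F

def avg_logp(token_logps, mask):
    """Length-normalized (average) log-likelihood of a response.

    token_logps: per-token log-probabilities, shape (batch, seq_len)
    mask: 1 for response tokens, 0 for prompt/padding, same shape
    """
    return (token_logps * mask).sum(-1) / mask.sum(-1)

def simpo_loss(pw, pl, beta=2.0, gamma=0.5):
    """SimPO: Bradley-Terry ranking loss with a target reward margin.

    pw / pl: average log-likelihoods of the winning / losing responses.
    """
    return -F.logsigmoid(beta * (pw - pl) - gamma).mean()

def rrhf_loss(pw, pl, sft_nll):
    """RRHF: hinge-like ranking loss (no margin) plus an SFT term.

    sft_nll: negative log-likelihood of the preferred response (SFT loss).
    """
    ranking = torch.clamp(pl - pw, min=0.0).mean()
    return ranking + sft_nll.mean()
```

The key contrast is the ranking term: a clamped difference for RRHF versus a log-sigmoid with margin `gamma` for SimPO.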
We tested RRHF (with and without the SFT loss) under the Mistral-Base setup and tuned the learning rate over [5e-7, 1e-6, 5e-6, 1e-5]. The best performance of RRHF, measured by AlpacaEval 2 length-controlled win rate (LC) and raw win rate (WR), is as follows:

| Method | LC (%) | WR (%) |
|---|---|---|
| DPO | 15.1 | 12.5 |
| SimPO | 21.5 | 20.8 |
| RRHF | 11.6 | 10.2 |
| RRHF w/o SFT | degenerate | degenerate |
Here, "degenerate" means that RRHF w/o SFT always generates repetitive patterns without following instructions, regardless of the learning rate.
Given these observations, we believe that the Bradley-Terry ranking loss is essential when using average log-likelihood as the reward formulation. It is also possible that hinge-like ranking losses could work with thorough tuning of the margin, and we leave that for future studies.
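For reference, a margin-augmented hinge variant might look like the sketch below (hypothetical and untested here; it reuses `pw`/`pl` from the sketch above, and `gamma` would need the thorough tuning mentioned):

```python
import torch

def hinge_margin_loss(pw, pl, gamma=0.5):
    # Hypothetical variant: penalize a pair unless the winner's average
    # log-likelihood beats the loser's by at least a margin gamma.
    # (Untested; gamma=0.5 is a placeholder, not a tuned value.)
    return torch.clamp(gamma - (pw - pl), min=0.0).mean()
```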
I hope this helps!
Best, Yu
Hi,
Thanks for your answer; it is very insightful!
Best regards, Siheng
Hi,
Very insightful work! I have a question about the relation between RRHF [1] and SimPO. Could you please give a brief introduction?
Thanks!
[1] RRHF: Rank responses to align language models with human feedback. In NeurIPS, 2023.