princeton-nlp / SimPO

SimPO: Simple Preference Optimization with a Reference-Free Reward

Question about RRHF #5

Closed SihengLi99 closed 6 hours ago

SihengLi99 commented 1 month ago

Hi,

Very insightful work! I have a question about the relation between RRHF [1] and SimPO. Could you please give a brief comparison?

Thanks!

[1] RRHF: Rank responses to align language models with human feedback. In NeurIPS, 2023.

yumeng5 commented 1 month ago

Hi,

Thanks for the question. SimPO and RRHF both use the average log-likelihood as the reward. The main differences are:

- RRHF has an SFT loss whereas SimPO doesn't.
- RRHF uses a hinge-like ranking loss (without a margin) whereas SimPO uses the Bradley-Terry ranking loss with a margin.

We tested RRHF (with and without the SFT loss) under the Mistral-Base setup and tuned the learning rate over [5e-7, 1e-6, 5e-6, 1e-5]. The best performance of RRHF, measured by AlpacaEval 2 length-controlled win rate (LC) and raw win rate (WR), is as follows:

| Method | LC (%) | WR (%) |
| --- | --- | --- |
| DPO | 15.1 | 12.5 |
| SimPO | 21.5 | 20.8 |
| RRHF | 11.6 | 10.2 |
| RRHF w/o SFT | degenerate | degenerate |

Here, "RRHF w/o SFT" always generates repetitive patterns without following the instructions, regardless of the learning rate.

Given these observations, we believe that the Bradley-Terry ranking loss is essential when using average log-likelihood as the reward formulation. It is also possible that hinge-like ranking losses could work with thorough tuning of the margin, and we leave that for future studies.
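In case a concrete sketch helps, below is a minimal PyTorch-style illustration of the two objectives under the shared average-log-likelihood reward. This is not code from either repository; the function names and the `beta`/`gamma`/`sft_weight` defaults are illustrative assumptions, not the papers' exact hyperparameters.

```python
import torch
import torch.nn.functional as F

def avg_logp(token_logps: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Average (length-normalized) log-likelihood of each response.

    token_logps: (batch, seq_len) per-token log-probs under the policy.
    mask:        (batch, seq_len) 1 for response tokens, 0 elsewhere.
    """
    return (token_logps * mask).sum(-1) / mask.sum(-1)

def simpo_loss(chosen_avg, rejected_avg, beta=2.0, gamma=0.5):
    # Bradley-Terry ranking loss with a target reward margin gamma;
    # the (reference-free) reward is beta * average log-likelihood.
    logits = beta * (chosen_avg - rejected_avg) - gamma
    return -F.logsigmoid(logits).mean()

def rrhf_style_loss(chosen_avg, rejected_avg, chosen_sft_nll=None, sft_weight=1.0):
    # Hinge-like ranking loss without a margin: only penalizes pairs where
    # the rejected response scores at least as high as the chosen one.
    loss = torch.relu(rejected_avg - chosen_avg).mean()
    if chosen_sft_nll is not None:
        # RRHF additionally keeps a standard SFT (cross-entropy) term on the
        # chosen response; dropping it ("RRHF w/o SFT") is the degenerate
        # setting in the table above.
        loss = loss + sft_weight * chosen_sft_nll.mean()
    return loss
```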

I hope this helps!

Best, Yu

SihengLi99 commented 1 month ago

Hi,

Thanks for your answer; it is very insightful!

Best regards, Siheng
