princeton-nlp / SimPO

[NeurIPS 2024] SimPO: Simple Preference Optimization with a Reference-Free Reward

Question about tuning set #74

Open · yakazimir opened this issue 2 weeks ago

yakazimir commented 2 weeks ago

Did you use a special validation set for UltraFeedback when tuning the hyper-parameters in Table 7, or just the test_prefs split from the original binarized UltraFeedback data? I notice that your on-policy datasets only include train and test splits (and that the test splits are exactly the test_prefs instances).
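For context, a minimal sketch of how I'm inspecting the splits with the Hugging Face `datasets` library (the on-policy dataset id below is just one illustrative example, not necessarily the exact one in question):

```python
from datasets import get_dataset_split_names

# Original binarized UltraFeedback data; its splits should include
# train_prefs / test_prefs among others.
print(get_dataset_split_names("HuggingFaceH4/ultrafeedback_binarized"))

# One of the released on-policy preference datasets (id used here for
# illustration); it appears to expose only train and test splits.
print(get_dataset_split_names("princeton-nlp/llama3-ultrafeedback"))
```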

yumeng5 commented 1 week ago

Hi @yakazimir

Yes, we did include a test split for the UltraFeedback preference data. In our early experiments, we found that the win rates on the prompts from the test split are strongly correlated with the win rates on chat benchmarks (e.g., AlpacaEval 2 and Arena-Hard). So in principle, the test split can be used for hyperparameter tuning.

However, this test split has more instances than AlpacaEval 2 (~800 prompts) and Arena-Hard (~500 prompts) and therefore incurs a higher cost (e.g., for LLM-as-judge API calls) during hyperparameter tuning. In practice, we did hyperparameter tuning based on the benchmark scores across all the evaluation sets. Given the high correlation between the win rates on these benchmarks (e.g., AlpacaEval 2 & Arena-Hard) and human evaluations, these scores serve as a good proxy for human judgments when tuning hyperparameters.
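To make the proxy concrete, here is a minimal sketch of how pairwise win rates over held-out prompts can drive hyperparameter selection. The judging step itself (e.g., an LLM-as-judge API call per prompt) is assumed to happen elsewhere, and all names below are illustrative rather than part of the released codebase:

```python
from collections import Counter

def win_rate(verdicts):
    """verdicts: per-prompt pairwise judgments ('win' / 'tie' / 'loss') for the
    tuned model vs. a fixed baseline. Ties are counted as half a win, a common
    convention for pairwise win rates."""
    counts = Counter(verdicts)
    return (counts["win"] + 0.5 * counts["tie"]) / len(verdicts)

def select_best(config_to_verdicts):
    """Keep the hyperparameter configuration with the highest proxy win rate
    on the held-out prompts (test split or chat benchmark)."""
    return max(config_to_verdicts, key=lambda cfg: win_rate(config_to_verdicts[cfg]))
```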

Best, Yu