yakazimir opened this issue 2 weeks ago

Did you use a special validation set for UltraFeedback when tuning the hyperparameters in Table 7, or just the `test_pref` split from the original binarized UltraFeedback data? I notice that in your on-policy datasets you only include `train` and `test` splits (and that the test splits are exactly the `test_pref` instances).
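For reference, here is a minimal sketch of inspecting the splits in question, assuming this refers to the `HuggingFaceH4/ultrafeedback_binarized` dataset on the Hugging Face Hub (where the preference test split is, to my understanding, named `test_prefs`); the dataset and split names are my reading of the thread, not confirmed by the authors:

```python
# Hedged sketch: list the splits of the binarized UltraFeedback dataset
# and peek at the held-out preference split discussed in this thread.
from datasets import load_dataset

ds = load_dataset("HuggingFaceH4/ultrafeedback_binarized")
print(ds)  # shows the available splits and their sizes

test_prefs = ds["test_prefs"]  # assumed split name for the preference test set
print(len(test_prefs))         # number of held-out preference instances
print(test_prefs[0]["prompt"]) # each instance carries a prompt plus chosen/rejected responses
```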
Hi @yakazimir
Yes, we did include a test split for the UltraFeedback preference data. In our early experiments, we found that win rates on prompts from the test split correlate strongly with win rates on chat benchmarks (e.g., AlpacaEval 2 and Arena-Hard). So in principle, the test split can be used for hyperparameter tuning.
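As an illustration of the kind of correlation check described above, here is a sketch with made-up numbers (the win rates and scores are placeholders, not results from the paper):

```python
# Hedged sketch: do win rates on the UltraFeedback test split track chat
# benchmark scores across hyperparameter settings? All numbers below are
# illustrative placeholders.
from scipy.stats import spearmanr

test_split_win_rate = [0.51, 0.55, 0.63, 0.58, 0.47]  # win rate vs. a fixed baseline, per config
alpaca_eval_2_score = [33.0, 36.2, 42.5, 39.8, 30.1]  # e.g., length-controlled win rate (illustrative)

rho, p_value = spearmanr(test_split_win_rate, alpaca_eval_2_score)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
# A rho near 1 would support using the test split as a cheap tuning proxy.
```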
However, this test split has more instances than AlpacaEval 2 (~800 prompts) or Arena-Hard (~500 prompts), so evaluating on it incurs a higher cost (e.g., LLM-as-judge API calls) for hyperparameter tuning. In practice, we therefore tuned hyperparameters based on the benchmark scores across all the evaluation sets. Given the high correlation between win rates on these benchmarks (e.g., AlpacaEval 2 & Arena-Hard) and human evaluations, the benchmark scores serve as a good proxy for human judgments when tuning hyperparameters.
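Concretely, the selection rule this implies might look like the sketch below. The thread does not specify how scores are aggregated across benchmarks; an unweighted mean is one simple assumption, and the `beta` values and scores are hypothetical placeholders:

```python
# Hedged sketch: pick the hyperparameter setting with the best average score
# across the evaluation benchmarks. All values are placeholders.
scores = {
    "beta=2.0":  {"alpaca_eval_2": 38.1, "arena_hard": 30.2},
    "beta=2.5":  {"alpaca_eval_2": 42.5, "arena_hard": 33.8},
    "beta=10.0": {"alpaca_eval_2": 33.0, "arena_hard": 26.4},
}

best = max(scores, key=lambda cfg: sum(scores[cfg].values()) / len(scores[cfg]))
print(f"Selected setting: {best}")  # -> beta=2.5 under these placeholder scores
```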
Best, Yu