yakazimir opened this issue 2 weeks ago

Did you use a special validation set for UltraFeedback when tuning the hyperparameters in Table 7, or just the `test_pref` split from the original binarized UltraFeedback data? I notice that in your on-policy datasets you only include `train` and `test` splits (and that the test splits are exactly the `test_pref` instances).
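For reference, here is a minimal sketch of inspecting the splits in question, assuming this refers to the `HuggingFaceH4/ultrafeedback_binarized` dataset on the Hugging Face Hub (where the preference test split is, to my understanding, named `test_prefs`); the dataset and split names are my reading of the thread, not confirmed by the authors:

```python
# Hedged sketch: list the splits of the binarized UltraFeedback dataset
# and peek at the held-out preference split discussed in this thread.
from datasets import load_dataset

ds = load_dataset("HuggingFaceH4/ultrafeedback_binarized")
print(ds)  # shows the available splits and their sizes

test_prefs = ds["test_prefs"]  # assumed split name for the preference test set
print(len(test_prefs))         # number of held-out preference instances
print(test_prefs[0]["prompt"]) # each instance carries a prompt plus chosen/rejected responses
```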
Hi @yakazimir
Yes, we did include a test split for the UltraFeedback preference data. In our early experiments, we found that win rates on prompts from the test split correlate strongly with win rates on chat benchmarks (e.g., AlpacaEval 2 and Arena-Hard). So in principle, the test split can be used for hyperparameter tuning.
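As an illustration of the kind of correlation check described above, here is a sketch with made-up numbers (the win rates and scores are placeholders, not results from the paper):

```python
# Hedged sketch: do win rates on the UltraFeedback test split track chat
# benchmark scores across hyperparameter settings? All numbers below are
# illustrative placeholders.
from scipy.stats import spearmanr

test_split_win_rate = [0.51, 0.55, 0.63, 0.58, 0.47]  # win rate vs. a fixed baseline, per config
alpaca_eval_2_score = [33.0, 36.2, 42.5, 39.8, 30.1]  # e.g., length-controlled win rate (illustrative)

rho, p_value = spearmanr(test_split_win_rate, alpaca_eval_2_score)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
# A rho near 1 would support using the test split as a cheap tuning proxy.
```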
However, this test split has more instances than AlpacaEval 2 (~800 prompts) or Arena-Hard (~500 prompts), so evaluating on it incurs a higher cost (e.g., LLM-as-judge API calls) for hyperparameter tuning. In practice, we therefore tuned hyperparameters based on the benchmark scores across all the evaluation sets. Given the high correlation between win rates on these benchmarks (e.g., AlpacaEval 2 & Arena-Hard) and human evaluations, the benchmark scores serve as a good proxy for human judgments when tuning hyperparameters.
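Concretely, the selection rule this implies might look like the sketch below. The thread does not specify how scores are aggregated across benchmarks; an unweighted mean is one simple assumption, and the `beta` values and scores are hypothetical placeholders:

```python
# Hedged sketch: pick the hyperparameter setting with the best average score
# across the evaluation benchmarks. All values are placeholders.
scores = {
    "beta=2.0":  {"alpaca_eval_2": 38.1, "arena_hard": 30.2},
    "beta=2.5":  {"alpaca_eval_2": 42.5, "arena_hard": 33.8},
    "beta=10.0": {"alpaca_eval_2": 33.0, "arena_hard": 26.4},
}

best = max(scores, key=lambda cfg: sum(scores[cfg].values()) / len(scores[cfg]))
print(f"Selected setting: {best}")  # -> beta=2.5 under these placeholder scores
```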
Best, Yu