lm-sys / arena-hard-auto

Arena-Hard-Auto: An automatic LLM benchmark.

CI results different for same model answer copy #8

Closed qingquansong closed 2 months ago

qingquansong commented 2 months ago

Hey Team,

Thanks for sharing the benchmarks! I'm testing the scripts by simply copying a model answer as well as its judgment and changing the model IDs in each jsonl file, but I got different CI results for the two copies, as shown below:

gpt-4-0613-copy                  | score: 37.9  | 95% CI: (-2.8, 2.7)  | average #tokens: 354
gpt-4-0613                       | score: 37.9  | 95% CI: (-2.7, 3.0)  | average #tokens: 354

Is it related to the bootstrapping? Thanks!

Best regards, QQ

CodingWithTim commented 2 months ago

Hi, interesting find! This is because we only do 100 rounds of bootstrapping, which doesn't give the most precise CI intervals. There is randomness involved in bootstrapping, so the more rounds, the better. We didn't do more because it would take longer for users to generate the score. You can try 500 or even 1000 rounds of bootstrapping; the CI interval should stabilize around then. But 100 rounds is good enough for most model developers. Hopefully this helps!
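(To illustrate the point about bootstrap rounds, here is a minimal sketch, not the repository's actual scoring code: the function `bootstrap_ci` and the synthetic pairwise outcomes are made up for illustration. It shows that with only 100 resamples, two runs with different random seeds give slightly different 95% CIs on identical data, while the intervals agree much more closely with more rounds.)

```python
# Illustrative sketch of percentile bootstrapping for a 95% CI (not the repo's code).
import numpy as np

def bootstrap_ci(outcomes, num_rounds, seed=None):
    """Return a (low, high) 95% CI for the mean of `outcomes` via bootstrapping."""
    rng = np.random.default_rng(seed)
    outcomes = np.asarray(outcomes)
    means = [
        rng.choice(outcomes, size=len(outcomes), replace=True).mean()
        for _ in range(num_rounds)
    ]
    return np.percentile(means, 2.5), np.percentile(means, 97.5)

# Hypothetical pairwise outcomes: 1 = win, 0.5 = tie, 0 = loss.
rng = np.random.default_rng(0)
outcomes = rng.choice([0.0, 0.5, 1.0], size=500, p=[0.5, 0.1, 0.4])

# Two runs with different seeds and only 100 rounds: the CIs differ slightly,
# just like the two identical model-answer copies in the report above.
print(bootstrap_ci(outcomes, num_rounds=100, seed=1))
print(bootstrap_ci(outcomes, num_rounds=100, seed=2))

# With 1000 rounds, the two runs produce nearly identical intervals.
print(bootstrap_ci(outcomes, num_rounds=1000, seed=1))
print(bootstrap_ci(outcomes, num_rounds=1000, seed=2))
```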

qingquansong commented 2 months ago

Thank you for the rapid response!