Hey Team,

Thanks for sharing the benchmarks! I'm testing the scripts by simply copying a model answer along with its judgment and changing the model IDs in each jsonl file, but I got different CI results across runs, as shown below:

Is it related to the bootstrapping? Thanks!

Best regards, QQ

Hi, interesting find! This is because we only run 100 rounds of bootstrapping, which doesn't give the most precise confidence intervals. Bootstrapping involves randomness, so the more rounds the better. We didn't use more rounds because it would take longer for users to generate the scores. You can try 500 or even 1000 rounds of bootstrapping; the CI should stabilize by then. But 100 rounds is good enough for most model developers. Hope this helps!
Thank you for the rapid response!

qingquansong closed this 7 months ago.