Hey there! Could you share more details on the pairwise comparison dataframes you are feeding into the compute_mle_elo function? How big are they and what do they contain? Thanks.
Thanks for your prompt response!
The battles dataframe we are using has dimensions of (124,067, 4), containing results from 76 models, with the baseline model being gpt4-0414. We have also uploaded the battles jsonl file: arena_hard_battles.jsonl.zip
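For reference, the file can be loaded and inspected like this (a minimal sketch; the model_a/model_b column names are an assumption about the pairwise schema, not necessarily the file's exact layout):

```python
# Minimal sketch for loading and inspecting the battles file; the
# model_a/model_b column names are assumptions about the pairwise schema.
import pandas as pd

battles = pd.read_json("arena_hard_battles.jsonl", lines=True)
print(battles.shape)             # (124067, 4) in our case
print(battles.columns.tolist())  # the four columns fed to compute_mle_elo
# Number of distinct models appearing on either side of a battle:
print(pd.unique(battles[["model_a", "model_b"]].values.ravel()).size)
```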
@xiamengzhou Hi, is this issue still blocking? Sorry, I was busy with some upcoming releases. It seems like your battles jsonl is really, really big :0. Even I haven't used Arena-Hard-Auto on this many models XD. I think it might be a good idea to try running show_result.py on a smaller batch of models (or one model at a time) and see whether the warning goes away.
Let me know how it goes!
Thank you so much, @CodingWithTim, for the pointer!
I'm curious about the best practice for obtaining the final score for a model. Should I run logistic regression with all the models I have (I noticed you reported scores for many models in the README and by default the script takes all the model evals)? Alternatively, should I use a fixed number of models or even just one? Would different strategies impact the final score?
I think a possible cause is that the logistic regression problem becomes ill-conditioned when some of the models do not have enough samples, so the solver takes more iterations than usual to converge. We might need to add a small regularization term to mitigate this. But usually I find it fine to ignore the warning. We'll investigate more.
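For illustration, here is a minimal sketch of what a weakly regularized Bradley-Terry / MLE Elo fit could look like. This is not the repo's actual compute_mle_elo; the column names (model_a, model_b, winner), the tie handling, and the constants are assumptions:

```python
# Minimal sketch: Bradley-Terry / MLE Elo via logistic regression with a
# weak L2 term so the problem stays well-conditioned even when some models
# have few battles. NOT the repo's compute_mle_elo; ties are dropped for
# simplicity and the column names (model_a, model_b, winner) are assumed.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

def fit_mle_elo(battles, scale=400, base=10, init_rating=1000):
    df = battles[battles["winner"].isin(["model_a", "model_b"])]
    models = pd.unique(df[["model_a", "model_b"]].values.ravel())
    idx = {m: i for i, m in enumerate(models)}

    # One row per battle: +log(base) for model_a, -log(base) for model_b.
    X = np.zeros((len(df), len(models)))
    X[np.arange(len(df)), df["model_a"].map(idx).to_numpy()] = np.log(base)
    X[np.arange(len(df)), df["model_b"].map(idx).to_numpy()] = -np.log(base)
    y = (df["winner"] == "model_a").astype(int).to_numpy()

    # Large C = very weak L2 penalty; raise max_iter so lbfgs can converge.
    lr = LogisticRegression(fit_intercept=False, penalty="l2", C=1e6,
                            max_iter=5000, tol=1e-8)
    lr.fit(X, y)
    return pd.Series(scale * lr.coef_[0] + init_rating, index=models)
```

Shrinking C (i.e., strengthening the L2 penalty) is the knob that trades a tiny bias in the ratings for a better-conditioned optimization; raising max_iter alone often silences the warning too.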
Thanks @infwinston @CodingWithTim! It seems that the win rate and the Elo rating are not affected by the number of models in the batch as long as the baseline model is fixed! Thanks for your kind help. The issue is gone when I just use show_result.py for a single model's verdict.
Hi! Thanks for releasing this amazing benchmark; we have been using it heavily in our development. When I run `show_result.py`, the following warning occurs, and it seems that the logistic regression model failed to converge. Should I be concerned about it?