lm-sys / arena-hard-auto

Arena-Hard-Auto: An automatic LLM benchmark.
Apache License 2.0

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT. #27

Open xiamengzhou opened 3 weeks ago

xiamengzhou commented 3 weeks ago

Hi! Thanks for releasing this amazing benchmark; we have been using it heavily in our development. When I run show_result.py, the following warning occurs, and it seems that the logistic regression model failed to converge. Should I be concerned about it?

/scratch/gpfs/mengzhou/anaconda3/envs/handbook/lib/python3.10/site-packages/sklearn/linear_model/_logistic.py:469: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
bootstrap:   2%|████▍                         | 4/200 [00:07<05:28,  1.68s/it]
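For reference, the warning itself suggests raising max_iter. Here is a minimal self-contained toy sketch of what that fix looks like, assuming compute_mle_elo fits a Bradley-Terry model via scikit-learn's LogisticRegression as the traceback suggests (the actual code in show_result.py may differ):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Toy pairwise battles: 3 models, ~200 rows. One column per model;
# +1 marks the first model in a battle, -1 the second.
n_models = 3
pairs = rng.integers(0, n_models, size=(200, 2))
pairs = pairs[pairs[:, 0] != pairs[:, 1]]  # drop self-battles
X = np.zeros((len(pairs), n_models))
X[np.arange(len(pairs)), pairs[:, 0]] = 1.0
X[np.arange(len(pairs)), pairs[:, 1]] = -1.0
strength = np.array([0.0, 1.0, 2.0])  # ground-truth ratings
y = rng.random(len(pairs)) < 1 / (1 + np.exp(-X @ strength))

# The fix the warning suggests: raise max_iter above the lbfgs default of 100.
lr = LogisticRegression(fit_intercept=False, max_iter=1000)
lr.fit(X, y)
print(lr.coef_)  # recovered relative strengths
```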
CodingWithTim commented 3 weeks ago

Hey there! Could you share more details on the pairwise comparison dataframe you are feeding into the compute_mle_elo function? How big is it and what does it contain? Thanks.

xiamengzhou commented 3 weeks ago

Thanks for your prompt response!

The battles dataframe we are using has shape (124067, 4), containing results from 76 models, with the baseline model being gpt4-0414. We also uploaded the battles jsonl file: arena_hard_battles.jsonl.zip
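For anyone reproducing this, here is roughly how such a file can be summarized; the model_a / model_b / winner column names are my assumption about the schema, not something confirmed in this thread:

```python
import pandas as pd

# Hypothetical inspection of the uploaded battles file; assumes
# model_a / model_b / winner columns.
battles = pd.read_json("arena_hard_battles.jsonl", lines=True)
print(battles.shape)                     # e.g. (124067, 4)
print(battles["winner"].value_counts())  # outcome distribution
print(pd.concat([battles["model_a"], battles["model_b"]]).nunique())  # distinct models
```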

CodingWithTim commented 2 weeks ago

@xiamengzhou Hi, is this issue still blocking? Sorry, I was busy with some upcoming releases. Your battles jsonl is really, really big :0. Even I haven't used Arena-Hard-Auto on this many models XD. I think it might be a good idea to:

  1. reduce the number of models you are comparing. If you need scores for all of them, I recommend splitting the models into two batches: just move some of the model judgment files out of the directory where they are stored (see the sketch after this list).
  2. check out this related issue if you haven't already: link
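Roughly what I mean by batching, as a sketch. The directory below follows the default layout I'd expect under data/, and the judge-model name is just an example; adjust to wherever your judgments actually live:

```python
from pathlib import Path
import shutil

# Park half of the per-model judgment files outside the directory that
# show_result.py scans; score the remaining batch, then swap the halves.
judgment_dir = Path("data/arena-hard-v0.1/model_judgment/gpt-4-1106-preview")
parked = judgment_dir.parent / "parked"
parked.mkdir(exist_ok=True)

files = sorted(judgment_dir.glob("*.jsonl"))
for f in files[len(files) // 2:]:  # second half goes to the parked dir
    shutil.move(str(f), str(parked / f.name))
```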

Let me know how it goes!

xiamengzhou commented 2 weeks ago

Thank you so much, @CodingWithTim, for the pointer!

I'm curious about the best practice for obtaining the final score for a model. Should I run the logistic regression with all the models I have (I noticed you reported scores for many models in the README, and by default the script takes all the model evals)? Or should I use a fixed set of models, or even just one? Would different strategies impact the final score?

infwinston commented 2 weeks ago

I think a possible cause is that the logistic regression problem is ill-conditioned when some of the models do not have enough samples, so the solver takes more iterations than usual to converge. We might need to add a small regularization term to mitigate this, but usually I find it fine to ignore the warning. We'll investigate more.
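If we do go that route, it is a one-parameter change: scikit-learn's LogisticRegression already applies an L2 penalty, controlled by the inverse-strength parameter C. A sketch, not necessarily what show_result.py does today:

```python
from sklearn.linear_model import LogisticRegression

# C is the INVERSE regularization strength: smaller C = stronger L2 penalty,
# which conditions the problem better when some models have few battles.
lr = LogisticRegression(
    fit_intercept=False,
    C=0.1,          # stronger penalty than the default C=1.0
    max_iter=1000,  # also give lbfgs more headroom
)
# lr.fit(X, y) as in the sketch above
```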

xiamengzhou commented 2 weeks ago

Thanks @infwinston @CodingWithTim! It seems that neither the win rate nor the Elo rating is affected by the number of models in the batch, as long as the baseline model is fixed. Thanks for your kind help. The issue is gone when I run show_result.py for a single model's verdict.
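A direct way to sanity-check this: the head-to-head win rate against the baseline only counts battles between that pair, so adding or removing other models cannot change it. A sketch, with the column names assumed as above and hypothetical model names:

```python
import pandas as pd

battles = pd.read_json("arena_hard_battles.jsonl", lines=True)
model, baseline = "my-model", "gpt4-0414"  # hypothetical candidate name

# Keep only battles between exactly this pair; other models never enter.
pair = battles[
    ((battles["model_a"] == model) & (battles["model_b"] == baseline))
    | ((battles["model_a"] == baseline) & (battles["model_b"] == model))
]
wins = ((pair["winner"] == "model_a") & (pair["model_a"] == model)) | (
    (pair["winner"] == "model_b") & (pair["model_b"] == model)
)
print(f"win rate vs {baseline}: {wins.mean():.3f} over {len(pair)} battles")
```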