lmarena / arena-hard-auto

Arena-Hard-Auto: An automatic LLM benchmark.
Apache License 2.0

Discrepancy in Scores When Switching GPT Model Versions #16

Closed wlhgtc closed 5 months ago

wlhgtc commented 5 months ago

I recently judged the model answers provided here and decided to switch the GPT version from gpt-4-1106-preview to gpt-4-0125-preview, because I can only access instances of that version. After making this change, I observed a discrepancy of over 440 points (out of 1000) in the score compared to the judgment benchmarks listed in your documentation.
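For context, this is roughly how I pointed the judge at the other model before re-running the judgments; a minimal sketch that assumes the judge is configured in a `config/judge_config.yaml` file with a `judge_model` key (both the path and the key name are assumptions and may differ from the repo's actual config).

```python
# Minimal sketch: switch the judge model before re-running judgment generation.
# Assumptions: the judge is configured in config/judge_config.yaml and the key
# is named judge_model -- both the path and the key name may differ in the repo.
import yaml  # requires PyYAML

CONFIG_PATH = "config/judge_config.yaml"  # assumed location

with open(CONFIG_PATH) as f:
    cfg = yaml.safe_load(f)

cfg["judge_model"] = "gpt-4-0125-preview"  # previously gpt-4-1106-preview

with open(CONFIG_PATH, "w") as f:
    yaml.safe_dump(cfg, f, sort_keys=False)

print("Judge model set to", cfg["judge_model"])
```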

Could you please advise on how to address this issue or suggest any solutions that might help mitigate this discrepancy?

CodingWithTim commented 5 months ago

Thanks for your feedback! Could you elaborate on what you mean by a discrepancy of over 440 points?

To be completely transparent: in our own experiments, changing the judge can produce very different results. We tried gpt-4-turbo-2024-04-09, and the scores were very different for GPT-4-0613 and some other models, since the prompt engineering for the judge was optimized specifically for GPT-4-1106-preview. Even though GPT-4-0125-preview is also a GPT-4-Turbo model, the two models have significant differences in their abilities and preferences. Currently, we recommend not using any judge other than GPT-4-1106-preview. We are working on a new version of Arena Hard that addresses many of the limitations of the current version (ETA around June).

If anyone can come up with a solution to this limitation, we would love to hear it as well! Thanks!

wlhgtc commented 5 months ago

Would using gpt-4-turbo-2024-04-09 be better than gpt-4-1106-preview, since the former is an official release (not a preview version)?

CodingWithTim commented 5 months ago

I wouldn't recommend using gpt-4-turbo-2024-04-09; I have tested it, and the scores for a few models weren't ideal. But we are looking into this and working on implementing and experimenting with new judges. I expect a new system prompt will need to be optimized for the new judge, depending on the judge model's characteristics. Thanks!

wlhgtc commented 5 months ago

@CodingWithTim

I used the answer file (gpt-4-0613.jsonl) from your repository and tested two versions of GPT (gpt-4-0125-preview and gpt-4-turbo-2024-04-09) as judges. Here are the results: [screenshot of the resulting scores attached]

Would it be possible for you to open source all the model answers? This would allow us to obtain the rank scores for a specific GPT version.

Regarding the earlier question about the "discrepancy of over 440 points": I meant that across the 1000 judgment records (for the two versions of GPT), 440 out of 1000 results were different.
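For reference, I counted the differing results with a small script along these lines; a minimal sketch that assumes each judgment file is JSONL with per-record `question_id` and `score` fields (the field names are placeholders and may not match the repo's actual schema).

```python
# Minimal sketch: count how many judgments differ between two judge runs.
# Assumptions: each file is JSONL, one record per judgment, with question_id
# and score fields -- both field names are placeholders.
import json

def load_judgments(path):
    """Map each question_id to the judge's score/verdict for that question."""
    records = {}
    with open(path) as f:
        for line in f:
            rec = json.loads(line)
            records[rec["question_id"]] = rec["score"]
    return records

a = load_judgments("judgments_gpt-4-1106-preview.jsonl")  # documented judge
b = load_judgments("judgments_gpt-4-0125-preview.jsonl")  # my judge

shared = a.keys() & b.keys()
diff = sum(1 for qid in shared if a[qid] != b[qid])
print(f"{diff} of {len(shared)} judgments differ")
```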

CodingWithTim commented 5 months ago

Hi, it seems like they are pretty close in score (gpt-4-0125-preview and gpt-4-turbo-2024-04-09). The rest of the model answers are available on Hugging Face at lmsys/arena-hard-browser. Is this what you are looking for?
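For anyone who wants to pull the files locally, something like the following should work; a minimal sketch using `huggingface_hub.snapshot_download`, where the `repo_type` ("space" here) is an assumption and may need to be "dataset" instead.

```python
# Minimal sketch: download the published model answers locally.
# Assumption: lmsys/arena-hard-browser is fetchable via huggingface_hub and is
# hosted as a Space -- change repo_type to "dataset" if that is not the case.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="lmsys/arena-hard-browser",
    repo_type="space",
    local_dir="arena-hard-browser",
)
print("Files downloaded to", local_dir)
```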

wlhgtc commented 4 months ago

> Hi, it seems like they are pretty close in score (gpt-4-0125-preview and gpt-4-turbo-2024-04-09). The rest of the model answers are available on Hugging Face at lmsys/arena-hard-browser. Is this what you are looking for?

Yes, that's exactly what I need.

For some reason, I can only choose between gpt-4-0125-preview and gpt-4-turbo-2024-04-09. Based on your experience, does the former seem like a good choice?

nirrai21 commented 1 month ago

@CodingWithTim Hi, you mentioned in this conversation that you intend to change the judge model in future versions. Are you still considering/working on that? It would be quite useful to have a faster and cheaper way to compute the Arena-Hard score.

CodingWithTim commented 1 month ago

@nirrai21 Yes, we are still working on that! Arena-Hard-v0.2 will introduce new judges.

nirrai21 commented 1 month ago

@CodingWithTim Great news! Do you have any estimate on when you guys intend to release it?

CodingWithTim commented 1 month ago

@nirrai21 I am unsure when we will be able to release the new version as of right now, but it will be very soon after we have debiased the LLM judges as much as possible. Probably 1-2 months?