lm-sys / arena-hard-auto

Arena-Hard-Auto: An automatic LLM benchmark.

Models testing themselves will always be biased. #10

Closed: HideLord closed this issue 2 months ago

HideLord commented 2 months ago

The prompt is formatted such that the judge is supposed to answer the question before judging. If the model is judging itself, then it will compare its own answer with... its own answer, which will be mostly the same. That holds true for models of the same family as well: GPT-4-turbo-preview will have nearly the same answers as GPT-4-turbo. Same with the Claude suite.

Naturally, the judge is going to prefer the answer that most resembles its own. In the end, I'm wondering whether having the judge generate its own answer at all is necessary.
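For concreteness, here is a minimal sketch (in Python) of the kind of judge prompt I mean; the template and variable names are my own illustration, not the repo's actual prompt:

```python
# Hypothetical sketch of a pairwise judge prompt in which the judge is asked
# to write its own answer before comparing the two candidates.
JUDGE_TEMPLATE = """\
You are judging two assistant responses to the user question below.
First, write your own answer to the question. Then compare Assistant A
and Assistant B against your answer and decide which response is better.

[User Question]
{question}

[Assistant A's Answer]
{answer_a}

[Assistant B's Answer]
{answer_b}
"""

def build_judge_prompt(question: str, answer_a: str, answer_b: str) -> str:
    # If the judge model (or a sibling model from the same family) also wrote
    # answer_a, the "own answer" it produces here will closely match answer_a,
    # which is the self-preference concern raised in this issue.
    return JUDGE_TEMPLATE.format(
        question=question, answer_a=answer_a, answer_b=answer_b
    )
```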

CodingWithTim commented 2 months ago

Thanks for the feedback! In early experiments, we found that when the judge generates its own answer first, the effect is similar to chain of thought or providing a reference answer: it improves the judge's ability to check the factual accuracy of each model's response, and hence significantly improves judgment quality. Because Arena Hard's prompts are very difficult, they are also very difficult to judge using only a simple pairwise comparison. However, we are definitely looking for more effective ways of judging that reduce bias and improve accuracy. We are studying this currently and will publish the results in an upcoming paper.
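Roughly, the flow looks like the sketch below; the helper names and verdict labels here are illustrative placeholders, not the exact code in this repo:

```python
# Sketch of the two-step judging flow: the judge first drafts its own answer
# (acting like a reference answer / chain of thought), then performs the
# pairwise comparison. `call_llm` is a stand-in for a chat-completion client.
def call_llm(model: str, prompt: str) -> str:
    raise NotImplementedError("plug in your chat-completion client here")

def judge_pair(judge_model: str, question: str, answer_a: str, answer_b: str) -> str:
    # Step 1: the judge answers the question itself; this reference helps it
    # check the factual accuracy of the two candidate answers.
    reference = call_llm(judge_model, f"Answer the following question:\n{question}")

    # Step 2: pairwise comparison, with the judge's own answer as a reference.
    verdict_prompt = (
        f"Question:\n{question}\n\n"
        f"Your reference answer:\n{reference}\n\n"
        f"Assistant A's answer:\n{answer_a}\n\n"
        f"Assistant B's answer:\n{answer_b}\n\n"
        "Compare both answers against your reference and output one verdict: "
        "A>>B, A>B, A=B, B>A, or B>>A."
    )
    return call_llm(judge_model, verdict_prompt)
```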

And if you haven't already, please check out our blog post linked in the README. It discusses the judging process in detail, along with various experiments on biases, if you are interested.