lm-sys / arena-hard-auto

Arena-Hard-Auto: An automatic LLM benchmark.

Models testing themselves will always be biased. #10

Closed: HideLord closed this issue 2 months ago

HideLord commented 2 months ago

The prompt is formatted such that the judge is supposed to answer the question before judging. If the model is judging itself, then it will compare its own answer with... its own answer, which will be mostly the same. That holds true for models of the same family as well: GPT-4-turbo-preview will have nearly the same answers as GPT-4-turbo. Same with the Claude suite.

Naturally, the judge is going to prefer the answer that most resembles its own. In the end, I'm wondering whether having the judge generate its own answer at all is necessary.
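For concreteness, here is a minimal sketch (in Python) of the kind of judge prompt I mean; the template and variable names are my own illustration, not the repo's actual prompt:

```python
# Hypothetical sketch of a pairwise judge prompt in which the judge is asked
# to write its own answer before comparing the two candidates.
JUDGE_TEMPLATE = """\
You are judging two assistant responses to the user question below.
First, write your own answer to the question. Then compare Assistant A
and Assistant B against your answer and decide which response is better.

[User Question]
{question}

[Assistant A's Answer]
{answer_a}

[Assistant B's Answer]
{answer_b}
"""

def build_judge_prompt(question: str, answer_a: str, answer_b: str) -> str:
    # If the judge model (or a sibling model from the same family) also wrote
    # answer_a, the "own answer" it produces here will closely match answer_a,
    # which is the self-preference concern raised in this issue.
    return JUDGE_TEMPLATE.format(
        question=question, answer_a=answer_a, answer_b=answer_b
    )
```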

CodingWithTim commented 2 months ago

Thanks for the feedback! In early experiments, we found that when the judge generates its own answer first, the effect is similar to chain of thought or providing a reference answer: it improves the judge's ability to check the factual accuracy of each model's response, and hence significantly improves judgment quality. Because Arena Hard's prompts are very difficult, they are also very difficult to judge using only a simple pairwise comparison. However, we are definitely looking for more effective ways of judging that reduce bias and improve accuracy. We are studying this currently and will publish the results in an upcoming paper.
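Roughly, the flow looks like the sketch below; the helper names and verdict labels here are illustrative placeholders, not the exact code in this repo:

```python
# Sketch of the two-step judging flow: the judge first drafts its own answer
# (acting like a reference answer / chain of thought), then performs the
# pairwise comparison. `call_llm` is a stand-in for a chat-completion client.
def call_llm(model: str, prompt: str) -> str:
    raise NotImplementedError("plug in your chat-completion client here")

def judge_pair(judge_model: str, question: str, answer_a: str, answer_b: str) -> str:
    # Step 1: the judge answers the question itself; this reference helps it
    # check the factual accuracy of the two candidate answers.
    reference = call_llm(judge_model, f"Answer the following question:\n{question}")

    # Step 2: pairwise comparison, with the judge's own answer as a reference.
    verdict_prompt = (
        f"Question:\n{question}\n\n"
        f"Your reference answer:\n{reference}\n\n"
        f"Assistant A's answer:\n{answer_a}\n\n"
        f"Assistant B's answer:\n{answer_b}\n\n"
        "Compare both answers against your reference and output one verdict: "
        "A>>B, A>B, A=B, B>A, or B>>A."
    )
    return call_llm(judge_model, verdict_prompt)
```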

And if you haven't already, please check out our blog post linked in the README. It discusses the judging process in detail, along with various experiments on biases, if you are interested.