lm-sys / arena-hard-auto

Arena-Hard-Auto: An automatic LLM benchmark.
Apache License 2.0

Majority of questions are coding questions! #5

Closed nxphi47 closed 2 months ago

nxphi47 commented 2 months ago

Thanks for the great work. The questions are so hard that I couldn't answer any of them.

However, I found that a large majority of the questions are coding and CS-related, which would favor code-oriented models and disadvantage other general-purpose models, even though this benchmark is advertised as a general-purpose MT-Bench replacement.
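For reference, here is a rough sketch of how one could check the distribution locally. The file path and the per-prompt `cluster` field are my assumptions about the released data layout, so adjust as needed:

```python
import json
from collections import Counter

# Assumed path/schema for the released questions; adjust if it differs.
QUESTION_FILE = "data/arena-hard-v0.1/question.jsonl"

counts = Counter()
with open(QUESTION_FILE) as f:
    for line in f:
        q = json.loads(line)
        # Each prompt is assumed to carry a topic-cluster label.
        counts[q.get("cluster", "unknown")] += 1

total = sum(counts.values())
for cluster, n in counts.most_common():
    print(f"{cluster:60s} {n:4d} ({n / total:.1%})")
```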

What do you think?

tonysy commented 2 months ago

Yes, I also find that many problems are coding/math-related, which means the evaluation results are biased.

CodingWithTim commented 2 months ago

Hi! Thanks for the feedback. If you guys haven't already, please check out the Pipeline section of our blog post, where we explain how these prompts are selected.

To clarify any misunderstanding, we do not advertise this benchmark as comprehensive across all categories. Arena Hard v0.1, as its name suggests, consists of the "harder" questions from Chatbot Arena. Broadly, the pipeline targets harder, more problem-solving-oriented questions. It turns out that many of the harder, problem-solving-oriented questions people ask on Chatbot Arena are coding and CS related (likely because many Arena users are CS enthusiasts themselves). As a by-product, Arena Hard v0.1 contains a lot of coding/CS questions, though a significant portion of the prompts are not CS related (e.g., math, marketing, medical, etc.).
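For intuition only, here is a rough sketch of the kind of criteria-based filtering described in the blog post: prompts (grouped into topic clusters) are scored by an LLM judge against a list of quality criteria, and only the highest-scoring clusters are kept. The criteria list, threshold, and `judge_prompt_quality` helper below are illustrative placeholders, not the actual pipeline code.

```python
from statistics import mean

# Hypothetical quality criteria, loosely paraphrasing the blog post;
# the real pipeline and its exact wording live in the blog post and repo.
CRITERIA = [
    "specificity",
    "domain knowledge",
    "complexity",
    "problem-solving",
    "creativity",
    "technical accuracy",
    "real-world application",
]

def judge_prompt_quality(prompt: str) -> float:
    """Placeholder: in practice an LLM judge scores `prompt` against
    each criterion; here we just return a dummy score."""
    return 0.0  # replace with a real LLM-judge call

def filter_hard_clusters(clusters: dict[str, list[str]], threshold: float = 5.0):
    """Keep clusters whose mean judged quality clears the threshold."""
    kept = {}
    for name, prompts in clusters.items():
        score = mean(judge_prompt_quality(p) for p in prompts)
        if score >= threshold:
            kept[name] = prompts
    return kept
```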

It is definitely possible that Arena Hard v0.1 favors coding-oriented models; this is one of the benchmark's limitations. We are planning to expand and improve Arena Hard v0.1's category coverage and diversity, alongside many other aspects. Once again, thanks for the feedback! We will keep this in mind in future iterations.