Evaluation short description
Many benchmarks are becoming saturated by new models. LMSYS has crowd-sourced a variety of hard prompts from the community, and performance on them correlates strongly with Chatbot Arena Elo scores.
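Since the benchmark's value rests on this Elo correlation, a minimal sketch of how Elo ratings are derived from pairwise battle outcomes may help; the K-factor of 32 and 1000-point starting rating below are illustrative assumptions, not the LMSYS implementation.

```python
# Minimal Elo-rating sketch: illustrative only, not the LMSYS code.
# K (update step) and the 1000-point starting rating are assumptions.
from collections import defaultdict

def compute_elo(battles, k=32, initial=1000):
    """battles: list of (model_a, model_b, winner) tuples,
    where winner is "a", "b", or "tie"."""
    ratings = defaultdict(lambda: initial)
    for model_a, model_b, winner in battles:
        ra, rb = ratings[model_a], ratings[model_b]
        # Expected score of model_a under the Elo logistic model.
        expected_a = 1 / (1 + 10 ** ((rb - ra) / 400))
        score_a = {"a": 1.0, "b": 0.0, "tie": 0.5}[winner]
        ratings[model_a] = ra + k * (score_a - expected_a)
        ratings[model_b] = rb + k * ((1 - score_a) - (1 - expected_a))
    return dict(ratings)

# Example: three head-to-head judgments between two models.
print(compute_elo([("model-x", "model-y", "a"),
                   ("model-x", "model-y", "a"),
                   ("model-x", "model-y", "tie")]))
```

In practice LMSYS fits Chatbot Arena ratings with a Bradley-Terry model over many battles, but the logistic update above captures the same pairwise-comparison idea.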
How widely is it used in the community?
Recent papers are starting to report ArenaHard as a core metric for measuring improvements from new post-training methods. It is also emerging as an alternative to MT-Bench due to its difficulty and its real-world prompt sources.
Evaluation metadata
Provide all available metadata.