Evaluation short description
Many benchmarks are becoming saturated by new models. LMSYS has crowd-sourced a variety of hard prompts from the community, and performance on them correlates strongly with Chatbot Arena Elo scores.
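Since the benchmark's value rests on this Elo correlation, a minimal sketch of how Elo ratings are derived from pairwise battle outcomes may help; the K-factor of 32 and 1000-point starting rating below are illustrative assumptions, not the LMSYS implementation.

```python
# Minimal Elo-rating sketch: illustrative only, not the LMSYS code.
# K (update step) and the 1000-point starting rating are assumptions.
from collections import defaultdict

def compute_elo(battles, k=32, initial=1000):
    """battles: list of (model_a, model_b, winner) tuples,
    where winner is "a", "b", or "tie"."""
    ratings = defaultdict(lambda: initial)
    for model_a, model_b, winner in battles:
        ra, rb = ratings[model_a], ratings[model_b]
        # Expected score of model_a under the Elo logistic model.
        expected_a = 1 / (1 + 10 ** ((rb - ra) / 400))
        score_a = {"a": 1.0, "b": 0.0, "tie": 0.5}[winner]
        ratings[model_a] = ra + k * (score_a - expected_a)
        ratings[model_b] = rb + k * ((1 - score_a) - (1 - expected_a))
    return dict(ratings)

# Example: three head-to-head judgments between two models.
print(compute_elo([("model-x", "model-y", "a"),
                   ("model-x", "model-y", "a"),
                   ("model-x", "model-y", "tie")]))
```

In practice LMSYS fits Chatbot Arena ratings with a Bradley-Terry model over many battles, but the logistic update above captures the same pairwise-comparison idea.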
How widely is it used in the community?
Recent papers are starting to report ArenaHard as a core metric for measuring improvements from new post-training methods. It is also emerging as an alternative to MT-Bench due to its difficulty and its real-world prompt sources.
Evaluation metadata
Provide all available metadata.