huggingface / lighteval

Lighteval is your all-in-one toolkit for evaluating LLMs across multiple backends
MIT License

[EVAL] Add ArenaHardAuto #325

Open lewtun opened 2 months ago

lewtun commented 2 months ago

Evaluation short description

Many benchmarks are becoming saturated by new models. LMSYS has crowd-sourced a set of hard prompts from the community, and the resulting benchmark shows a strong correlation with Elo scores.

Recent papers are starting to report ArenaHard as a core metric for measuring the improvements from new post-training methods. It is also emerging as an alternative to MT-Bench due to its difficulty and its real-world source of prompts.
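
For context on what the metric reports: Arena-Hard-Auto scores a candidate model by having a strong LLM judge compare its answers against a fixed baseline model's answers on each hard prompt, then aggregating the verdicts into a win rate. The sketch below is purely illustrative (the verdict labels and data are made up and do not reflect lighteval's or LMSYS's actual code); it only shows how pairwise verdicts on a five-point scale could be collapsed into the headline number.

```python
from collections import Counter

# Illustrative pairwise-judge verdicts for one candidate model ("A") vs. the
# baseline ("B"). The labels and data here are invented for the example.
verdicts = ["A>>B", "A>B", "A=B", "B>A", "A>B", "A>>B", "B>>A", "A=B"]

# Map each verdict to the candidate's score for that prompt:
# a clear or slight win counts as 1, a tie as 0.5, a loss as 0.
SCORES = {"A>>B": 1.0, "A>B": 1.0, "A=B": 0.5, "B>A": 0.0, "B>>A": 0.0}

def win_rate(verdicts: list[str]) -> float:
    """Fraction of pairwise comparisons won by the candidate, ties counted as half."""
    return sum(SCORES[v] for v in verdicts) / len(verdicts)

print(Counter(verdicts))
print(f"win rate vs. baseline: {win_rate(verdicts):.1%}")
```

The official benchmark additionally controls for position bias by judging both answer orderings and reports bootstrapped confidence intervals around this win rate, but the aggregation idea is the same.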

Evaluation metadata

Provide all available