LightEval is a lightweight LLM evaluation suite that Hugging Face has been using internally with the recently released LLM data processing library datatrove and LLM training library nanotron.
Many papers nowadays report the maj@k metric for math benchmarks like GSM8K and MATH, where the model generates k candidate answers to a problem and the most common answer is chosen as the final solution (see source paper for details).
It would be nice to support maj@k as a metric for these benchmarks, potentially also including the ability to use CoT prompts, as is common practice.
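For reference, the voting step itself is simple. Below is a minimal sketch of maj@k aggregation (not LightEval's actual API; the `maj_at_k` helper and the assumption that final answers have already been extracted from the generations are mine):

```python
from collections import Counter

def maj_at_k(candidates):
    """Pick the most common answer among k sampled candidates (majority voting).

    `candidates` is a list of k final-answer strings; extracting the final
    answer from each generation (e.g. from a CoT trace) is assumed to happen
    upstream.
    """
    counts = Counter(candidates)
    # most_common(1) returns [(answer, count)]; ties break by first occurrence
    answer, _ = counts.most_common(1)[0]
    return answer

# Example: k = 5 sampled answers to a GSM8K-style problem
samples = ["42", "41", "42", "42", "40"]
print(maj_at_k(samples))  # prints "42"
```

The harder parts in practice are answer extraction and normalization (so that e.g. "42" and "42.0" vote together), which is where a CoT-aware prompt template would come in.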