stanford-crfm / helm

Holistic Evaluation of Language Models (HELM), a framework to increase the transparency of language models (https://arxiv.org/abs/2211.09110). This framework is also used to evaluate text-to-image models in Holistic Evaluation of Text-to-Image Models (HEIM) (https://arxiv.org/abs/2311.04287).
https://crfm.stanford.edu/helm
Apache License 2.0

Incorrect scoring due to answer format mismatch in MMLU evaluation #2939

Open · DerryChan opened this issue 4 weeks ago

DerryChan commented 4 weeks ago

Description

During the MMLU evaluation of our LLM, we encountered an issue where correct answers are being marked as incorrect due to format mismatches. Specifically, when the model outputs the correct numerical answer but includes a preceding letter (e.g., "C. 12" instead of just "12"), the scoring system fails to recognize it as correct.
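For illustration, this is roughly how a strict exact-match comparison behaves on the output described above (a minimal sketch of the observed behavior, not HELM's actual metric code):

```python
# Minimal illustration of the mismatch (not HELM's actual metric code).
reference = "12"
model_output = "C. 12"  # correct value, but prefixed with the choice letter

print(model_output == reference)          # False: strict exact match fails
print(model_output.strip() == reference)  # False: whitespace stripping alone is not enough
```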

Current Behavior

The scoring system marks answers as incorrect if they don't exactly match the reference answer format, even if the numerical value is correct.

Expected Behavior

The scoring system should correctly identify and score answers that are numerically correct, regardless of minor formatting differences such as a leading choice letter.
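For example, a post-processing step along these lines could normalize the model output before comparison (a hypothetical sketch with a made-up `normalize_answer` helper, not an existing HELM option):

```python
import re

def normalize_answer(text: str) -> str:
    """Strip an optional leading choice letter like 'C.' or 'C)' before comparing answers."""
    text = text.strip()
    # Drop a single leading letter followed by '.', ')' or ':' (hypothetical heuristic).
    text = re.sub(r"^[A-Da-d]\s*[.):]\s*", "", text)
    return text.strip()

assert normalize_answer("C. 12") == "12"
assert normalize_answer("12") == "12"
```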

yifanmai commented 3 weeks ago

Hi @DerryChan, unfortunately this is something we don't plan to support for the default built-in MMLU scenario.

Some suggestions that you could try for your use case: