stanford-crfm / helm

Holistic Evaluation of Language Models (HELM), a framework to increase the transparency of language models (https://arxiv.org/abs/2211.09110). This framework is also used to evaluate text-to-image models in Holistic Evaluation of Text-to-Image Models (HEIM) (https://arxiv.org/abs/2311.04287).
https://crfm.stanford.edu/helm
Apache License 2.0

Incorrect scoring due to answer format mismatch in MMLU evaluation #2939

Open · DerryChan opened this issue 4 weeks ago

DerryChan commented 4 weeks ago

Description

During the MMLU evaluation of our LLM, we encountered an issue where correct answers are being marked as incorrect due to format mismatches. Specifically, when the model outputs the correct numerical answer but includes a preceding letter (e.g., "C. 12" instead of just "12"), the scoring system fails to recognize it as correct.
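For illustration, this is roughly how a strict exact-match comparison behaves on the output described above (a minimal sketch of the observed behavior, not HELM's actual metric code):

```python
# Minimal illustration of the mismatch (not HELM's actual metric code).
reference = "12"
model_output = "C. 12"  # correct value, but prefixed with the choice letter

print(model_output == reference)          # False: strict exact match fails
print(model_output.strip() == reference)  # False: whitespace stripping alone is not enough
```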

Current Behavior

The scoring system marks answers as incorrect if they don't exactly match the reference answer format, even if the numerical value is correct.

Expected Behavior

The scoring system should correctly identify and score answers that are numerically correct, regardless of minor formatting differences such as a leading choice letter.
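For example, a post-processing step along these lines could normalize the model output before comparison (a hypothetical sketch with a made-up `normalize_answer` helper, not an existing HELM option):

```python
import re

def normalize_answer(text: str) -> str:
    """Strip an optional leading choice letter like 'C.' or 'C)' before comparing answers."""
    text = text.strip()
    # Drop a single leading letter followed by '.', ')' or ':' (hypothetical heuristic).
    text = re.sub(r"^[A-Da-d]\s*[.):]\s*", "", text)
    return text.strip()

assert normalize_answer("C. 12") == "12"
assert normalize_answer("12") == "12"
```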

yifanmai commented 3 weeks ago

Hi @DerryChan, unfortunately this is something we don't plan to support for the default built-in MMLU scenario.

Some suggestions that you could try for your use case: