stanford-crfm / helm

Holistic Evaluation of Language Models (HELM), a framework to increase the transparency of language models (https://arxiv.org/abs/2211.09110). This framework is also used to evaluate text-to-image models in HEIM (https://arxiv.org/abs/2311.04287) and vision-language models in VHELM (https://arxiv.org/abs/2410.07112).
https://crfm.stanford.edu/helm
Apache License 2.0

help understand HELM results #1823

Closed aniketmaurya closed 4 weeks ago

aniketmaurya commented 1 year ago

I am running the Llama-2 7B base model on TruthfulQA and get the following results. On inspecting the prediction outputs, I see that results are unmapped, and I don't understand what that means. Any help here would be deeply appreciated. (I am using the NeurIPS client with Lit-GPT for evaluation.)

[screenshots of the TruthfulQA results and the prediction outputs showing "unmapped"]
yifanmai commented 1 year ago

Sorry, I left an earlier comment that was an incorrect diagnosis.

It looks like your model's output text contains the prompt and the generated text concatenated together. This is not the behavior HELM expects.

When the echo_prompt field is set to true, the output should contain the prompt and the generated text concatenated together. When echo_prompt is set to false, the output should contain only the model's generated text, and should not contain the prompt.

In this case, echo_prompt is set to false, but the output text still contains the prompt and the generated text concatenated together. This prevents HELM from extracting the multiple-choice question answer from the output text.

To debug this, you can look at the output scenario_state.json file and check request_states[0].result.completions[0].text.
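As a minimal sketch of that check (the result.completions path comes from the comment above; the request.prompt and request.echo_prompt field names and the output path are assumptions about the run layout), you could verify whether the client is echoing the prompt like this:

import json

# Path to the run output is an example; adjust to your suite/run directory.
with open("benchmark_output/runs/my-suite/my-run/scenario_state.json") as f:
    scenario_state = json.load(f)

first_state = scenario_state["request_states"][0]
completion_text = first_state["result"]["completions"][0]["text"]
prompt = first_state["request"]["prompt"]

# With echo_prompt set to false, the completion should not start with the prompt.
if not first_state["request"]["echo_prompt"]:
    assert not completion_text.startswith(prompt), "client is echoing the prompt"
print(repr(completion_text))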

aniketmaurya commented 1 year ago

Thank you @yifanmai, I was able to fix this by removing the prompt from the output. Another question: is there a benchmark result for an open-source model that we can use to cross-check the accuracy of our implementation?

yifanmai commented 1 year ago

For Llama 2 (7B) on NaturalQuestions (1000 evaluation instances, 1 random seed), our internal results are:

Exact Match: 0.28
Exact Match (Robustness): 0.229
Exact Match (Fairness): 0.219

We're hoping to publish this very soon! In the meantime, let me know if there are any more numbers you'd like to look at.

I would expect your results to be slightly different depending on which implementation you use (we use a half-precision implementation).

aniketmaurya commented 1 year ago

Hi @yifanmai, thank you for the numbers. May I ask if you have scores for a smaller model like Pythia 1B, or for a smaller scenario that allows quick iteration? (NaturalQuestions took more than 5 hours to evaluate.)

Also, would it be possible for you to share the run_spec configuration, e.g., whether you used data_augmentation?

aniketmaurya commented 1 year ago

I evaluated Llama-2 7B on QuAC and BoolQ.

For QuAC I get the following (it should have been around 39.7 based on published benchmarks):

[screenshot of the QuAC results]
aniketmaurya commented 1 year ago

For BoolQ I get a score of 0, and on checking the BoolQ raw results, it seems like there is still some issue with the scoring.

[screenshots of the BoolQ results and raw predictions]
yifanmai commented 1 year ago

Could you send me the scenario_state.json file from the BoolQ run? If you are able to send me the JSON of the first few elements of the request_states field as a Gist, that would be even better.

The very likely cause is that the end-of-sequence token </s> was appended to the generated text, which causes exact match to fail.
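A quick way to check for this in the run output (same request_states path as earlier in this thread; the </s> literal is an assumption about the tokenizer in use):

import json

with open("scenario_state.json") as f:  # path to the BoolQ run output
    state = json.load(f)

# Count completions that still end with an end-of-sequence marker; any non-zero
# count would explain exact-match scores of 0.
texts = [rs["result"]["completions"][0]["text"] for rs in state["request_states"]]
num_with_eos = sum(1 for t in texts if t.rstrip().endswith("</s>"))
print(f"{num_with_eos}/{len(texts)} completions end with </s>")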

I don't have Pythia (1B) results yet, but I can do a run on a few scenarios and send you the results. Let me know which scenarios you are interested in.

I'd also suggest using MMLU for debugging and for replicating previously published numbers, using the five subject subsets selected by HELM:

entries: [
  {description: "mmlu:model=lightningai/lit-gpt,subject=abstract_algebra", priority: 2}
  {description: "mmlu:model=lightningai/lit-gpt,subject=college_chemistry", priority: 2}
  {description: "mmlu:model=lightningai/lit-gpt,subject=computer_security", priority: 2}
  {description: "mmlu:model=lightningai/lit-gpt,subject=econometrics", priority: 2}
  {description: "mmlu:model=lightningai/lit-gpt,subject=us_foreign_policy", priority: 2}
]

This is because MMLU tends to be a lot cheaper and faster to run, due to having short questions and answers.

The average MMLU score will usually be within a percentage point of the Hugging Face leaderboard, despite the differences in evaluation protocol. You can run the Hugging Face version locally using HELM's Hugging Face Hub integration and compare it against the Lit-GPT version, which lets you compare both the aggregate results and the raw generations.
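As a sketch of that comparison step (run directory names are placeholders, and it assumes both runs list instances in the same order), you could diff the raw generations of the two runs directly from their scenario_state.json files:

import json

def load_completions(path: str) -> list[str]:
    # Return the first completion text for each request in a scenario_state.json.
    with open(path) as f:
        state = json.load(f)
    return [rs["result"]["completions"][0]["text"] for rs in state["request_states"]]

hf_texts = load_completions("benchmark_output/runs/hf/scenario_state.json")
litgpt_texts = load_completions("benchmark_output/runs/lit-gpt/scenario_state.json")

# Print any instances where the raw generations differ, to spot formatting issues
# (echoed prompts, stray end-of-sequence tokens, extra newlines, and so on).
for i, (a, b) in enumerate(zip(hf_texts, litgpt_texts)):
    if a.strip() != b.strip():
        print(f"instance {i}:\n  HF:      {a!r}\n  Lit-GPT: {b!r}")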

aniketmaurya commented 1 year ago

Thanks for your input @yifanmai! I ran HF locally on MMLU and compared it with Lit-GPT. The numbers match closely, so I think the implementation is almost there. I will also attach the scenario_state.json soon.

HF

[screenshot of MMLU results from the HF run]

Lit-GPT

[screenshot of MMLU results from the Lit-GPT run]
aniketmaurya commented 1 year ago

Hi @yifanmai, please find attached the link to the scenario_state.json file for BoolQ. One issue I see is that the stop token \n does not break the prediction loop, because the LLM produces \n\n, which has a different token value from a single \n.

yifanmai commented 4 weeks ago

Closing due to staleness; feel free to reopen if you have further questions.

For the record, if the model runs into this issue, I would recommend manually truncating the response in the client, using either string matching or truncate_and_tokenize_response_text() (example).
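For the string-matching option, a minimal sketch (the stop sequence and the </s> literal here are just the ones discussed in this thread, not a general recipe; the truncate_and_tokenize_response_text() helper mentioned above can be used instead):

def truncate_response(text: str, stop_sequences=("\n",), eos_token: str = "</s>") -> str:
    # Cut the generation at the first stop sequence, then drop a trailing EOS token.
    for stop in stop_sequences:
        index = text.find(stop)
        if index != -1:
            text = text[:index]
    text = text.strip()
    if text.endswith(eos_token):
        text = text[: -len(eos_token)].strip()
    return text

# Because string matching operates on the decoded text, a model that emits "\n\n"
# as a single token is still truncated correctly at the first "\n".
print(truncate_response("Yes\n\nQuestion: ..."))  # prints: Yes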