Sorry, I left an earlier comment that was an incorrect diagnosis.
It looks like your model's output text contains the prompt and the generated text concatenated together. This is not the behavior HELM expects.
When the `echo_prompt` field is set to true, the output should contain the prompt and the generated text concatenated together. When `echo_prompt` is set to false, the output should contain only the model's generated text, without the prompt.

In this case, `echo_prompt` is set to false, but the output text still contains the prompt and the generated text concatenated together. This prevents HELM from extracting the multiple-choice answer from the output text.
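If the backend returns the full sequence, a minimal sketch of the fix on the client side (assuming the output begins with the prompt verbatim; adjust if your tokenizer round-trips it differently) could be:

```python
def strip_echoed_prompt(prompt: str, output_text: str, echo_prompt: bool) -> str:
    """Return only the newly generated text when echo_prompt is false."""
    if echo_prompt:
        # Keep the concatenated prompt + generation, as HELM expects in this case.
        return output_text
    if output_text.startswith(prompt):
        # Drop the echoed prompt prefix so only the generation remains.
        return output_text[len(prompt):]
    return output_text
```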
To debug this, you can look at the output `scenario_state.json` file, under `request_states[0].result.completions[0].text`.
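For example, something along these lines (the run directory here is a placeholder for wherever your output ended up):

```python
import json

# Placeholder path: point this at the scenario_state.json of the run you want to inspect.
with open("benchmark_output/runs/my-suite/my-run/scenario_state.json") as f:
    scenario_state = json.load(f)

# The raw completion text for the first evaluation instance.
print(scenario_state["request_states"][0]["result"]["completions"][0]["text"])
```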
Thank you @yifanmai, I was able to fix this by removing the prompt from the output. Another question: is there a benchmark for an open-source model that we can use to cross-check the accuracy of our implementation?
For Llama 2 (7B) on NaturalQuestions (1000 evaluation instances, 1 random seed), our internal results are:
- Exact Match: 0.28
- Exact Match (Robustness): 0.229
- Exact Match (Fairness): 0.219
We're hopefully publishing this very soon! In the meantime, let me know if there are any more numbers you'd like to look at.
I would expect your results to be slightly different depending on which implementation you use (we use a half-precision implementation).
Hi @yifanmai, thank you for the numbers. May I ask if you have scores for a smaller model like Pythia 1B, or for a smaller scenario that allows quick iterations (NaturalQuestions took more than 5 hours to evaluate)?
Also, would it be possible for you to share the run_spec configuration, e.g. whether you used `data_augmentation`?
I evaluated Llama-2 7B on QuAC and BoolQ.
For QuAC I get the following (it should have been around 39.7 based on benchmarks):
For BoolQ I get a score of 0, and on checking the BoolQ raw results, it seems like there is still some issue with the scoring.
Could you send me the `scenario_state.json` file from the BoolQ run? If you are able to send me the JSON of the first few elements of the `request_states` field as a Gist, that would be even better.
The very likely cause is that the end-of-sequence token `</s>` was appended to the generated text, which causes exact match to fail.
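For example (a toy illustration, not your actual outputs):

```python
reference = "Yes"
completion = "Yes</s>"

print(completion == reference)                        # False: the trailing </s> breaks exact match
print(completion.removesuffix("</s>") == reference)   # True once the marker is stripped
```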
I don't have Pythia (1B) results yet, but I can do a run on a few scenarios and send you the results. Let me know which scenarios you are interested in.
I'd also suggest using MMLU for debugging and replicating previously published numbers, using HELM's selected five subject subsets:
```
entries: [
  {description: "mmlu:model=lightningai/lit-gpt,subject=abstract_algebra", priority: 2}
  {description: "mmlu:model=lightningai/lit-gpt,subject=college_chemistry", priority: 2}
  {description: "mmlu:model=lightningai/lit-gpt,subject=computer_security", priority: 2}
  {description: "mmlu:model=lightningai/lit-gpt,subject=econometrics", priority: 2}
  {description: "mmlu:model=lightningai/lit-gpt,subject=us_foreign_policy", priority: 2}
]
```
This is because MMLU tends to be a lot cheaper and faster to run, due to having short questions and answers.
The average MMLU score will usually be within a percentage point of the Hugging Face leaderboard despite the evaluation protocol differences. You can run the Hugging Face version locally using HELM's Hugging Face Hub integration and compare it against the Lit-GPT version, both on the aggregate results and on the raw generations.
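If it helps, here's a rough sketch of how you could diff the raw generations from the two runs; the paths are placeholders, and it assumes both runs present the instances in the same order:

```python
import json

def load_completions(path: str) -> list[str]:
    """Read a scenario_state.json and return the first completion text of each request."""
    with open(path) as f:
        state = json.load(f)
    return [rs["result"]["completions"][0]["text"] for rs in state["request_states"]]

# Placeholder paths: point these at the two runs' scenario_state.json files.
hf_texts = load_completions("hf_run/scenario_state.json")
litgpt_texts = load_completions("litgpt_run/scenario_state.json")

for i, (hf_text, litgpt_text) in enumerate(zip(hf_texts, litgpt_texts)):
    if hf_text.strip() != litgpt_text.strip():
        print(f"Instance {i}:\n  HF:      {hf_text!r}\n  Lit-GPT: {litgpt_text!r}")
```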
Thanks for your input @yifanmai! I ran HF locally on MMLU and compared with Lit-GPT. The numbers match closely, so I think the implementation is almost there. Also, I will attach the `scenario_state.json` soon.
HF and Lit-GPT results (attached).
Hi @yifanmai, please find the attached link to the `scenario_state.json` file for BoolQ.
One issue I see is that the stop token `\n` is not able to break the prediction loop, since the LLM produces `\n\n`, which has a different token value from a single `\n`.
Closing due to staleness; feel free to reopen if you have further questions.
For the record, if the model runs into this issue, I would recommend manually truncating the response in the client, using either string matching or `truncate_and_tokenize_response_text()` (example).
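A minimal string-matching sketch (the stop sequences below are just examples; `truncate_and_tokenize_response_text()` is the more complete option since it also re-tokenizes the truncated text):

```python
def truncate_on_stop_sequences(text: str, stop_sequences: list[str]) -> str:
    """Cut the generated text at the earliest occurrence of any stop sequence."""
    for stop in stop_sequences:
        index = text.find(stop)
        if index != -1:
            text = text[:index]
    return text

# "\n" as a stop sequence now also truncates outputs where the model emitted "\n\n".
print(truncate_on_stop_sequences("Yes\n\nQuestion: ...", ["\n", "</s>"]))  # -> "Yes"
```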
I am running the Llama-2 7B base model on TruthfulQA and get the following results. On inspecting the prediction outputs, I see the results are `unmapped`, which I don't understand. Any help here would be deeply appreciated. (I am using the NeurIPS client with Lit-GPT for evaluation.)