huggingface / lighteval

LightEval is a lightweight LLM evaluation suite that Hugging Face has been using internally with the recently released LLM data processing library datatrove and LLM training library nanotron.

Anomalously small values for `gemma-2b-it` on GSM8k #82

Closed lewtun closed 4 months ago

lewtun commented 4 months ago

I noticed that the instruct version of gemma-2b gets anomalously small values on GSM8k. Here's the command I'm running:

accelerate launch --multi_gpu --num_processes=8 run_evals_accelerate.py \
    --tasks="lighteval|gsm8k|5|0" \
    --output_dir "./scratch/evals" \
    --model_args "pretrained=google/gemma-2b-it" \
    --override_batch_size 1

With `--use_chat_template`:

| Task              | Version | Metric | Value  | Stderr  |
|-------------------|---------|--------|--------|---------|
| lighteval:gsm8k:5 | 0       | qem    | 0.0341 | ± 0.005 |

Without `--use_chat_template`:

| Task              | Version | Metric | Value  | Stderr   |
|-------------------|---------|--------|--------|----------|
| lighteval:gsm8k:5 | 0       | qem    | 0.0553 | ± 0.0063 |

For reference, the base model gets ~0.174, which is far better.

I think part of the problem is that GSM8k expects the answer to be formatted with `#### {ANSWER}`, and the instruct models are quite inconsistent in this respect because they haven't been told to do so.
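For context, here is a minimal sketch of the kind of answer extraction this scoring relies on (the regex and helper name are illustrative assumptions, not lighteval's actual implementation):

```python
import re

def extract_gsm8k_answer(completion: str) -> str | None:
    # GSM8k ground truths end with "#### <number>"; scoring typically
    # extracts whatever follows the "####" marker and compares it to the
    # gold answer. If the model never emits "####", extraction fails and
    # the sample scores 0 even when the answer itself is correct.
    match = re.search(r"####\s*([\-0-9.,]+)", completion)
    if match is None:
        return None
    return match.group(1).replace(",", "").strip()

# The completion in the example below contains "48" but no "####" marker:
print(extract_gsm8k_answer("So, the total number of legs is 48 legs."))  # None
print(extract_gsm8k_answer("The pets have 48 legs. #### 48"))            # "48"
```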

Here's an instructive example where the model produces the correct answer, but would be scored 0 because it didn't predict `#### {ANSWER}`:

| Prompt | Completion | Ground truth |
|--------|------------|--------------|
| Question: A pet store currently has 5 dogs, 2 cats, and 10 birds. How many legs in total do the pets in the store have? Answer: | There are 5 dogs * 4 legs/dog + 2 cats * 4 legs/cat + 10 birds * 2 legs/bird = 20 legs/dog + 8 legs/cat + 20 legs/bird. So, the total number of legs in the store is 20 + 8 + 20 = 48 legs. | The dogs have 5 dogs * 4 legs/dog = <<5*4=20>>20 legs. The cats have 2 cats * 4 legs/cat = <<2*4=8>>8 legs. The birds have 10 birds * 2 legs/bird = <<10*2=20>>20 legs. The pets have 20 legs + 8 legs + 20 legs = <<20+8+20=48>>48 legs. #### 48 |

Perhaps one solution would be to format the input like GPQA does:

Here are some example questions from experts. Format your final response with: "#### {insert answer here}"

Question: {few_shot_q}
Answer: {few_shot_a}
#### {answer}

... N few shot examples

What is the correct answer to this question: A pet store currently has 5 dogs, 2 cats, and 10 birds. How many legs in total do the pets in the store have?
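A rough sketch of how such a prompt could be assembled (the function name and few-shot structure are hypothetical, not an existing lighteval API):

```python
def build_gsm8k_instruct_prompt(
    question: str,
    few_shots: list[tuple[str, str, str]],  # (question, reasoning, final answer)
) -> str:
    # Prepend an explicit instruction so instruct-tuned models know to
    # emit the "#### {answer}" marker that the scoring looks for.
    header = (
        "Here are some example questions from experts. "
        'Format your final response with: "#### {insert answer here}"'
    )
    blocks = [header]
    for shot_q, shot_reasoning, shot_answer in few_shots:
        blocks.append(f"Question: {shot_q}\nAnswer: {shot_reasoning}\n#### {shot_answer}")
    blocks.append(f"What is the correct answer to this question: {question}")
    return "\n\n".join(blocks)

# Example usage with a single toy few-shot example:
prompt = build_gsm8k_instruct_prompt(
    "A pet store currently has 5 dogs, 2 cats, and 10 birds. "
    "How many legs in total do the pets in the store have?",
    few_shots=[("What is 2 + 2?", "2 + 2 = 4.", "4")],
)
```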

You can see in this example that the 7B instruct model formats the answer correctly: https://hf.co/chat/r/ltNE54h

lewtun commented 4 months ago

FYI for the 7B model I am seeing a lot of truncated responses like `Sure, here is the solution`, which also suggests we are losing candidate answers in the parsing.

clefourrier commented 4 months ago

For GSM8K, we constrained the evaluation to match the harness we are launching on the leaderboard, so changing the prompt would break that parity. However, maybe we could add an "instruct" parameter?

It's fascinating that instruct models become worse at following few-shot formatting!

clefourrier commented 4 months ago

Edit: I checked, and the truncation used in the harness has evolved since the version above. I'm going to edit the allowed EOS token to fix this.

NathanHB commented 4 months ago

The truncation used in the harness has evolved quite a lot across versions. The most up-to-date one is `["\n\n", "Question:"]`. See here
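For illustration, here is a minimal sketch of how stop-sequence truncation typically cuts a generation (the helper below is an assumption for illustration, not the harness's actual code). With `\n\n` as a stop sequence, a chat-style preamble like the one reported above swallows the rest of the answer:

```python
def truncate_at_stop_sequences(generation: str, stop_sequences: list[str]) -> str:
    # Cut the generation at the earliest occurrence of any stop sequence.
    cut = len(generation)
    for stop in stop_sequences:
        idx = generation.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    return generation[:cut]

# The "\n\n" after the chat-style preamble truncates the response before
# the model ever reaches its "#### {answer}" line:
print(truncate_at_stop_sequences(
    "Sure, here is the solution\n\nThe pets have 48 legs.\n#### 48",
    ["\n\n", "Question:"],
))  # -> "Sure, here is the solution"
```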