Closed lewtun closed 4 months ago
FYI, for the 7B model I am seeing a lot of truncated responses like `Sure, here is the solution`, which also suggests we are losing candidate answers in the parsing.
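To illustrate the failure mode (a minimal sketch; `extract_answer` and its regex are hypothetical stand-ins for the harness's parser, not the actual implementation): a truncated completion never reaches the `#### {ANSWER}` line, so no candidate answer can be extracted and the sample is scored 0.

```python
import re

# Hypothetical stand-in for the harness's parser (not the actual code):
# GSM8K-style extraction looks for a "#### <number>" line in the completion.
ANSWER_RE = re.compile(r"####\s*(-?[\d,]+)")

def extract_answer(completion):
    """Return the predicted answer, or None if no '####' line is found."""
    match = ANSWER_RE.search(completion)
    if match is None:
        return None  # no candidate answer -> the sample is scored 0
    return match.group(1).replace(",", "")

full = "Natalia sold 72 clips altogether.\n#### 72"
truncated = "Sure, here is the solution"  # generation cut off early

print(extract_answer(full))       # -> 72
print(extract_answer(truncated))  # -> None: the answer line was truncated away
```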
For GSM8K, we constrained the evaluation to match the harness we are launching on the leaderboard, so changing the prompt would break that parity. However, maybe we could add an "instruct" parameter?
It's fascinating that instruct models get worse at following few-shot formatting!
Edit: I checked, and the truncation used in the harness has evolved since the version above. I'm going to edit the allowed EOS tokens to fix this.
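For context on why the stop criterion matters (a minimal sketch; `truncate_at_stop` and the stop strings below are hypothetical, not the harness's implementation): if a stop sequence fires too early, the `#### {ANSWER}` line is cut away before parsing ever sees it.

```python
# Hypothetical sketch of stop-sequence truncation (not the harness's code):
# cut the generation at the first occurrence of any stop string.
def truncate_at_stop(text, stops):
    cut = len(text)
    for stop in stops:
        idx = text.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    return text[:cut]

gen = "She sold 72 clips.\n\n#### 72\nQuestion: next problem..."

# An aggressive "\n\n" stop fires before the answer line, so parsing finds nothing:
print(repr(truncate_at_stop(gen, ["\n\n", "Question:"])))
# A later stop keeps the #### line intact:
print(repr(truncate_at_stop(gen, ["Question:"])))
```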
I noticed that the instruct version of `gemma-2b` gets anomalously small values on GSM8K. Here's the command I'm running, with `--use_chat_template`:

and without `--use_chat_template`:

For reference, the base model gets ~0.174, which is far better.
I think part of the problem is that GSM8K expects the answer to be formatted as

#### {ANSWER}

and the instruct models are quite inconsistent in this respect because they haven't been told to do so. Here's an instructive example where the model produces the correct answer, but would be scored 0 because it didn't predict `#### {ANSWER}`:

Perhaps one solution would be to format the input like GPQA does:
You can see in this example that the 7B instruct model formats the answer correctly: https://hf.co/chat/r/ltNE54h
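For what it's worth, an explicit format instruction of that kind might look like the following (hypothetical wording and a hypothetical `build_prompt` helper, not GPQA's actual template): the required answer format is stated in the instruction rather than left implicit in the few-shot examples.

```python
# Hypothetical prompt builder (not GPQA's actual template): state the
# required answer format in the instruction instead of relying on the
# few-shot examples alone.
def build_prompt(question):
    return (
        f"Question: {question}\n"
        "Think step by step, then end your response with a line of the form "
        "'#### <answer>'.\n"
        "Answer:"
    )

print(build_prompt("How many clips did Natalia sell altogether?"))
```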