microsoft / LMOps

General technology for enabling AI capabilities w/ LLMs and MLLMs
https://aka.ms/GeneralAI
MIT License

[MiniLLM] Got different results from validation during training and testing after training using the same data #205

Closed: songmzhang closed this issue 2 months ago

songmzhang commented 2 months ago

Hi, I just found that when I used the validation set for testing (i.e., running run_eval.sh), I got a lower Rouge-L score than the one reported during validation in training. For example, gpt2-base SFT got 25 during validation vs. 24 during testing. The same happens for other test sets. So, is there any difference between these two processes in your code?

songmzhang commented 2 months ago

Update: I just found that the difference is in the GenerationConfig: no_repeat_ngram_size is set to 6 (by default) in testing (https://github.com/microsoft/LMOps/blob/daf972124f0699af18acee85473fece80fb405c2/minillm/evaluate_main.py#L57) but is not set in validation.

So is it necessary to use this parameter? It may lower the final Rouge-L score by over 1 point.
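
For reference, here is a minimal sketch of the difference I mean, assuming the standard Hugging Face transformers generation API (the checkpoint, prompt, and all decoding parameters other than no_repeat_ngram_size are just placeholders):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # placeholder checkpoint
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("Summarize: the quick brown fox ...", return_tensors="pt")  # placeholder prompt

# Testing path: evaluate_main.py sets no_repeat_ngram_size=6, which blocks any repeated 6-gram.
test_config = GenerationConfig(max_new_tokens=128, do_sample=False, no_repeat_ngram_size=6)

# Validation path during training: the same decoding, but the n-gram constraint is simply not passed.
valid_config = GenerationConfig(max_new_tokens=128, do_sample=False)

test_out = model.generate(**inputs, generation_config=test_config)
valid_out = model.generate(**inputs, generation_config=valid_config)

# The two decodes can diverge, which is enough to shift the Rouge-L score between the two runs.
print(tokenizer.decode(test_out[0], skip_special_tokens=True))
print(tokenizer.decode(valid_out[0], skip_special_tokens=True))
```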

t1101675 commented 2 months ago

We found that the model sometimes outputs repeated content after SFT, so we use this parameter in evaluation. In validation, since we only compare checkpoints from different steps, we did not use this parameter in order to keep the process simple.
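
If you want the validation and testing numbers to be directly comparable, one option is to pass the same GenerationConfig in both places. A rough sketch, assuming the transformers API (the helper name and the decoding values other than no_repeat_ngram_size are only illustrative):

```python
from transformers import GenerationConfig

# Hypothetical helper: build one shared config so validation and final evaluation
# decode under the same constraints and the Rouge-L numbers stay comparable.
def build_generation_config(block_repeats: bool = True) -> GenerationConfig:
    kwargs = {"max_new_tokens": 128, "do_sample": False}  # illustrative defaults
    if block_repeats:
        kwargs["no_repeat_ngram_size"] = 6  # same value used on the testing path
    return GenerationConfig(**kwargs)

# Use the same object in both the validation loop and evaluate_main.py.
shared_config = build_generation_config(block_repeats=True)
```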

songmzhang commented 2 months ago

> We found that the model sometimes outputs repeated content after SFT, so we use this parameter in evaluation. In validation, since we only compare checkpoints from different steps, we did not use this parameter in order to keep the process simple.

Thanks for your reply!