Update: I just found that the difference is in `GenerationConfig`: `no_repeat_ngram_size` is set to 6 (by default) in testing (https://github.com/microsoft/LMOps/blob/daf972124f0699af18acee85473fece80fb405c2/minillm/evaluate_main.py#L57) but is not set in validation.
So, is it necessary to use this parameter? It can lower the final Rouge-L score by more than 1 point.
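For anyone checking this on their own checkpoint, here is a minimal sketch of the two settings side by side (assuming the standard `transformers` API; `gpt2` and the prompt are placeholders, not the exact setup in this repo):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig

# "gpt2" is a placeholder; substitute your own SFT checkpoint path.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Testing (evaluate_main.py): any 6-gram is blocked from repeating.
test_config = GenerationConfig(max_new_tokens=128, no_repeat_ngram_size=6)

# Validation during training: otherwise identical, but no n-gram blocking.
valid_config = GenerationConfig(max_new_tokens=128)

inputs = tokenizer("Instruction: summarize the text below.\n...", return_tensors="pt")
test_out = model.generate(**inputs, generation_config=test_config)
valid_out = model.generate(**inputs, generation_config=valid_config)

# The two decodings diverge whenever the model would otherwise repeat a
# 6-gram, which is what shifts Rouge-L between validation and testing.
print(tokenizer.decode(test_out[0], skip_special_tokens=True))
print(tokenizer.decode(valid_out[0], skip_special_tokens=True))
```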
We found that the model sometimes produces repeated output after SFT, so we use this parameter in evaluation. In validation, since we only compare checkpoints from different training steps, we leave it unset to keep the process simple.
Thanks for your reply!
Hi, I just found that when I used the validation set for testing (i.e., running `run_eval.sh`), I got a lower Rouge-L score than the one reported in validation during training. For example, `gpt2-base` SFT got 25 in validation vs. 24 in testing. The same happens on the other test sets. Is there any difference between these two processes in your code?