Open · yinyueqin opened this issue 8 months ago
It could be related to the version of lm-evaluation-harness. For more details, see https://github.com/uclaml/SPIN/issues/12#issuecomment-1960974723.
Additionally, after updating to the SFT checkpoint from https://huggingface.co/alignment-handbook/zephyr-7b-sft-full, the relative improvement between iteration 0 and iteration 1 appears to be marginal. Are there any newly recommended parameter settings?
I use lm-evaluation-harness v0.4.0 for evaluation, which is consistent with the version used by the authors. In addition, the results shown above were obtained with num_train_epochs=6 during training.
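For clarity, here is one way the installed harness version could be verified before comparing numbers; treat the distribution name `lm_eval` as an assumption about how the package was installed:

```python
# Hedged sketch: verify that the installed lm-evaluation-harness matches
# the v0.4.0 release reported above. The distribution name "lm_eval" is
# an assumption about the packaging.
from importlib.metadata import version

installed = version("lm_eval")
assert installed == "0.4.0", f"expected lm-evaluation-harness 0.4.0, found {installed}"
print(f"lm-evaluation-harness version: {installed}")
```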
Hi @yinyueqin, have you managed to reproduce the performance? I cannot reproduce it either.
Hi,
Thank you for your work. We are re-running the experiments with the updated SFT checkpoint from https://huggingface.co/alignment-handbook/zephyr-7b-sft-full and evaluating with lm-evaluation-harness v0.4.0. We've noticed a significant performance drop on GSM8k. We trained the model for 6 epochs in each iteration. Have you observed this issue, or do you have any insights into potential causes?
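For reference, a minimal sketch of how the GSM8k evaluation might be invoked through the v0.4.0 Python API; the `dtype`, batch size, and few-shot count here are assumptions, not the authors' confirmed settings:

```python
# Minimal sketch of a GSM8k run with lm-evaluation-harness v0.4.0.
# The 5-shot setting and batch size are assumptions; swap in the SPIN
# iteration checkpoint under comparison for `pretrained=...`.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",  # HuggingFace transformers backend
    model_args="pretrained=alignment-handbook/zephyr-7b-sft-full,dtype=bfloat16",
    tasks=["gsm8k"],
    num_fewshot=5,  # GSM8k is commonly reported 5-shot (assumed here)
    batch_size=8,   # assumed
)
print(results["results"]["gsm8k"])
```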