princeton-nlp / SimPO

SimPO: Simple Preference Optimization with a Reference-Free Reward

Difference with changing the gradient accumulation - ZeroEval and AlpacaEval 2 #61

Open sahsaeedi opened 1 month ago

sahsaeedi commented 1 month ago

Hi,

I fine-tuned LLaMA-3-8B-Instruct on "llama3-ultrafeedback-armorm" with different gradient accumulation settings (the other hyperparameters are the same as in llama-3-8b-instruct-simpo-v2.yaml). For fine-tuning, I used 4 A100 80GB GPUs.
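For clarity, the only hyperparameters I varied were the batch-size-related ones; everything else follows llama-3-8b-instruct-simpo-v2.yaml. A rough sketch of the knobs involved (using the standard Hugging Face `TrainingArguments` names, which the repo's YAML configs mirror; this is illustrative, not my actual training script):

```python
from transformers import TrainingArguments

# Sketch of the only settings varied across runs; all other hyperparameters
# are taken unchanged from llama-3-8b-instruct-simpo-v2.yaml.
args = TrainingArguments(
    output_dir="outputs/llama3-8b-instruct-simpo",  # placeholder path
    per_device_train_batch_size=2,   # tried 2 and 4
    gradient_accumulation_steps=16,  # tried 16 and 128
)
```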

The results on AlpacaEval 2 (I used your config for evaluation):

| Gradient Acc. | Batch Size per Device | LC | WR |
| --- | --- | --- | --- |
| 16 | 2 | 50.34 | 47.4 |
| 128 | 2 | 34.78 | 31.99 |
| 16 | 4 | 39.16 | 44.97 |

The results on MMLU-Redux and GSM8K (ZeroEval):

| Gradient Acc. | Batch Size per Device | MMLU-Redux | GSM8K |
| --- | --- | --- | --- |
| 16 | 2 | 43.38 | 58 |
| 128 | 2 | 62.38 | 79.68 |
| 16 | 4 | 62.1 | 78.85 |

The model's ability to reason its way to the correct answer improves with a larger gradient accumulation, but its performance on AlpacaEval 2 decreases. Given this, how can we conclude that SimPO is better than other methods?

I think AlpacaEval 2 just evaluates the style of the answer, which is not a good way to compare the two models.

xiamengzhou commented 3 weeks ago

Hi @sahsaeedi , thanks for using SimPO and reporting your results back to us! If I understand it correctly,

- Gradient Acc: 16, Batch Size per Device: 2 → Total Batch Size: 128
- Gradient Acc: 128, Batch Size per Device: 2 → Total Batch Size: 1024
- Gradient Acc: 16, Batch Size per Device: 4 → Total Batch Size: 256
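Just to make the arithmetic explicit, the total (effective) batch size here is per-device batch size × gradient accumulation steps × number of GPUs (4 A100s in your setup); a quick sanity check:

```python
def effective_batch_size(per_device_batch_size: int, grad_accum_steps: int, num_gpus: int = 4) -> int:
    # Total number of examples contributing to one optimizer step.
    return per_device_batch_size * grad_accum_steps * num_gpus

print(effective_batch_size(2, 16))   # 128
print(effective_batch_size(2, 128))  # 1024
print(effective_batch_size(4, 16))   # 256
```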

It seems that with a larger effective batch size, chat abilities are not trained as well. We've seen a similar phenomenon.

In our experiments, to ensure a fair comparison, we keep the batch size the same across different methods. Ideally, we would perform a grid search over batch sizes for each method, but preliminary results indicate that the trend remains consistent as long as we use the same SFT model for the different algorithms.

You've observed that chat ability appears to be at odds with ZeroEval performance, which we've also openly discussed in our repo README. We've identified this issue specifically with the Llama-3-Instruct models, which are prone to catastrophic forgetting. With the Gemma models, however, we find this problem is significantly reduced, and preference-optimization training actually improves chat scores without compromising ZeroEval results. You can find more details in that section of the README. For further study of continued training on top of instruction-tuned models, I'd suggest using the Gemma models.

In summary, I believe your findings highlight a combination of two factors:

> I think AlpacaEval 2 just evaluates the style of the answer, which is not a good way to compare the two models.

Yes, and that's why we have also done thorough evaluations with Arena-Hard and WildBench, two much more challenging benchmarks for evaluating the chat abilities of models, and we find their trends largely consistent with AlpacaEval 2.

Please let me know if this clears up any confusion!

sahsaeedi commented 2 weeks ago

Hi @xiamengzhou, thanks for addressing my concerns.

Your main improvement is on AlpacaEval 2. The difference between DPO and SimPO on MT-Bench is less than 0.1, and on Arena-Hard it is less than 0.5%. So the question still arises: is SimPO the SOTA method, or is it DPO?
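For context, here is my understanding of the core difference between the two objectives being compared (a minimal sketch following the papers' formulations, not this repo's actual implementation; `beta` and `gamma` values are placeholders, and the inputs are assumed to be tensors of summed token log-probabilities):

```python
import torch.nn.functional as F

def dpo_loss(pi_chosen_logp, pi_rejected_logp, ref_chosen_logp, ref_rejected_logp, beta=0.1):
    # DPO: the implicit reward is the log-ratio between the policy and a frozen reference model.
    margin = (pi_chosen_logp - ref_chosen_logp) - (pi_rejected_logp - ref_rejected_logp)
    return -F.logsigmoid(beta * margin).mean()

def simpo_loss(pi_chosen_logp, pi_rejected_logp, chosen_len, rejected_len, beta=2.0, gamma=1.0):
    # SimPO: reference-free reward = length-normalized policy log-probability,
    # with a target reward margin gamma between chosen and rejected responses.
    margin = pi_chosen_logp / chosen_len - pi_rejected_logp / rejected_len
    return -F.logsigmoid(beta * margin - gamma).mean()
```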

xiamengzhou commented 2 weeks ago

@sahsaeedi, thank you for your question.

We've tested various settings to ensure a fair comparison between DPO and SimPO. Here's what we've observed:

Based on these observations, I am confident in stating that:

We should have conveyed this more clearly in our materials, such as the GitHub repository, preprint, and Twitter posts! Let me know if this answers your question; I'm happy to discuss further.

yumeng5 commented 2 weeks ago

Hi @sahsaeedi

In addition to Mengzhou's answers above, I also wanted to mention potential issues with the evaluation metrics of MT-Bench and Arena-Hard:

Best, Yu