princeton-nlp / SimPO

SimPO: Simple Preference Optimization with a Reference-Free Reward

Difference with changing the gradient accumulation - ZeroEval and AlpacaEval 2 #61

Open sahsaeedi opened 1 month ago

sahsaeedi commented 1 month ago

Hi,

I fine-tuned LLaMA-3-8B-Instruct on "llama3-ultrafeedback-armorm" with different gradient accumulation settings (the other hyperparameters are the same as in llama-3-8b-instruct-simpo-v2.yaml). For fine-tuning, I used 4 A100 80GB GPUs.
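For clarity, the only hyperparameters I varied were the batch-size-related ones; everything else follows llama-3-8b-instruct-simpo-v2.yaml. A rough sketch of the knobs involved (using the standard Hugging Face `TrainingArguments` names, which the repo's YAML configs mirror; this is illustrative, not my actual training script):

```python
from transformers import TrainingArguments

# Sketch of the only settings varied across runs; all other hyperparameters
# are taken unchanged from llama-3-8b-instruct-simpo-v2.yaml.
args = TrainingArguments(
    output_dir="outputs/llama3-8b-instruct-simpo",  # placeholder path
    per_device_train_batch_size=2,   # tried 2 and 4
    gradient_accumulation_steps=16,  # tried 16 and 128
)
```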

The results on AlpacaEval 2 (I used your config for evaluation):

| Gradient Acc. | Batch Size per Device | LC | WR |
| --- | --- | --- | --- |
| 16 | 2 | 50.34 | 47.4 |
| 128 | 2 | 34.78 | 31.99 |
| 16 | 4 | 39.16 | 44.97 |

The results on MMLU-Redux and GSM8K (ZeroEval):

| Gradient Acc. | Batch Size per Device | MMLU-Redux | GSM8K |
| --- | --- | --- | --- |
| 16 | 2 | 43.38 | 58 |
| 128 | 2 | 62.38 | 79.68 |
| 16 | 4 | 62.1 | 78.85 |

The model's ability to reason its way to the correct answer improves with a larger gradient accumulation, but its performance on AlpacaEval 2 decreases. Given this, how can we conclude that SimPO is better than other methods?

I think AlpacaEval 2 just evaluates the style of the answer, which is not a good way to compare the two models.

xiamengzhou commented 3 weeks ago

Hi @sahsaeedi , thanks for using SimPO and reporting your results back to us! If I understand it correctly,

- Gradient Acc: 16, Batch Size per Device: 2 → Total Batch Size: 128
- Gradient Acc: 128, Batch Size per Device: 2 → Total Batch Size: 1024
- Gradient Acc: 16, Batch Size per Device: 4 → Total Batch Size: 256
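Just to make the arithmetic explicit, the total (effective) batch size here is per-device batch size × gradient accumulation steps × number of GPUs (4 A100s in your setup); a quick sanity check:

```python
def effective_batch_size(per_device_batch_size: int, grad_accum_steps: int, num_gpus: int = 4) -> int:
    # Total number of examples contributing to one optimizer step.
    return per_device_batch_size * grad_accum_steps * num_gpus

print(effective_batch_size(2, 16))   # 128
print(effective_batch_size(2, 128))  # 1024
print(effective_batch_size(4, 16))   # 256
```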

It seems that with a larger effective batch size, chat abilities are not trained as well. We've seen a similar phenomenon.

In our experiments, to ensure a fair comparison, we keep the batch size the same across different methods. Ideally, we would perform a grid search over batch sizes for each method, but preliminary results indicate that the trend remains consistent as long as we use the same SFT model for the different algorithms.

You've observed that chat ability appears to be at odds with ZeroEval performance, which we've also openly discussed in our repo README. We've identified this issue specifically with the Llama-3-Instruct models, which are prone to catastrophic forgetting. With the Gemma models, however, we find this problem is significantly reduced, and preference-optimization training actually improves chat scores without compromising ZeroEval results. You can find more details in that section of the README. For further study of continued training on top of instruction-tuned models, I'd suggest using the Gemma models.

In summary, I believe your findings highlight a combination of two factors:

> I think AlpacaEval 2 just evaluates the style of the answer, which is not a good way to compare the two models.

Yes, and that's why we have also done thorough evaluations with Arena-Hard and WildBench, two much more challenging benchmarks for evaluating the chat abilities of models, and we find their trends largely consistent with AlpacaEval 2.

Please let me know if this clears up any confusion!

sahsaeedi commented 2 weeks ago

Hi @xiamengzhou, thanks for addressing my concerns.

Your main improvement is on AlpacaEval 2. The difference between DPO and SimPO on MT-Bench is less than 0.1, and on Arena-Hard it is less than 0.5%. So the question still arises: is SimPO the SOTA method, or is it DPO?
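For context, here is my understanding of the core difference between the two objectives being compared (a minimal sketch following the papers' formulations, not this repo's actual implementation; `beta` and `gamma` values are placeholders, and the inputs are assumed to be tensors of summed token log-probabilities):

```python
import torch.nn.functional as F

def dpo_loss(pi_chosen_logp, pi_rejected_logp, ref_chosen_logp, ref_rejected_logp, beta=0.1):
    # DPO: the implicit reward is the log-ratio between the policy and a frozen reference model.
    margin = (pi_chosen_logp - ref_chosen_logp) - (pi_rejected_logp - ref_rejected_logp)
    return -F.logsigmoid(beta * margin).mean()

def simpo_loss(pi_chosen_logp, pi_rejected_logp, chosen_len, rejected_len, beta=2.0, gamma=1.0):
    # SimPO: reference-free reward = length-normalized policy log-probability,
    # with a target reward margin gamma between chosen and rejected responses.
    margin = pi_chosen_logp / chosen_len - pi_rejected_logp / rejected_len
    return -F.logsigmoid(beta * margin - gamma).mean()
```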

xiamengzhou commented 2 weeks ago

@sahsaeedi, thank you for your question.

We've tested various settings to ensure a fair comparison between DPO and SimPO. Here's what we've observed:

Based on these observations, I am confident in stating that:

We should have conveyed this more clearly in our materials, such as the GitHub repository, preprint, and Twitter posts! Let me know if this answers your question; I'm happy to discuss further.

yumeng5 commented 2 weeks ago

Hi @sahsaeedi

In addition to Mengzhou's answers above, I also wanted to mention potential issues with the evaluation metrics of MT-Bench and Arena-Hard:

Best, Yu