hkust-nlp / deita

Deita: Data-Efficient Instruction Tuning for Alignment [ICLR2024]
Apache License 2.0

Questions about performance improvement in Open LLM leaderboard #21

minstar commented 4 months ago

Hi, first of all, thank you for sharing your wonderful work!

I was searching for efficient ways of mining instructions for instruction-tuning LLMs. While reading the manuscript and investigating the open-sourced 6k & 10k datasets you provide, I could not intuitively understand why the SFT (6k) + DPO (10k) training method improves performance on multiple-choice question answering tasks such as ARC-Challenge and MMLU.

In the dataset, the instances are conversations between humans and GPT, which contain no obvious clues about solving multiple-choice QA problems.

Do you have any idea why this works?

VPeterV commented 4 months ago

Hi, thanks for your interest!

This question is indeed interesting. We have a couple of speculations that might shed some light:

  1. Our top-performing model, trained with SFT (6k) followed by DPO (10k), starts from an intermediate SFT checkpoint, which serves as the basis for further DPO training. Our hypothesis is that an over-optimized SFT stage might impair the inherent capabilities of the LLM. Therefore, taking a sub-optimal SFT checkpoint and then applying DPO training, which is specifically designed for alignment, appears to improve both academic benchmarks such as the OpenLLM leaderboard and alignment capabilities. A similar finding is reported for Zephyr [1, 2].

  2. We observe that some questions the model answers incorrectly can be rectified through multiple sampling attempts, using strategies like majority voting or re-ranking. This indicates that the model has the potential to answer correctly but struggles to do so consistently. Preference-optimization methods such as DPO can shift the model's output distribution, increasing the likelihood of producing the correct answer in a single attempt [3, Section 5] (see the sketches after this list).
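To make the second point concrete, here is a minimal, hypothetical sketch of multi-sample majority voting on a single multiple-choice question. The model name, prompt format, and letter-extraction heuristic are placeholder assumptions, not the evaluation setup used for the OpenLLM leaderboard:

```python
# Hypothetical sketch: single-sample vs. majority-vote answering on a multiple-choice
# question. Model name, prompt, and the letter-extraction heuristic are illustrative only.
import re
from collections import Counter

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "HuggingFaceH4/zephyr-7b-beta"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto"
)

prompt = (
    "Question: Which planet is known as the Red Planet?\n"
    "A. Venus\nB. Mars\nC. Jupiter\nD. Saturn\n"
    "Answer with a single letter.\nAnswer:"
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

def extract_choice(text):
    """Pull the first standalone A-D letter out of the generated continuation."""
    match = re.search(r"\b([ABCD])\b", text)
    return match.group(1) if match else None

# Sample k completions; a model that is "right but inconsistent" can recover the
# correct letter via majority vote even when a single sample misses it.
votes = []
for _ in range(8):
    output = model.generate(**inputs, max_new_tokens=5, do_sample=True, temperature=0.8)
    continuation = tokenizer.decode(
        output[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )
    choice = extract_choice(continuation)
    if choice:
        votes.append(choice)

print("votes:", votes)
print("majority answer:", Counter(votes).most_common(1)[0][0] if votes else None)
```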
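And here is a tiny sketch of the DPO objective itself, using made-up sequence log-likelihoods, to illustrate how minimizing it widens the margin between a preferred (correct) answer and a dispreferred one; the numbers and the beta value are purely illustrative:

```python
# Minimal sketch of the DPO loss (Rafailov et al., 2023). The log-probabilities below
# are made-up scalars standing in for sequence log-likelihoods under the current
# policy and the frozen SFT reference model.
import torch
import torch.nn.functional as F

beta = 0.1  # illustrative strength of the regularization toward the reference model

# Log-likelihoods for a preferred answer y_w (e.g., the correct choice)
# and a dispreferred answer y_l, under the policy and the reference model.
policy_chosen_logps = torch.tensor([-12.0])
policy_rejected_logps = torch.tensor([-11.5])
ref_chosen_logps = torch.tensor([-12.5])
ref_rejected_logps = torch.tensor([-11.0])

# Implicit reward: how much more the policy prefers each answer than the reference does.
chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)

loss = -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
print("DPO loss:", loss.item())
# Minimizing this loss widens the chosen-vs-rejected margin, i.e. it shifts the model
# toward emitting the preferred (here: correct) answer in a single attempt.
```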

References

minstar commented 4 months ago

Thanks for sharing your insights and thoughts on my question!

I also agree with the second point, that the model has the potential to answer correctly but does not do so consistently. However, I still have a hard time interpreting what exactly DPO enhances through preference alignment.

VPeterV commented 4 months ago

A potential explanation might be the presence of STEM-related samples within the UltraFeedback dataset.
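If it helps, a rough, hypothetical way to probe that speculation is a crude keyword scan over the dataset; the column names and keyword list below are assumptions for illustration, not an official taxonomy of UltraFeedback:

```python
# Hypothetical sanity check: roughly count STEM-flavored prompts in UltraFeedback.
# The "instruction" column and the keyword list are assumptions for illustration.
from datasets import load_dataset

ds = load_dataset("openbmb/UltraFeedback", split="train")

stem_keywords = ("math", "physics", "chemistry", "biology", "equation", "calculate", "prove")

def looks_stem(example):
    """Very crude heuristic: flag an instruction if it mentions any STEM keyword."""
    text = example["instruction"].lower()
    return any(kw in text for kw in stem_keywords)

stem_count = sum(1 for ex in ds if looks_stem(ex))
print(f"~{stem_count} / {len(ds)} instructions match the crude STEM keyword filter")
```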