🚀 Feature Request
Support DPO (Direct Preference Optimization) loss and data loader.
Motivation
Many recent open LLMs have achieved promising results using DPO instead of RL-style tuning such as PPO for alignment, and it appears to require fewer changes to llm-foundry than full RLHF. A sketch of the loss follows below.
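For context, the core DPO objective from Rafailov et al. (2023) is compact, which is part of why it needs fewer moving parts than PPO. A minimal PyTorch sketch (the function name, signature, and default beta here are illustrative, not a proposed llm-foundry API):

```python
import torch
import torch.nn.functional as F

def dpo_loss(
    policy_chosen_logps: torch.Tensor,    # log p_theta(y_chosen | x), shape (batch,)
    policy_rejected_logps: torch.Tensor,  # log p_theta(y_rejected | x), shape (batch,)
    ref_chosen_logps: torch.Tensor,       # log p_ref(y_chosen | x), shape (batch,)
    ref_rejected_logps: torch.Tensor,     # log p_ref(y_rejected | x), shape (batch,)
    beta: float = 0.1,                    # KL-penalty strength from the DPO paper
) -> torch.Tensor:
    # DPO loss: -log sigmoid(beta * (policy log-ratio - reference log-ratio)),
    # where each log-ratio is log p(y_chosen | x) - log p(y_rejected | x).
    policy_logratio = policy_chosen_logps - policy_rejected_logps
    ref_logratio = ref_chosen_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (policy_logratio - ref_logratio)).mean()
```

The data loader would only need to yield paired (chosen, rejected) completions per prompt; no reward model or rollout machinery is required, since the frozen reference model's log-probabilities can be computed in a single forward pass.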