huggingface / trl

Train transformer language models with reinforcement learning.
http://hf.co/docs/trl
Apache License 2.0

Difference between RewardTrainer and DPOTrainer? When to use each over the other? #1106

Closed pradeepdev-1995 closed 9 months ago

pradeepdev-1995 commented 10 months ago

What is the difference between RewardTrainer and DPOTrainer, and when should each be used over the other?

lvwerra commented 10 months ago

They serve completely different purposes:

- RewardTrainer trains a reward model: a classifier that scores responses, which can then be used in a subsequent RL step such as PPO.
- DPOTrainer trains the language model itself directly on the preference pairs, without a separate reward model or RL loop.

Hope this helps.
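For concreteness, below is a minimal sketch of how the two trainers are typically wired up. It assumes the TRL API around the time of this issue (~v0.7); argument names have shifted in newer releases. `gpt2` is only a placeholder model and the two-row dataset is purely illustrative.

```python
from datasets import Dataset
from transformers import (
    AutoModelForCausalLM,
    AutoModelForSequenceClassification,
    AutoTokenizer,
    TrainingArguments,
)
from trl import DPOTrainer, RewardTrainer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token

# Toy preference data: each row has a prompt plus a preferred and a rejected completion.
pairs = Dataset.from_dict({
    "prompt": ["How do I boil an egg?", "Explain gravity briefly."],
    "chosen": ["Place the egg in boiling water for 7-9 minutes.", "Gravity is the attraction between masses."],
    "rejected": ["Eggs cannot be boiled.", "Gravity is a myth."],
})

# --- RewardTrainer: trains a classifier that scores responses ---------------
def tokenize_pair(row):
    chosen = tokenizer(row["prompt"] + " " + row["chosen"], truncation=True)
    rejected = tokenizer(row["prompt"] + " " + row["rejected"], truncation=True)
    return {
        "input_ids_chosen": chosen["input_ids"],
        "attention_mask_chosen": chosen["attention_mask"],
        "input_ids_rejected": rejected["input_ids"],
        "attention_mask_rejected": rejected["attention_mask"],
    }

reward_model = AutoModelForSequenceClassification.from_pretrained("gpt2", num_labels=1)
reward_model.config.pad_token_id = tokenizer.pad_token_id
reward_trainer = RewardTrainer(
    model=reward_model,
    args=TrainingArguments(output_dir="reward-model", remove_unused_columns=False),
    tokenizer=tokenizer,
    train_dataset=pairs.map(tokenize_pair),
)

# --- DPOTrainer: trains the language model directly on the same pairs -------
policy = AutoModelForCausalLM.from_pretrained("gpt2")
ref = AutoModelForCausalLM.from_pretrained("gpt2")  # frozen reference copy
dpo_trainer = DPOTrainer(
    model=policy,
    ref_model=ref,
    beta=0.1,                    # strength of the implicit KL penalty
    args=TrainingArguments(output_dir="dpo-model", remove_unused_columns=False),
    tokenizer=tokenizer,
    train_dataset=pairs,         # raw prompt/chosen/rejected strings
    max_length=512,
    max_prompt_length=128,
)

# reward_trainer.train(); dpo_trainer.train()
```

Note that both trainers consume the same kind of preference pairs, but the model being optimized is different: a scoring head in the first case, the generating LM itself in the second.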

pradeepdev-1995 commented 10 months ago

But in both trainers we are passing the chosen and rejected data pairs, so both use RL, right? @lvwerra

lvwerra commented 10 months ago

No, the reward model is just a classifier, so it does not generate text, whereas DPOTrainer directly trains the language model.
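A hypothetical illustration of that difference at inference time: a reward model maps a full (prompt, response) string to one scalar score, while a DPO-trained model is still an ordinary causal LM that generates text. Model names are placeholders.

```python
import torch
from transformers import (
    AutoModelForCausalLM,
    AutoModelForSequenceClassification,
    AutoTokenizer,
)

tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Reward model: a sequence classifier with a single output -> one scalar, no text.
reward_model = AutoModelForSequenceClassification.from_pretrained("gpt2", num_labels=1)
inputs = tokenizer("How do I boil an egg? Place it in boiling water.", return_tensors="pt")
with torch.no_grad():
    score = reward_model(**inputs).logits[0, 0].item()

# DPO-trained model: a causal LM, used for generation as usual.
policy = AutoModelForCausalLM.from_pretrained("gpt2")
out = policy.generate(**tokenizer("How do I boil an egg?", return_tensors="pt"), max_new_tokens=30)
print(score, tokenizer.decode(out[0], skip_special_tokens=True))
```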

pradeepdev-1995 commented 10 months ago

@lvwerra In the official Hugging Face documentation, the RewardTrainer dataset is given as below:

https://huggingface.co/datasets/Anthropic/hh-rlhf?row=8

It is a text-generation dataset rather than a classification dataset.

lvwerra commented 10 months ago

It is a preference dataset with chosen and rejected samples. The RewardModel is trained to classify the samples into these two classes.
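The "classification" here is a pairwise ranking objective: the reward model assigns a scalar to each sample and is trained so the chosen sample scores higher than the rejected one. A sketch of that loss, which is essentially what RewardTrainer computes internally:

```python
import torch
import torch.nn.functional as F

def pairwise_reward_loss(rewards_chosen: torch.Tensor, rewards_rejected: torch.Tensor) -> torch.Tensor:
    # -log sigmoid(r_chosen - r_rejected): minimized when chosen outscores rejected.
    return -F.logsigmoid(rewards_chosen - rewards_rejected).mean()

# Example: chosen responses already score slightly higher -> small loss.
print(pairwise_reward_loss(torch.tensor([1.2, 0.3]), torch.tensor([0.4, -0.1])))
```

So a text dataset like hh-rlhf is fine: the pairs of texts are the training signal, and the model output is a score per text, not generated text.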

github-actions[bot] commented 9 months ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.