What is the difference between RewardTrainer and DPOTrainer? When should each be used over the other?
They serve completely different purposes:

- The RewardTrainer trains a reward model: a classifier that scores chosen vs. rejected responses, which you can then use for PPO-style RLHF.
- The DPOTrainer trains the language model itself directly on the preference pairs, so no separate reward model and no RL loop are needed.

Hope this helps.
But in both trainers we are passing the accepted and rejected data pairs, so both are using RL, right? @lvwerra
No, the RewardModel is just a classifier, so it does not generate text, whereas the DPOTrainer directly trains the language model.
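To make that concrete, here is a rough, untested sketch (exact keyword arguments differ across TRL versions, and the model name, output directories, and toy data are placeholders): the RewardTrainer wraps a sequence classifier that only scores text, while the DPOTrainer wraps the causal language model and updates it directly on the same preference pairs.

```python
from datasets import Dataset
from transformers import (
    AutoModelForCausalLM,
    AutoModelForSequenceClassification,
    AutoTokenizer,
)
from trl import DPOConfig, DPOTrainer, RewardConfig, RewardTrainer

# Toy preference data: both trainers consume chosen/rejected pairs.
pairs = Dataset.from_dict({
    "prompt":   ["Explain gravity.", "What is 2 + 2?"],
    "chosen":   ["Gravity is the attraction between masses.", "2 + 2 = 4."],
    "rejected": ["Gravity is a kind of magnetism.", "2 + 2 = 5."],
})

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token

# RewardTrainer: the model is a sequence classifier (a scalar score head on top
# of the LM). The resulting reward model never generates text, it only scores it.
# (Older TRL versions expect the pairs pre-tokenized into input_ids_chosen /
# input_ids_rejected columns instead of raw text.)
reward_model = AutoModelForSequenceClassification.from_pretrained("gpt2", num_labels=1)
reward_model.config.pad_token_id = tokenizer.pad_token_id
reward_trainer = RewardTrainer(
    model=reward_model,
    args=RewardConfig(output_dir="reward-model", max_length=128),
    train_dataset=pairs,
    processing_class=tokenizer,  # older TRL versions use tokenizer= instead
)

# DPOTrainer: the model is the causal LM itself; the preference pairs update the
# policy directly, so no separate reward model and no PPO loop are needed.
policy = AutoModelForCausalLM.from_pretrained("gpt2")
dpo_trainer = DPOTrainer(
    model=policy,
    args=DPOConfig(output_dir="dpo-model", beta=0.1, max_length=128),
    train_dataset=pairs,
    processing_class=tokenizer,  # older TRL versions use tokenizer= instead
)

# reward_trainer.train(); dpo_trainer.train()
```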
@lvwerra In the official Hugging Face documentation, the RewardTrainer dataset is given as below:
https://huggingface.co/datasets/Anthropic/hh-rlhf?row=8
It is a text-generation dataset rather than a classification dataset.
It is a preference dataset with chosen and rejected samples. The RewardModel is trained to classify the samples into these two classes.
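For instance, a quick peek at the dataset linked above (a small illustrative snippet, assuming the datasets library is installed) shows the two columns the reward model is trained to rank:

```python
from datasets import load_dataset

# Each row of Anthropic/hh-rlhf holds a full "chosen" and a full "rejected"
# conversation; the reward model learns to score the chosen text higher.
row = load_dataset("Anthropic/hh-rlhf", split="train[:1]")[0]
print(row["chosen"][:300])
print(row["rejected"][:300])
```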
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.