Closed by gaetanlop 8 months ago
Looks interesting - would be a cool addition indeed!
Cool, I will work on this soon
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Let's keep it open :)
P3O (Pairwise Proximal Policy Optimization) is a recent paper from Berkeley:
It introduces a new way to align LLMs with human preferences. The loss function is particularly cool because it operates directly on comparative rewards, i.e. the difference between the rewards of two responses to the same prompt. The authors report that it outperforms Direct Preference Optimization (DPO) and Proximal Policy Optimization (PPO) on the KL-reward trade-off and in GPT-4 evaluation.
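To make the "comparative rewards" point concrete, here is a minimal sketch of a pairwise policy-gradient style surrogate. It is a simplified illustration, not the paper's exact objective (it leaves out importance weighting and clipping), and it is not an existing trl API. The function name `pairwise_pg_surrogate` and its arguments are hypothetical.

```python
from typing import Sequence


def pairwise_pg_surrogate(
    logp_a: Sequence[float],
    logp_b: Sequence[float],
    reward_a: Sequence[float],
    reward_b: Sequence[float],
) -> float:
    """Simplified pairwise surrogate loss (hypothetical helper, not trl API).

    For each prompt, two responses (a, b) are sampled. The loss only uses the
    reward *difference*, weighted by the difference in sequence log-probs:
        loss = mean( -0.5 * (r_a - r_b) * (logp_a - logp_b) )
    Assumption: no importance weights and no clipping, unlike the full method.
    """
    n = len(logp_a)
    if n == 0 or not (n == len(logp_b) == len(reward_a) == len(reward_b)):
        raise ValueError("all inputs must be non-empty and of equal length")

    total = 0.0
    for la, lb, ra, rb in zip(logp_a, logp_b, reward_a, reward_b):
        # Minimizing this term raises the log-prob of the better response
        # relative to the worse one, scaled by how much better it is.
        total += -0.5 * (ra - rb) * (la - lb)
    return total / n


# Toy pairs: sequence log-probs and scalar rewards for two responses per prompt.
logp_a, logp_b = [-1.0, -3.0], [-2.0, -2.5]
reward_a, reward_b = [1.0, 0.2], [0.0, 0.9]

base = pairwise_pg_surrogate(logp_a, logp_b, reward_a, reward_b)
# Adding the same constant to every reward leaves the loss unchanged,
# because only the difference r_a - r_b enters the computation.
shifted = pairwise_pg_surrogate(
    logp_a, logp_b, [r + 10.0 for r in reward_a], [r + 10.0 for r in reward_b]
)
```

Since only reward differences enter the loss, shifting all rewards by a constant does not change it, which is the property that makes a loss on comparative rewards attractive when the reward model is trained on comparisons.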
What do you think of adding it to trl? @younesbelkada @lvwerra
If you are interested, I can work on this.