huggingface / trl

Train transformer language models with reinforcement learning.
http://hf.co/docs/trl
Apache License 2.0

Adding P3O trainer #905

Closed · gaetanlop closed this issue 8 months ago

gaetanlop commented 10 months ago

P3O (Pairwise Proximal Policy Optimization) is a method from a recent Berkeley paper.

It introduces a new way to align LLMs with human preferences. The loss function is particularly cool as it operates directly on comparative rewards, i.e. on the reward difference between two responses to the same prompt rather than on absolute reward values. The authors show that it outperforms Direct Preference Optimization (DPO) and Proximal Policy Optimization (PPO) both on the KL-reward trade-off and in GPT-4 evaluation.

[figure: results from the paper comparing P3O, DPO, and PPO on the KL-reward trade-off]
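For context, here is a minimal sketch of the kind of objective involved, assuming sequence-level log-probabilities and one scalar reward per response. The function and variable names (`pairwise_policy_loss`, `logp_w`, `logp_l`, `clip_eps`) and the exact clipping scheme are illustrative, not the paper's precise formulation:

```python
import torch

def pairwise_policy_loss(logp_w, logp_l, old_logp_w, old_logp_l,
                         reward_w, reward_l, clip_eps=0.2):
    """Sketch of a P3O-style pairwise objective (not the exact paper loss).

    All inputs are 1-D tensors over a batch of response pairs: sequence
    log-probs under the current and old policies, plus a scalar reward
    for the preferred (w) and dispreferred (l) response to each prompt.
    """
    # The loss is driven by the comparative reward, not absolute rewards.
    rel_reward = reward_w - reward_l

    # Importance ratio of the response pair under the new vs. old policy
    # (a PPO-style off-policy correction, applied jointly to the pair).
    log_ratio = (logp_w - old_logp_w) - (logp_l - old_logp_l)
    ratio = torch.exp(log_ratio)

    # PPO-style clipped surrogate, weighted by the reward difference.
    unclipped = ratio * rel_reward
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * rel_reward
    return -torch.min(unclipped, clipped).mean()
```

In a trainer this would sit where the PPO surrogate loss sits today, with two generations sampled per prompt; the paper should be consulted for its exact clipped objectives.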

What do you think of adding it to trl? @younesbelkada @lvwerra

If you are interested, I can work on this.

lvwerra commented 10 months ago

Looks interesting - would be a cool addition indeed!

gaetanlop commented 10 months ago

Cool, I will work on this soon

github-actions[bot] commented 9 months ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

lvwerra commented 9 months ago

Let's keep it open :)

github-actions[bot] commented 8 months ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.