huggingface / trl

Train transformer language models with reinforcement learning.
http://hf.co/docs/trl
Apache License 2.0

Feature Request: Self-Improving Robust Preference Optimization (SRPO) #1714

Open duyvuleo opened 5 months ago

duyvuleo commented 5 months ago

Hi,

This new paper (https://arxiv.org/pdf/2406.01660v2) looks very compelling.

For offline RLHF, SRPO appears to outperform DPO on OOD tasks.

Is there a plan to implement this in TRL?

I could not find an SRPO implementation on GitHub yet.

Thanks!

Trangle commented 5 months ago


The main work is in the sample construction, which changes from the original pairs (x, yl) and (x, yw) to (x+yl, yl), (x+yl, yw), (x+yw, yl), and (x+yw, yw). This significantly increases sample length, the number of forward passes, and the computational cost. In addition, although the description of the policy stays the same, the actual input distribution changes: at inference time, when we want a better output, we only know x, so this effectively amounts to running multiple rounds of generation. To preserve the original distribution while also introducing the new one, we would need to design at least two stages, or even N stages (the method does allow multiple rounds of self-revision). Addressing this by introducing a reward model seems more concise.
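
For concreteness, here is a minimal sketch of the sample expansion described above, assuming a DPO-style dataset with `prompt` / `chosen` / `rejected` fields. The field names, separator, and helper function are hypothetical illustrations, not an existing TRL API:

```python
def build_srpo_pairs(example, sep="\n"):
    """Expand one preference example (x, y_w, y_l) into the four conditioned
    pairs (x+y_l, y_l), (x+y_l, y_w), (x+y_w, y_l), (x+y_w, y_w) described
    above, where the prompt is concatenated with a previous completion."""
    x, y_w, y_l = example["prompt"], example["chosen"], example["rejected"]
    pairs = []
    for context_completion in (y_l, y_w):  # completion appended to the prompt
        for target in (y_l, y_w):          # completion the model is scored on
            pairs.append({
                "prompt": x + sep + context_completion,
                "completion": target,
                "is_chosen": target == y_w,
            })
    return pairs

# One preference pair becomes four longer training samples, which is where
# the extra sequence length and forward passes come from.
example = {"prompt": "Q: ...", "chosen": "good answer", "rejected": "bad answer"}
print(len(build_srpo_pairs(example)))  # 4
```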

frasermince commented 5 months ago

I'm planning to attempt to add this to TRL. Hope to have a PR ready relatively soon!