Open duyvuleo opened 5 months ago
The main change is in sample construction: instead of the original estimation over (x, yl) + (x, yw), it now estimates over (x+yl, yl) + (x+yl, yw) + (x+yw, yl) + (x+yw, yw), which significantly increases sample length, the number of forward passes, and overall computational cost. In addition, although the stated modeling strategy stays the same, the actual input distribution changes: at inference time, when we want a better output, we only know x. This effectively amounts to multiple rounds of estimation, so to preserve the original distribution while introducing the new one we would need to design at least two stages, or even N stages (the method does allow multiple rounds of generation). Handling this by introducing a reward model seems more concise. A rough illustration of the sample construction is sketched below.
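As a minimal sketch (not the paper's reference code, and not an existing TRL API) of the sample construction described above, assuming a simple prompt/completion string format; the function names `build_dpo_samples` / `build_srpo_samples` and the way a previous completion is appended to the prompt are hypothetical:

```python
from dataclasses import dataclass
from itertools import product
from typing import List, Tuple


@dataclass
class PreferenceExample:
    prompt: str    # x
    chosen: str    # y_w
    rejected: str  # y_l


def build_dpo_samples(ex: PreferenceExample) -> List[Tuple[str, str]]:
    # Standard DPO-style construction: two scored sequences per example,
    # (x, y_l) and (x, y_w).
    return [(ex.prompt, ex.rejected), (ex.prompt, ex.chosen)]


def build_srpo_samples(ex: PreferenceExample) -> List[Tuple[str, str]]:
    # Self-improvement-style construction described in the comment above:
    # the prompt is extended with a previous completion, and the model is
    # scored on producing either completion from that extended context,
    # giving four longer sequences per example:
    #   (x + y_l, y_l), (x + y_l, y_w), (x + y_w, y_l), (x + y_w, y_w)
    contexts = [ex.prompt + "\n" + y for y in (ex.rejected, ex.chosen)]
    targets = [ex.rejected, ex.chosen]
    return [(ctx, tgt) for ctx, tgt in product(contexts, targets)]


if __name__ == "__main__":
    ex = PreferenceExample(prompt="x", chosen="y_w", rejected="y_l")
    print(len(build_dpo_samples(ex)))   # 2 forward passes over short contexts
    print(len(build_srpo_samples(ex)))  # 4 forward passes over longer contexts
```

This is only meant to make the cost argument concrete: each example produces twice as many scored sequences, each with a longer context.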
I'm planning to attempt to add this to TRL. Hope to have a PR ready relatively soon!
Hi,
This new paper (https://arxiv.org/pdf/2406.01660v2) looks very compelling.
For offline RLHF, SRPO appears to outperform DPO on OOD tasks.
Is there a plan to implement this in TRL?
I could not find an SRPO implementation on GitHub yet.
Thanks!