Closed — jinzhuoran closed this issue 2 months ago.
Triaged as P1 since it's a nice-to-have.
@jinzhuoran yes! thanks for your interest! Integrating with DPO would definitely be cool -- given that ReFT allows quick iterative adaptation. Additionally, a reward model trained with ReFT is essentially the base LM plus a set of very small interventions. The same base model can be trained with another set of interventions for language completion in parallel, and you don't need to load two copies of the model into memory.
In short, there is a lot to explore with DPO + ReFT, but we are currently looking for help! If you want to work on it, let us know! We can help on the side.
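To make the "one base model, several intervention sets" point concrete, here is a toy sketch. Everything in it (`Intervention`, `base_forward`, the offset parameter) is made up for illustration and is not pyreft API -- it only shows the memory-sharing idea: the frozen base is a single object, and each task carries only its own tiny set of edits.

```python
# Toy sketch (NOT pyreft API): one frozen base shared by several small
# intervention sets, so only one copy of the base "weights" exists.

class Intervention:
    """A tiny learned edit applied to hidden states (stand-in for a ReFT intervention)."""
    def __init__(self, offset):
        self.offset = offset  # the only "trainable" parameter in this toy

    def __call__(self, hidden):
        return [h + self.offset for h in hidden]

def base_forward(tokens):
    # Frozen base model, caricatured as a fixed function of the input.
    return [float(t) * 0.5 for t in tokens]

def forward_with(interventions, tokens):
    # Base computation is shared; only the interventions differ per task.
    hidden = base_forward(tokens)
    for iv in interventions:
        hidden = iv(hidden)
    return hidden

reward_ivs = [Intervention(1.0)]       # intervention set trained as a reward model
completion_ivs = [Intervention(-1.0)]  # a second set, trained for generation

tokens = [1, 2, 3]
print(forward_with(reward_ivs, tokens))      # [1.5, 2.0, 2.5]
print(forward_with(completion_ivs, tokens))  # [-0.5, 0.0, 0.5]
```

The design point: swapping `reward_ivs` for `completion_ivs` changes the model's behavior without duplicating `base_forward`, which is what keeps memory cost near that of a single model.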
btw, it's already super easy to implement this with the existing ReftRewardTrainer and a slight modification of the loss computation in DPOTrainer from the trl library -- will add it to the library soon!
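For reference, the core of that loss computation is small. Here is a self-contained sketch of the standard DPO loss (Rafailov et al., 2023) in plain Python -- scalar log-probs rather than the batched tensors a real trainer would use:

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one preference pair: -log sigmoid(beta * margin)."""
    # How much more the policy favors each completion than the reference does.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    x = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(x)) in a numerically stable form.
    if x >= 0:
        return math.log1p(math.exp(-x))
    return -x + math.log1p(math.exp(x))

# Policy is shifted toward the chosen completion relative to the reference,
# so the margin is positive (2.0) and the loss is below log(2).
print(round(dpo_loss(-10.0, -14.0, -11.0, -13.0), 4))  # 0.5981
```

For the ReFT variant, the policy log-probs would come from the base model with interventions attached and the reference log-probs from the same base model with interventions detached, which is what avoids holding two model copies.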
@AmirZur will work on the DPO trainer and open a PR soon! Local tests with TruthfulQA look promising.
Thank you for your help! I can't wait to try out this new feature!
Hi @AmirZur @frankaging, I'm trying to use ReFT with DPO, but I often encounter loss=nan. Have you run into this?
Hi @jinzhuoran! I haven't run into loss=nan issues yet. You can find my implementation and a small walkthrough notebook in the amir/dpo branch -- you can use it to compare our implementations.
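For anyone else hitting loss=nan: I haven't seen the failing code, but one common culprit in DPO-style losses is computing `-log(sigmoid(x))` naively. For very negative logits the sigmoid underflows to 0 (around x < -88 in float32), so the log produces -inf and the loss becomes nan. A sketch of the stable identity, in plain Python:

```python
import math

def naive_neg_logsigmoid(x):
    # Direct -log(sigmoid(x)): sigmoid underflows to 0 for very negative x,
    # giving log(0) = -inf (nan after backprop) in float32 tensor code.
    # Plain Python doubles raise OverflowError instead, shown below.
    return -math.log(1.0 / (1.0 + math.exp(-x)))

def stable_neg_logsigmoid(x):
    # Same quantity via the identity:
    #   -log sigmoid(x) = max(-x, 0) + log1p(exp(-|x|))
    # exp never sees a large positive argument, so nothing overflows.
    return max(-x, 0.0) + math.log1p(math.exp(-abs(x)))

print(stable_neg_logsigmoid(-800.0))  # 800.0 -- finite, no overflow
try:
    naive_neg_logsigmoid(-800.0)
except OverflowError:
    print("naive form overflowed")
```

If the implementation already uses a stable log-sigmoid, the nan may instead come from the log-probs themselves (e.g. padding tokens included in the sum), which is worth checking separately.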
marking this ticket as closed! feel free to open new ones for other questions!
The DPO folder is here. Thanks @AmirZur!!
Hi @frankaging, thanks for open-sourcing such a useful toolkit. I'm quite curious about how DPO could be integrated with ReFT within your project. Could you share whether there are any plans to incorporate DPO?