Different with online dpo papers

huggingface / trl

Train transformer language models with reinforcement learning.

http://hf.co/docs/trl

Apache License 2.0

Different with online dpo papers #2018

Closed by mst272 2 weeks ago

mst272 commented 2 months ago

I see that the paper says the annotator can be adjusted through the prompt, but the TRL implementation uses a score instead. Is this different from the paper?

qgallouedec commented 2 months ago

Indeed, it's different from the paper for now. We will soon implement Online DPO with a judge (i.e., an LLM annotator). The PR will be linked to this issue.
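
For readers following the thread, here is a minimal sketch of the distinction being discussed: building a (chosen, rejected) pair from a scalar score versus from a pairwise judge. It is illustrative only and does not show TRL's actual code. Every name in it (`pair_from_scores`, `pair_from_judge`, `toy_score`, `toy_judge`) is hypothetical, and the toy functions stand in for a real scoring model and a real prompted LLM annotator.

```python
from typing import Callable, Tuple

# Hypothetical signatures for illustration; these are NOT TRL APIs.
ScoreFn = Callable[[str, str], float]     # (prompt, completion) -> scalar score
JudgeFn = Callable[[str, str, str], int]  # (prompt, a, b) -> index of preferred completion (0 or 1)


def pair_from_scores(prompt: str, a: str, b: str, score_fn: ScoreFn) -> Tuple[str, str]:
    """Score each completion independently; the higher score becomes 'chosen'."""
    if score_fn(prompt, a) >= score_fn(prompt, b):
        return a, b
    return b, a


def pair_from_judge(prompt: str, a: str, b: str, judge_fn: JudgeFn) -> Tuple[str, str]:
    """Ask a judge to compare both completions at once; its pick becomes 'chosen'."""
    winner = judge_fn(prompt, a, b)
    return (a, b) if winner == 0 else (b, a)


# Toy stand-ins (assumptions): a real setup would call a scoring model or an LLM here.
def toy_score(prompt: str, completion: str) -> float:
    return float(len(completion))  # pretends longer answers are better


def toy_judge(prompt: str, a: str, b: str) -> int:
    return 0 if len(a) <= len(b) else 1  # pretends the judge was told to prefer brevity


score_pair = pair_from_scores("q", "short", "a longer answer", toy_score)
judge_pair = pair_from_judge("q", "short", "a longer answer", toy_judge)
```

With a judge, the preference criterion can be changed by changing the instructions given to the annotator, which is the adjustability the question refers to. With a score, the criterion is fixed by whatever produces the score.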