Closed: mst272 closed this issue 2 weeks ago
The paper says that the annotator can be adjusted through the prompt, but the TRL implementation uses a score instead. Is this different from the paper?
Indeed, it differs from the paper for now. We will soon implement Online DPO with a judge (i.e., an LLM annotator). The PR will be linked to this issue.
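To make the distinction concrete, here is a minimal sketch of the two annotation styles. It is not TRL's API: every name in it (`pick_by_score`, `pick_by_judge`, `toy_score`, `toy_judge`) is hypothetical, and the toy scorer and judge are stand-ins for a real reward model and a real LLM.

```python
from typing import Callable, Tuple

# Hypothetical type aliases, for illustration only (not part of TRL).
ScoreFn = Callable[[str, str], float]          # (prompt, completion) -> scalar score
JudgeFn = Callable[[str, str, str, str], int]  # (instruction, prompt, a, b) -> 0 or 1


def pick_by_score(prompt: str, a: str, b: str, score_fn: ScoreFn) -> Tuple[str, str]:
    """Score-based annotation: each completion is scored on its own,
    and the higher-scoring one becomes the chosen completion."""
    if score_fn(prompt, a) >= score_fn(prompt, b):
        return a, b
    return b, a


def pick_by_judge(prompt: str, a: str, b: str, judge_fn: JudgeFn,
                  instruction: str) -> Tuple[str, str]:
    """Judge-based annotation: a pairwise judge sees both completions plus an
    instruction, so the preference can be changed by editing the instruction."""
    winner = judge_fn(instruction, prompt, a, b)
    return (a, b) if winner == 0 else (b, a)


def toy_score(prompt: str, completion: str) -> float:
    """Toy stand-in for a reward model: longer completions score higher."""
    return float(len(completion))


def toy_judge(instruction: str, prompt: str, a: str, b: str) -> int:
    """Toy stand-in for an LLM judge: it prefers the shorter completion when the
    instruction asks for concision, and the longer one otherwise."""
    if "concise" in instruction.lower():
        return 0 if len(a) <= len(b) else 1
    return 0 if len(a) >= len(b) else 1


short, long_ = "short", "a much longer answer"
score_choice = pick_by_score("q", short, long_, toy_score)
judge_choice = pick_by_judge("q", short, long_, toy_judge,
                             "Prefer the more concise answer.")
```

With the toy scorer the preference is fixed by the scoring function, whereas with the toy judge the same pair can be ranked differently just by changing the instruction string.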