huggingface / trl

Train transformer language models with reinforcement learning.
http://hf.co/docs/trl
Apache License 2.0

[Feature] Add DiscoPOP algorithm #1796

Closed khalil-Hennara closed 1 month ago

khalil-Hennara commented 4 months ago

I recently read this paper: https://arxiv.org/abs/2406.08414. I am wondering if I can work on adding this algorithm to the framework. I want to implement it for myself, but I can also add it to the framework as a dependent trainer object.

github-actions[bot] commented 3 months ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

qgallouedec commented 2 months ago

As I understand it, DiscoPOP is a meta-optimisation method, i.e. it rewrites the objective function online, among other things. I have no idea how to make it work in TRL. If you have a reference implementation or are willing to work on it, please do. I can't guarantee that it would make sense to merge it in the future, but it's certainly something that could benefit the community (maybe as a research project, or as a link in the description).

qgallouedec commented 1 month ago

I'm closing this issue because the feature hasn't received any PR and doesn't seem to be widely requested by the community. It may be reopened at a later date if that changes.