huggingface / trl

Train transformer language models with reinforcement learning.
http://hf.co/docs/trl
Apache License 2.0

GRPO as part of HF TRL? #2103

Open JumpingRain opened 1 week ago

JumpingRain commented 1 week ago

Feature request

Qwen2.5-Math and Qwen2.5-Coder are two state-of-the-art models that have recently adopted GRPO (Group Relative Policy Optimization) in their training. It would be useful to have GRPO available in TRL.

Motivation

https://qwenlm.github.io/blog/qwen2.5-math/ https://arxiv.org/pdf/2402.03300

Your contribution

This is a request only; I am not contributing an implementation.

lewtun commented 6 days ago

Hello @JumpingRain, there is an open PR for this, #1954, which is currently under development.