GRPO as part of HF TRL?

huggingface / trl

Train transformer language models with reinforcement learning.

http://hf.co/docs/trl

Apache License 2.0

9.56k stars, 1.2k forks

Open JumpingRain opened 1 week ago

JumpingRain commented 1 week ago

Qwen2.5-Math and Qwen2.5-Code are two state-of-the-art models that have recently integrated GRPO (Group Relative Policy Optimization).

This is a feature request only; I am not contributing an implementation myself.

lewtun commented 6 days ago

Hello @JumpingRain, there is an open PR for this, #1954, which is currently under development.
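
For readers unfamiliar with the method requested above: the "group relative" part of GRPO refers to scoring each sampled completion against the other completions drawn for the same prompt, rather than against a learned value function. The snippet below is only a minimal sketch of that normalization step, not the code from the PR and not a TRL API. The function name `group_relative_advantages`, the `eps` term, and the use of the population standard deviation are assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of completions for the same prompt.

    Hypothetical helper for illustration only: each reward is centered on the
    group mean and scaled by the group standard deviation (population std is
    an assumption here; eps guards against division by zero when all rewards
    in the group are equal).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Example: four completions for one prompt, two judged correct (reward 1.0).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

In this sketch, completions that score above the group mean receive a positive advantage and those below receive a negative one; a full GRPO training loop would also involve the policy-gradient objective and a KL penalty, which are omitted here.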