huggingface / trl

Train transformer language models with reinforcement learning.
http://hf.co/docs/trl
Apache License 2.0

What's the difference between PPO Trainer and PPOv2 Trainer? #1793

Open cxjtju opened 3 weeks ago

cxjtju commented 3 weeks ago

What's the difference between PPO Trainer and PPOv2 Trainer?

vwxyzjn commented 3 weeks ago

Hi @cxjtju. PPOv2Trainer is the new experimental PPO trainer that we now recommend to users. It is a refactor of PPOTrainer that introduces more uniform APIs, better logging, better documentation, and more benchmark results.

js0nwu commented 5 days ago

The original PPO trainer supports using an arbitrary reward function (e.g., one that is not an HF model), which is a lot more flexible. Is there a way to achieve similar functionality with the new API?

vwxyzjn commented 5 days ago

> The original PPO trainer supports using an arbitrary reward function (e.g., not an HF model) which is a lot more flexible. I was wondering if there is a way to achieve similar functionality using the new API.

Right now there is no convenient API for this, but hacking it in should be pretty easy: you can modify the following code so that it calls an arbitrary reward function instead of the reward model.

https://github.com/huggingface/trl/blob/5828a666bff52eb18c1107317c0dfb54f57430b8/trl/trainer/ppov2_trainer.py#L326-L328
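A minimal sketch of what such a replacement could look like, assuming the linked lines produce one scalar score per generated sequence. This is not the trainer's actual code: `score_with_custom_reward`, `toy_decode`, and `toy_reward` are hypothetical names, and the toy vocabulary stands in for a real tokenizer so the example runs on its own. In the trainer you would presumably pass the tokenizer's decode method and convert the returned list into a tensor on the right device.

```python
from typing import Callable, List, Sequence


def score_with_custom_reward(
    query_responses: Sequence[Sequence[int]],
    context_length: int,
    decode: Callable[[Sequence[int]], str],
    reward_fn: Callable[[str], float],
) -> List[float]:
    """Return one scalar score per sequence using a plain Python reward function.

    Hypothetical helper: it assumes each sequence is prompt tokens followed by
    response tokens, and that the prompt occupies the first `context_length` ids.
    """
    scores = []
    for token_ids in query_responses:
        # Score only the generated response, i.e. the tokens after the prompt.
        response_text = decode(token_ids[context_length:])
        scores.append(float(reward_fn(response_text)))
    return scores


# Toy stand-ins so the sketch runs without any model or tokenizer.
toy_vocab = {0: "<pad>", 1: "hello", 2: "world", 3: "good", 4: "bad"}


def toy_decode(token_ids: Sequence[int]) -> str:
    # Skip the padding id (0 in this toy vocabulary) when building the text.
    return " ".join(toy_vocab[t] for t in token_ids if t != 0)


def toy_reward(text: str) -> float:
    # Hypothetical rule-based reward: +1 per "good", -1 per "bad".
    words = text.split()
    return words.count("good") - words.count("bad")


batch = [[1, 2, 3, 3], [1, 2, 4, 0]]
scores = score_with_custom_reward(
    batch, context_length=2, decode=toy_decode, reward_fn=toy_reward
)
print(scores)
```

Any callable that maps response text to a number can be passed as `reward_fn`, for example a rule-based check or a call to an external scoring service.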

js0nwu commented 5 days ago

Thanks for the suggestion, @vwxyzjn. I was thinking of doing this for now, but as you said, it seems a bit "hacky." Hopefully this can be taken into consideration in PPOv2's future API design.