Open cxjtju opened 3 weeks ago
Hi, @cxjtju. The PPOv2Trainer is the new experimental PPO trainer that we now recommend to users. It is a refactor of PPOTrainer that introduces more uniform APIs, better logging and documentation, and more benchmark results.
The original PPO trainer supports using an arbitrary reward function (e.g., not an HF model) which is a lot more flexible. I was wondering if there is a way to achieve similar functionality using the new API.
Right now, there is no convenient API for this, but hacking it in should be pretty easy: you can modify the following code to use an arbitrary reward function.
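To make the idea concrete, here is a minimal sketch of an arbitrary (non-HF-model) reward function applied to a batch of generated responses. It is not the PPOv2Trainer code itself: `keyword_reward`, `score_responses`, and the toy `toy_decode` are hypothetical names made up for illustration, and where exactly the scores get plugged in depends on the trainer internals, which are not shown in this thread.

```python
from typing import Callable, List, Sequence


def keyword_reward(text: str, keyword: str = "thanks", max_words: int = 50) -> float:
    """Hypothetical rule-based reward (not part of TRL).

    Gives 1.0 if the keyword appears in the response, and subtracts 0.5
    if the response is longer than max_words whitespace-separated words.
    """
    score = 1.0 if keyword in text.lower() else 0.0
    if len(text.split()) > max_words:
        score -= 0.5
    return score


def score_responses(
    token_ids_batch: Sequence[Sequence[int]],
    decode: Callable[[Sequence[int]], str],
    reward_fn: Callable[[str], float],
) -> List[float]:
    """Decode each generated response and score it with a plain Python function."""
    return [reward_fn(decode(ids)) for ids in token_ids_batch]


# Toy stand-in for a tokenizer's decode method (assumption, for illustration only).
vocab = {0: "hello", 1: "thanks", 2: "world"}


def toy_decode(ids: Sequence[int]) -> str:
    return " ".join(vocab[i] for i in ids)


scores = score_responses([[0, 1], [0, 2]], toy_decode, keyword_reward)
print(scores)  # [1.0, 0.0]
```

In a real run, `decode` would be the tokenizer's decode method, and the resulting list would presumably be converted to a tensor on the right device before replacing the reward model's scores.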
thanks for the suggestion @vwxyzjn - i was thinking of doing this for now, but as you said, it seems a bit "hacky." hopefully this can be considered in ppov2's future api design
what's the difference between PPO Trainer and PPOv2 Trainer?