huggingface / trl

Train transformer language models with reinforcement learning.
http://hf.co/docs/trl
Apache License 2.0
9.24k stars 1.16k forks

[Question] Does TRL support DPO trainer with simulated environment for generating training data like PPO's step-based training? #1595

Closed aplmikex closed 3 months ago

aplmikex commented 4 months ago

Hi Hugging Face team,

I'm exploring the possibility of using the TRL library to train a model with reinforcement learning against a simulated environment. Specifically, I'm interested in using the DPO (Direct Preference Optimization) trainer to generate training data from the simulated environment, similar to how PPO (Proximal Policy Optimization) works with its step-based training.

However, after reviewing the TRL documentation and examples, I couldn't find any clear indication of whether this is supported. I'd like to know whether the DPO trainer in TRL can generate training data from a simulated environment, where the environment provides rewards and observations that are used to update the policy.
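For context on why this matters: the DPO objective itself contains no environment reward or rollout step — it only compares log-probabilities of a pre-collected (chosen, rejected) response pair under the policy and a frozen reference model. A minimal sketch of that objective in plain Python (this is an illustration of the published DPO loss, not TRL's internal implementation):

```python
import math

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one preference pair: -log sigmoid(beta * margin).

    Note that the loss depends only on log-probabilities of a static
    (chosen, rejected) pair -- there is no environment reward,
    observation, or step() call anywhere in it.
    """
    margin = (policy_logp_chosen - ref_logp_chosen) - (
        policy_logp_rejected - ref_logp_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# If the policy prefers the chosen response more strongly than the
# reference model does, the margin is positive and the loss is small.
loss = dpo_loss(policy_logp_chosen=-10.0, policy_logp_rejected=-30.0,
                ref_logp_chosen=-20.0, ref_logp_rejected=-25.0)
```

This is why DPO is usually described as offline: it consumes a fixed preference dataset, whereas PPO's step-based loop queries the environment for a reward after each generation.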

If this is supported, could you please provide an example or point me to the relevant documentation? If not, are there any plans to add this feature in the future?

Additional context:

I've reviewed the TRL documentation and examples but couldn't find any mention of using a simulated environment with the DPO trainer. I have seen examples of using PPO with a simulated environment, where the environment provides rewards and observations that are used to update the policy. I'm interested in TRL because of its ease of use and flexibility, but I need to know whether it can support this specific use case.
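One workaround sometimes used for this kind of setup (not a built-in TRL feature) is to run environment rollouts yourself, then convert reward-scored responses into preference pairs that an offline preference trainer can consume. A hypothetical sketch, where the `rollouts` field names and the helper function are my own assumptions, though the output columns (`prompt`, `chosen`, `rejected`) match the preference-dataset format commonly shown in TRL's DPO documentation:

```python
import itertools

def rollouts_to_preference_pairs(rollouts):
    """Turn reward-scored environment rollouts into preference pairs.

    rollouts: list of dicts with "prompt", "response", "reward" keys
    (hypothetical schema). For each prompt, every pair of responses
    with distinct rewards yields one (chosen, rejected) example.
    """
    by_prompt = {}
    for r in rollouts:
        by_prompt.setdefault(r["prompt"], []).append(r)

    pairs = []
    for prompt, group in by_prompt.items():
        for a, b in itertools.combinations(group, 2):
            if a["reward"] == b["reward"]:
                continue  # tied rewards carry no preference signal
            chosen, rejected = (a, b) if a["reward"] > b["reward"] else (b, a)
            pairs.append({"prompt": prompt,
                          "chosen": chosen["response"],
                          "rejected": rejected["response"]})
    return pairs

rollouts = [
    {"prompt": "p1", "response": "good answer", "reward": 1.0},
    {"prompt": "p1", "response": "bad answer", "reward": 0.0},
    {"prompt": "p2", "response": "lone answer", "reward": 0.5},
]
pairs = rollouts_to_preference_pairs(rollouts)
# "p1" yields one pair; "p2" has a single rollout, so no pair.
```

The resulting list could then be loaded as a dataset for offline preference training, but this remains a manual pipeline: the environment interaction happens outside the trainer, unlike PPO's in-the-loop stepping.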

github-actions[bot] commented 3 months ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.