huggingface / trl

Train transformer language models with reinforcement learning.
http://hf.co/docs/trl
Apache License 2.0
9.24k stars 1.16k forks

[Question] Does TRL support DPO trainer with simulated environment for generating training data like PPO's step-based training? #1595

Closed aplmikex closed 3 months ago

aplmikex commented 4 months ago

Hi Hugging Face team,

I'm exploring the possibility of using the TRL library to train a model with reinforcement learning against a simulated environment. Specifically, I'm interested in using the DPO (Direct Preference Optimization) trainer to generate training data from the simulated environment, similar to how PPO (Proximal Policy Optimization) works with its step-based training.

However, after reviewing the TRL documentation and examples, I couldn't find any clear indication of whether this is supported. I'd like to know whether the DPO trainer in TRL can generate training data from a simulated environment, where the environment provides rewards and observations that are used to update the policy.
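For context on why this matters: the DPO objective itself contains no environment reward or rollout step — it only compares log-probabilities of a pre-collected (chosen, rejected) response pair under the policy and a frozen reference model. A minimal sketch of that objective in plain Python (this is an illustration of the published DPO loss, not TRL's internal implementation):

```python
import math

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one preference pair: -log sigmoid(beta * margin).

    Note that the loss depends only on log-probabilities of a static
    (chosen, rejected) pair -- there is no environment reward,
    observation, or step() call anywhere in it.
    """
    margin = (policy_logp_chosen - ref_logp_chosen) - (
        policy_logp_rejected - ref_logp_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# If the policy prefers the chosen response more strongly than the
# reference model does, the margin is positive and the loss is small.
loss = dpo_loss(policy_logp_chosen=-10.0, policy_logp_rejected=-30.0,
                ref_logp_chosen=-20.0, ref_logp_rejected=-25.0)
```

This is why DPO is usually described as offline: it consumes a fixed preference dataset, whereas PPO's step-based loop queries the environment for a reward after each generation.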

If this is supported, could you please provide an example or point me to the relevant documentation? If not, are there any plans to add this feature in the future?

Additional context:

I've reviewed the TRL documentation and examples but couldn't find any mention of using a simulated environment with the DPO trainer. I have seen examples of using PPO with a simulated environment, where the environment provides rewards and observations that are used to update the policy. I'm interested in TRL because of its ease of use and flexibility, but I need to know whether it can support this specific use case.
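One workaround sometimes used for this kind of setup (not a built-in TRL feature) is to run environment rollouts yourself, then convert reward-scored responses into preference pairs that an offline preference trainer can consume. A hypothetical sketch, where the `rollouts` field names and the helper function are my own assumptions, though the output columns (`prompt`, `chosen`, `rejected`) match the preference-dataset format commonly shown in TRL's DPO documentation:

```python
import itertools

def rollouts_to_preference_pairs(rollouts):
    """Turn reward-scored environment rollouts into preference pairs.

    rollouts: list of dicts with "prompt", "response", "reward" keys
    (hypothetical schema). For each prompt, every pair of responses
    with distinct rewards yields one (chosen, rejected) example.
    """
    by_prompt = {}
    for r in rollouts:
        by_prompt.setdefault(r["prompt"], []).append(r)

    pairs = []
    for prompt, group in by_prompt.items():
        for a, b in itertools.combinations(group, 2):
            if a["reward"] == b["reward"]:
                continue  # tied rewards carry no preference signal
            chosen, rejected = (a, b) if a["reward"] > b["reward"] else (b, a)
            pairs.append({"prompt": prompt,
                          "chosen": chosen["response"],
                          "rejected": rejected["response"]})
    return pairs

rollouts = [
    {"prompt": "p1", "response": "good answer", "reward": 1.0},
    {"prompt": "p1", "response": "bad answer", "reward": 0.0},
    {"prompt": "p2", "response": "lone answer", "reward": 0.5},
]
pairs = rollouts_to_preference_pairs(rollouts)
# "p1" yields one pair; "p2" has a single rollout, so no pair.
```

The resulting list could then be loaded as a dataset for offline preference training, but this remains a manual pipeline: the environment interaction happens outside the trainer, unlike PPO's in-the-loop stepping.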

github-actions[bot] commented 3 months ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.