Support for MiniCPM-V Reinforcement Learning with Direct Preference Optimization (DPO)

huggingface / trl

Support for MiniCPM-V Reinforcement Learning with Direct Preference Optimization (DPO) #2326

Open DarioPTWR opened 1 week ago

DarioPTWR commented 1 week ago

Feature request

Hi! I’d like to request support for reinforcement learning with DPO for the MiniCPM-V model. I'm not sure whether the current state of this repository allows this vision model to be retrained as well. Could I get some advice or insights on that? Would the current approach for applying DPO to VLMs work for the majority of VLMs on HuggingFace?

Motivation

None

Your contribution

None

qgallouedec commented 1 week ago

We have an example script for training a VLM with DPO here. Have you tried running it with MiniCPM-V? At present, we don't claim that it works with every VLM, since VLMs are less standardized than LLMs. But it's definitely worth giving this one a try.
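
For context, the DPO objective itself does not depend on the model architecture: it only needs the log-probabilities that the policy and a frozen reference model assign to the chosen and rejected responses. What differs between VLMs is how those log-probabilities are produced (processor, image inputs, forward pass). A minimal sketch of the per-pair loss in plain Python follows. The function name `dpo_loss` and the default `beta=0.1` are illustrative assumptions, not TRL's API, and the log-probabilities are assumed to be already summed over response tokens.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-pair DPO loss: -log(sigmoid(beta * (chosen_logratio - rejected_logratio))).

    Hypothetical helper for illustration only; beta=0.1 is an assumed default.
    """
    # How much more (or less) likely each response is under the policy
    # than under the frozen reference model, in log space.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch to avoid overflow.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

When the policy matches the reference, the margin is zero and the loss equals log 2. The loss falls as the policy raises the chosen response relative to the rejected one, whatever modality the inputs come from.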

DarioPTWR commented 3 days ago

Alright cool! Will try it out and provide an update, thanks for your response!