huggingface / trl

Train transformer language models with reinforcement learning.
http://hf.co/docs/trl

[FEATURE REQUEST] DPOTrainer allowing images for multimodal models #1359

Closed: nnethercott closed this issue 5 months ago

nnethercott commented 6 months ago

Currently the DPOTrainer is incredibly convenient for fine-tuning LLMs on language preference datasets, but there is no support for using this class to train multimodal chatbots like LLaVA. In theory the process should be pretty straightforward: the image features just prefix the language inputs to the model, so the DPO loss mechanism remains unchanged. Plus, building multimodal preference datasets following the same recipe as Intel/orca_dpo_pairs with Gemini or GPT-4 is fairly easy to do, so I imagine there'll be a need for multimodal support in the trainer going forward.
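For illustration, here's a rough sketch of why the loss itself wouldn't change (this is not the DPOTrainer implementation; the `multimodal_sequence_logps` helper and the way image embeddings are prepended are assumptions modeled on LLaVA-style architectures):

```python
import torch
import torch.nn.functional as F


def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # Standard DPO objective: -log sigmoid(beta * difference of log-ratios).
    # Nothing here depends on whether the inputs were text-only or image + text.
    pi_logratios = policy_chosen_logps - policy_rejected_logps
    ref_logratios = ref_chosen_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (pi_logratios - ref_logratios)).mean()


def multimodal_sequence_logps(model, image_embeds, input_ids, labels):
    # Hypothetical helper: prepend (already projected) image features to the
    # text embeddings, run the forward pass, then sum log-probs over the
    # completion tokens exactly as in the text-only case.
    text_embeds = model.get_input_embeddings()(input_ids)
    inputs_embeds = torch.cat([image_embeds, text_embeds], dim=1)
    logits = model(inputs_embeds=inputs_embeds).logits

    # Drop the image positions so the logits line up with the text labels again.
    logits = logits[:, image_embeds.shape[1]:, :]

    # Shift for next-token prediction and gather per-token log-probs.
    logps = torch.log_softmax(logits[:, :-1, :], dim=-1)
    shifted_labels = labels[:, 1:]
    mask = shifted_labels != -100  # -100 marks prompt / padding tokens
    token_logps = torch.gather(
        logps, 2, shifted_labels.clamp(min=0).unsqueeze(-1)
    ).squeeze(-1)
    return (token_logps * mask).sum(-1)
```

The image prefix only affects how the per-sequence log-probs are computed; `dpo_loss` is identical to the text-only case.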

I was thinking of forking trl to do a quick and dirty implementation of the feature, but figured I'd mention it here in case someone else has already done it or it's in the pipeline for the official project.

VoVoR commented 6 months ago

You might be interested in this project: https://github.com/vlf-silkie/VLFeedback/tree/main. They've adapted the DPOTrainer to tune Qwen-VL on a multimodal preference dataset.
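For context, a record in a dataset of that kind could look roughly like the following (the field names are illustrative, not necessarily the VLFeedback schema or what a future DPOTrainer would expect):

```python
from datasets import Dataset

# Illustrative only: one "image + prompt + chosen/rejected" preference record.
records = [
    {
        "image": "images/chart_001.png",  # path, PIL.Image, or raw bytes
        "prompt": "USER: <image>\nWhat trend does this chart show?\nASSISTANT:",
        "chosen": "Revenue increases steadily from January through June.",
        "rejected": "The chart shows a photo of a cat.",
    }
]
dataset = Dataset.from_list(records)
print(dataset)
```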

github-actions[bot] commented 5 months ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.