*(Repository logo generated by DALL·E 3.)*

This repository contains the code for the paper "Strengthening Multimodal Large Language Model with Bootstrapped Preference Optimization" ([arXiv:2403.08730](https://arxiv.org/abs/2403.08730)).
Create a conda environment and install the package:

```bash
conda create -n bpo python=3.10 -y
conda activate bpo
pip install -e .
```
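As a quick sanity check that the editable install worked, you can try importing the package. The module name `llava` is an assumption based on the upstream LLaVA codebase this repository builds on:

```python
# Import check; the module name `llava` is an assumption based on
# the upstream LLaVA codebase this repository is built on.
import llava

# After `pip install -e .`, this should point into the repository checkout.
print(llava.__file__)
```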
Install flash-attn for efficient training:

```bash
pip install -e ".[train]"
pip install flash-attn --no-build-isolation
```
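If the build succeeds, the package should be importable; a quick check:

```python
# Verify that flash-attn built and installed correctly.
import flash_attn

print(flash_attn.__version__)
```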
- Download ShareGPT4V from here
- Download COCO from here
- Download the dataset annotations from here (a quick way to inspect them is sketched below)
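A minimal sketch for peeking at the downloaded annotations, assuming a LLaVA-style JSON list of records; the filename `bpo_annotations.json` is a placeholder, so substitute the file you actually downloaded:

```python
import json

# Placeholder filename; substitute the annotation file you downloaded.
with open("bpo_annotations.json") as f:
    annotations = json.load(f)

print(len(annotations), "records")
print(annotations[0])  # inspect one record to confirm the expected fields
```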
Extract data from ShareGPT4V and organize the images as follows:
```
Image_root
├── coco/
│   └── train2017/
├── llava/
│   └── llava_pretrain/
├── sam/
├── share_textvqa/
│   └── images/
├── web-celebrity/
│   └── images/
├── web-landmark/
│   └── images/
└── wikiart/
    └── images/
```
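A minimal sketch to sanity-check the layout before training; the directory names are taken from the tree above, and `Image_root` stands in for whatever root directory you chose:

```python
from pathlib import Path

# Subdirectories expected under the image root, per the tree above.
EXPECTED = [
    "coco/train2017",
    "llava/llava_pretrain",
    "sam",
    "share_textvqa/images",
    "web-celebrity/images",
    "web-landmark/images",
    "wikiart/images",
]

def check_layout(image_root: str) -> None:
    """Print which of the expected subdirectories exist under image_root."""
    root = Path(image_root)
    for rel in EXPECTED:
        path = root / rel
        status = "ok" if path.is_dir() else "MISSING"
        print(f"{status:>7}  {path}")

check_layout("Image_root")  # replace with your actual image root
```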
Run BPO fine-tuning:

```bash
bash scripts/finetune_bpo.sh
```

or, with flash-attn installed, use the flash attention variant:

```bash
bash scripts/finetune_bpo_flash.sh
```
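For orientation, BPO trains with preference optimization in the DPO family (the repo credits trl and Silkie below). A minimal sketch of the standard DPO loss, not this repository's exact implementation:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Standard DPO objective: increase the policy's margin between the
    chosen and rejected responses, relative to a frozen reference model."""
    chosen_ratio = policy_chosen_logps - ref_chosen_logps
    rejected_ratio = policy_rejected_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()
```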
The project is built on top of the amazing multimodal large language model LLaVA, the RLHF package trl, the multimodal DPO work Silkie, and the visual contrastive decoding method VCD. Thanks for their great work!
If you find our work useful for your research or applications, please cite using this BibTeX:
```bibtex
@misc{pi2024strengthening,
      title={Strengthening Multimodal Large Language Model with Bootstrapped Preference Optimization},
      author={Renjie Pi and Tianyang Han and Wei Xiong and Jipeng Zhang and Runtao Liu and Rui Pan and Tong Zhang},
      year={2024},
      eprint={2403.08730},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```