
VLFeedback

A GPT-4V-annotated preference dataset for large vision-language models (LVLMs).

[Project Page] [Datasets] [Silkie Model] [Paper]

Annotation Framework

Multimodal Instruction Source

The instructions are sampled from various domains to cover different capabilities of LVLMs.

Model Pool

We construct a model pool consisting of 12 LVLMs.

Silkie

We select Qwen-VL-Chat as the backbone model and perform direct preference optimization (DPO) on our dataset.
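
DPO fine-tunes the policy directly on preference pairs, with no separate reward model. As a minimal sketch of the standard DPO loss (Rafailov et al., 2023), assuming per-response summed log-probabilities have already been computed; this is illustrative, not this repo's implementation (trl's DPOTrainer provides an equivalent):

import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    # Each input is the summed log-probability of a full response,
    # shape (batch,); "chosen" is the GPT-4V-preferred response.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Maximize the reward margin of chosen over rejected responses.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()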

Silkie logo (generated by DALL·E 3)

The resulting model, Silkie, achieves comprehensive improvements across various benchmarks.

Installation

To run our training scripts, create a virtual environment and install the dependencies first.

conda create -n silkie python=3.10 && conda activate silkie
pip install -r requirements.txt

Training

Our training scripts support both single-node and multi-node training. We provide a launch_dpo.py script that handles both cases. If you want to launch a job locally, you can use:

python launch_dpo.py --config dpo_config/example.yaml --working $WORKING_DIR
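
The actual schema of the YAML config is defined by dpo_config/example.yaml shipped with this repo; the sketch below only illustrates the kind of fields a DPO run typically needs, and every key in it is hypothetical:

# Hypothetical DPO config sketch; consult dpo_config/example.yaml
# for the keys this repo actually expects.
model_name_or_path: Qwen/Qwen-VL-Chat   # backbone LVLM
dataset_path: /path/to/vlfeedback       # GPT-4V preference pairs
beta: 0.1                               # DPO temperature
learning_rate: 1.0e-5
num_train_epochs: 1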

If you want to launch a job on a Slurm cluster, specify GPUS_PER_NODE in launch_dpo.py and run:

python launch_dpo.py --config dpo_config/example.yaml --working $WORKING_DIR --gpus $NUM_GPUS

Citations

@article{2023vlfeedback,
  author  = {Lei Li and Zhihui Xie and Mukai Li and Shunian Chen and Peiyi Wang and Liang Chen and Yazheng Yang and Benyou Wang and Lingpeng Kong},
  title   = {Silkie: Preference Distillation for Large Visual Language Models},
  journal = {arXiv preprint arXiv:2312.10665},
  year    = {2023}
}

Acknowledgements

We would like to thank the authors of trl and Qwen-VL for their great work.