starriver030515 / SynthVLM

SynthVLM: High-Efficiency and High-Quality Synthetic Data for Vision Language Models


🚀🚀🚀 Official implementation of SynthVLM: High-Efficiency and High-Quality Synthetic Data for Vision Language Models.

Quick Usage

Data Preparation

Prepare your captions as a JSON array in which each entry has an id and a caption. Here's an example of how your JSON should look:

[
    {
        "id": 1,
        "caption": "The bus in the image is white and red. The back of the bus features an advertisement. The bus is driving down the street, which is crowded with people and other vehicles."
    },
    {
        "id": 2,
        "caption": "The dog in the image is brown with a red collar. It sits behind a window, looking out longingly, which gives it a sense of longing for the outdoors or something it sees."
    }
]
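
Before generating images, it can help to confirm that the caption file parses and that every entry carries both fields. The snippet below is a minimal sketch, not part of this repository: the function name load_captions and the file name captions.json are assumptions made for illustration. Note that Python's json module follows strict JSON and rejects a trailing comma after the last entry.

```python
import json
from pathlib import Path


def load_captions(path):
    """Load a caption file and check each entry for 'id' and 'caption'.

    Hypothetical helper for illustration; the repository may load
    captions differently.
    """
    # json.loads raises json.JSONDecodeError on invalid JSON,
    # including a trailing comma before the closing bracket.
    entries = json.loads(Path(path).read_text(encoding="utf-8"))

    if not isinstance(entries, list):
        raise ValueError("top-level JSON value must be a list of entries")

    for index, entry in enumerate(entries):
        if not isinstance(entry, dict):
            raise ValueError(f"entry {index} is not a JSON object")
        for key in ("id", "caption"):
            if key not in entry:
                raise ValueError(f"entry {index} is missing '{key}'")
        caption = entry["caption"]
        if not isinstance(caption, str) or not caption.strip():
            raise ValueError(f"entry {index} has an empty or non-string caption")

    return entries
```

Calling load_captions("captions.json") returns the list of entries unchanged, or raises an error that names the first entry with a problem.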

Data Generation Instructions

To generate images, run the run.sh script. The settings can be adjusted as follows:

Model Selection

The process supports two diffusion models:

License

Usage and License Notices: The data and checkpoint are intended and licensed for research use only. They are also restricted to uses that follow the license agreements of LLaMA, Vicuna, and GPT-4. The dataset is licensed under CC BY-NC 4.0 (non-commercial use only), and models trained using the dataset should not be used outside of research purposes.