Official implementation of SynthVLM: High-Efficiency and High-Quality Synthetic Data for Vision Language Models.
Prepare your captions in JSON format. Here's an example of how your JSON should look:
```json
[
  {
    "id": 1,
    "caption": "The bus in the image is white and red. The back of the bus features an advertisement. The bus is driving down the street, which is crowded with people and other vehicles."
  },
  {
    "id": 2,
    "caption": "The dog in the image is brown with a red collar. It sits behind a window, looking out longingly, which gives it a sense of longing for the outdoors or something it sees."
  }
]
```
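Before generating, it can help to sanity-check that a caption file matches this format. Below is a minimal sketch; `load_captions` is a hypothetical helper, not part of the repo:

```python
import json

def load_captions(path):
    """Load a caption file and verify each entry has an id and a caption."""
    with open(path) as f:
        captions = json.load(f)
    for entry in captions:
        # Each entry must follow the schema shown in the example above.
        assert "id" in entry and "caption" in entry, f"malformed entry: {entry}"
    return captions
```

Note that strict JSON does not allow a trailing comma after the last entry, so the list must end with `}` followed by `]`.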
To generate images, run the `run.sh` script. The settings can be adjusted as follows:

- Set `width` and `height` in the script to customize the image dimensions.
- Increase the `repeat` parameter to generate multiple candidate images per caption; the best-quality image is selected for each.

The process supports two diffusion models:
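The repeat-and-select step above can be sketched as a best-of-n loop. This is an illustrative sketch only: `generate_image` and `quality_score` are hypothetical stand-ins for the diffusion call and the quality metric, not functions from this repo:

```python
def best_of_n(caption, n, generate_image, quality_score):
    """Generate n candidate images for one caption and keep the highest-scoring one.

    generate_image(caption, seed) -> image   (hypothetical diffusion call)
    quality_score(image, caption) -> float   (hypothetical quality metric)
    """
    best_img, best_score = None, float("-inf")
    for seed in range(n):
        img = generate_image(caption, seed=seed)
        score = quality_score(img, caption)
        if score > best_score:
            best_img, best_score = img, score
    return best_img, best_score
```

A larger `repeat` (here `n`) raises the expected quality of the kept image at the cost of proportionally more generation time.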
Usage and License Notices: The data and checkpoints are intended and licensed for research use only. They are also restricted to uses that follow the license agreements of LLaMA, Vicuna, and GPT-4. The dataset is licensed under CC BY-NC 4.0 (non-commercial use only), and models trained on the dataset must not be used outside of research purposes.