gligen / GLIGEN

Open-Set Grounded Text-to-Image Generation
MIT License
1.91k stars 145 forks source link

GLIGEN: Open-Set Grounded Text-to-Image Generation (CVPR 2023)

Yuheng Li, Haotian Liu, Qingyang Wu, Fangzhou Mu, Jianwei Yang, Jianfeng Gao, Chunyuan Li*, Yong Jae Lee* (*Co-senior authors)

[Project Page] [Paper] [Demo] [YouTube Video] Teaser figure

IMAGE ALT TEXT HERE

:fire: News

Requirements

We provide dockerfile to setup environment.

Download GLIGEN models

We provide ten checkpoints for different use scenarios. All models here are based on SD-V-1.4. Mode Modality Download
Generation Box+Text HF Hub
Generation Box+Text+Image HF Hub
Generation Keypoint HF Hub
Inpainting Box+Text HF Hub
Inpainting Box+Text+Image HF Hub
Generation Hed map HF Hub
Generation Canny map HF Hub
Generation Depth map HF Hub
Generation Semantic map HF Hub
Generation Normal map HF Hub

Note that the provided checkpoint for semantic map is only trained on ADE20K dataset; the checkpoint for normal map is only trained on DIODE dataset.

Inference: Generate images with GLIGEN

We provide one script to generate images using provided checkpoints. First download models and put them in gligen_checkpoints. Then run

python gligen_inference.py

Example samples for each checkpoint will be saved in generation_samples. One can check gligen_inference.py for more details about interface.

Training

Grounded generation training

One need to first prepare data for different grounding modality conditions. Refer data for the data we used for different GLIGEN models. Once data is ready, the following command is used to train GLIGEN. (We support multi-GPUs training)

ptyhon main.py --name=your_experiment_name  --yaml_file=path_to_your_yaml_config

The --yaml_file is the most important argument and below we will use one example to explain key components so that one can be familiar with our code and know how to customize training on their own grounding modalities. The other args are self-explanatory by their names. The experiment will be saved in OUTPUT_ROOT/name

One can refer configs/flicker_text.yaml as one example. One can see that there are 5 components defining this yaml: diffusion, model, autoencoder, text_encoder, train_dataset_names and grounding_tokenizer_input. Typecially, diffusion, autoencoder and text_encoder should not be changed as they are defined by Stable Diffusion. One should pay attention to following:

Grounded inpainting training

GLIGEN also supports inpainting training. The following command can be used:

ptyhon main.py --name=your_experiment_name  --yaml_file=path_to_your_yaml_config --inpaint_mode=True  --ckpt=path_to_an_adapted_model

Typecially, we first train GLIGEN on generation task (e.g., text grounded generation) and this model has 4 channels for input conv (latent space of Stable Diffusion), then we modify the saved checkpoint to 9 channels with addition 5 channels initilized with 0. This continue training can lead to faster convergence and better results. path_to_an_adapted_model refers to this modified checkpoint, convert_ckpt.py can be used for modifying checkpoint. NOTE: yaml file is the same for generation and inpainting training, one only need to change --inpaint_mode

Citation

@article{li2023gligen,
  title={GLIGEN: Open-Set Grounded Text-to-Image Generation},
  author={Li, Yuheng and Liu, Haotian and Wu, Qingyang and Mu, Fangzhou and Yang, Jianwei and Gao, Jianfeng and Li, Chunyuan and Lee, Yong Jae},
  journal={CVPR},
  year={2023}
}

Disclaimer

The original GLIGEN was partly implemented during a part-time internship at Microsoft while the first author was working at The University of Wisconsin-Madison. This repo re-implements GLIGEN in PyTorch with university GPUs. Despite the minor implementation differences, this repo aims to reproduce the results and observations in the paper for research purposes.

Terms and Conditions

We have strict terms and conditions for using the model checkpoints and the demo; it is restricted to uses that follow the license agreement of Latent Diffusion Model and Stable Diffusion.

Broader Impact

It is important to note that our model GLIGEN is designed for open-world grounded text-to-image generation with caption and various condition inputs (e.g. bounding box). However, we also recognize the importance of responsible AI considerations and the need to clearly communicate the capabilities and limitations of our research. While the grounding ability generalizes well to novel spatial configuration and concepts, our model may not perform well in scenarios that are out of scope or beyond the intended use case. We strongly discourage the misuse of our model in scenarios, where our technology could be used to generate misleading or malicious images. We also acknowledge the potential biases that may be present in the data used to train our model, and the need for ongoing evaluation and improvement to address these concerns. To ensure transparency and accountability, we have included a model card that describes the intended use cases, limitations, and potential biases of our model. We encourage users to refer to this model card and exercise caution when applying our technology in new contexts. We hope that our work will inspire further research and discussion on the ethical implications of AI and the importance of transparency and accountability in the development of new technologies.