We present IP-Adapter, an effective and lightweight adapter that adds image prompt capability to pre-trained text-to-image diffusion models. An IP-Adapter with only 22M parameters achieves performance comparable to, or even better than, a fully fine-tuned image prompt model. IP-Adapter generalizes not only to other custom models fine-tuned from the same base model, but also to controllable generation with existing controllable tools. Moreover, the image prompt works well together with the text prompt to accomplish multimodal image generation.
```sh
# install diffusers (the demo is tested with 0.22.1)
pip install diffusers==0.22.1
# install ip-adapter
pip install git+https://github.com/tencent-ailab/IP-Adapter.git
# download the models
cd IP-Adapter
git lfs install
git clone https://huggingface.co/h94/IP-Adapter
mv IP-Adapter/models models
mv IP-Adapter/sdxl_models sdxl_models
# then you can use the notebook
```
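For a quick start outside the notebooks, here is a minimal sketch of image-prompt generation. It assumes the `IPAdapter` class from this repository and the model layout created by the commands above; the constructor and `generate` arguments follow the demo notebooks, so verify them against the current code.

```python
import torch
from diffusers import StableDiffusionPipeline
from PIL import Image

from ip_adapter import IPAdapter

# Base SD 1.5 pipeline that the adapter will wrap.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
    safety_checker=None,
)

# Paths assume the `mv` commands above; adjust to your layout.
ip_model = IPAdapter(
    pipe,
    image_encoder_path="models/image_encoder",
    ip_ckpt="models/ip-adapter_sd15.bin",
    device="cuda",
)

# Generate variations conditioned only on an image prompt.
image = Image.open("your_image.png")  # placeholder path
images = ip_model.generate(pil_image=image, num_samples=4, num_inference_steps=50, seed=42)
images[0].save("output.png")
```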
You can download the models from here. To run the demo, you should also download the following models:
Best Practice
- If you only use the image prompt, you can set `scale=1.0` and `text_prompt=""` (or a generic text prompt such as "best quality"; you can also use any negative text prompt). If you lower the `scale`, more diverse images can be generated, but they may be less consistent with the image prompt.
- For multimodal prompts, you can adjust the `scale` to get the best results. In most cases, setting `scale=0.5` gives good results (see the sketch after this list). For SD 1.5, we recommend using community models to generate good images.
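As an illustration of the multimodal case, a rough sketch combining a text prompt with a reduced image-prompt scale (`ip_model` and `image` as in the earlier snippet; the `scale` argument follows the demo notebooks):

```python
# Multimodal prompt: text plus image prompt, with the image influence
# reduced via scale=0.5.
images = ip_model.generate(
    pil_image=image,
    prompt="best quality, high quality",
    negative_prompt="monochrome, lowres, bad anatomy, worst quality, low quality",
    scale=0.5,
    num_samples=4,
    num_inference_steps=50,
    seed=42,
)
```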
IP-Adapter for non-square images

Because the default image processor of CLIP center-crops its input, IP-Adapter works best for square images; for non-square images, the information outside the center crop is lost. A simple workaround is to resize non-square images to 224x224 before using them as prompts (see the sketch below). A comparison of the two preprocessing options is shown below.
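For example (a rough sketch; squashing to 224x224 distorts the aspect ratio but keeps the whole image visible to CLIP):

```python
from PIL import Image

# Resize a non-square image to 224x224 so CLIP's center crop
# does not drop content outside the central square.
image = Image.open("your_image.png")  # placeholder path
image = image.resize((224, 224))
images = ip_model.generate(pil_image=image, num_samples=4, seed=42)
```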
A comparison of IP-Adapter_XL with Reimagine XL is shown below:
Improvements in new version (2023.9.8):
For training, you should install accelerate and organize your own dataset into a JSON file.
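For illustration, a small script that writes a dataset file in one plausible shape (the `"image_file"` and `"text"` field names are assumptions; verify them against the dataset class in tutorial_train.py):

```python
import json

# Hypothetical layout for data.json: one record per training image.
records = [
    {"image_file": "1.png", "text": "a dog sitting on the grass"},
    {"image_file": "2.png", "text": "a red vintage car"},
]
with open("data.json", "w") as f:
    json.dump(records, f)
```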
```sh
accelerate launch --num_processes 8 --multi_gpu --mixed_precision "fp16" \
  tutorial_train.py \
  --pretrained_model_name_or_path="runwayml/stable-diffusion-v1-5" \
  --image_encoder_path="{image_encoder_path}" \
  --data_json_file="{data.json}" \
  --data_root_path="{image_path}" \
  --mixed_precision="fp16" \
  --resolution=512 \
  --train_batch_size=8 \
  --dataloader_num_workers=4 \
  --learning_rate=1e-04 \
  --weight_decay=0.01 \
  --output_dir="{output_dir}" \
  --save_steps=10000
```
Once training is complete, you can convert the weights with the following code:
```python
import torch

# Split the raw training checkpoint into the two state dicts that make
# up an IP-Adapter checkpoint: the image projection model and the
# adapter (cross-attention) modules.
ckpt = "checkpoint-50000/pytorch_model.bin"
sd = torch.load(ckpt, map_location="cpu")
image_proj_sd = {}
ip_sd = {}
for k in sd:
    if k.startswith("unet"):
        pass  # UNet weights are not part of the adapter checkpoint
    elif k.startswith("image_proj_model"):
        image_proj_sd[k.replace("image_proj_model.", "")] = sd[k]
    elif k.startswith("adapter_modules"):
        ip_sd[k.replace("adapter_modules.", "")] = sd[k]

torch.save({"image_proj": image_proj_sd, "ip_adapter": ip_sd}, "ip_adapter.bin")
```
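The saved ip_adapter.bin uses the same `"image_proj"`/`"ip_adapter"` key layout as the released checkpoints, so it can be loaded the same way (reusing the pipeline setup assumed in the first sketch):

```python
# Load the converted weights exactly like a released IP-Adapter checkpoint.
ip_model = IPAdapter(
    pipe,
    image_encoder_path="models/image_encoder",
    ip_ckpt="ip_adapter.bin",
    device="cuda",
)
images = ip_model.generate(pil_image=image, num_samples=4, seed=42)
```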
This project strives to positively impact the domain of AI-driven image generation. Users are granted the freedom to create images using this tool, but they are expected to comply with local laws and utilize it in a responsible manner. The developers do not assume any responsibility for potential misuse by users.
If you find IP-Adapter useful for your research and applications, please cite using this BibTeX:
```bibtex
@article{ye2023ip-adapter,
  title={IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models},
  author={Ye, Hu and Zhang, Jun and Liu, Sibo and Han, Xiao and Yang, Wei},
  journal={arXiv preprint arXiv:2308.06721},
  year={2023}
}
```