Create a conda environment named vico using:
conda env create -f environment.yaml
conda activate vico
Download the pretrained Stable Diffusion v1-4 checkpoint and place it under models/ldm/stable-diffusion-v1.
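Before moving on, it can help to confirm that the checkpoint is where the commands in this README expect it. The following is a minimal sketch, assuming the file is named sd-v1-4.ckpt as in the --ckpt_path and --actual_resume arguments used later; the helper name find_sd_checkpoint is hypothetical and not part of this repository.

```python
from pathlib import Path

# Expected location, taken from the --ckpt_path / --actual_resume arguments
# used by the inference and training commands in this README.
SD_CKPT = Path("models/ldm/stable-diffusion-v1/sd-v1-4.ckpt")


def find_sd_checkpoint(root="."):
    """Return the checkpoint path under `root`, or None if it is missing.

    Hypothetical helper for illustration; not part of the ViCo code base.
    """
    path = Path(root) / SD_CKPT
    return path if path.is_file() else None


if __name__ == "__main__":
    found = find_sd_checkpoint()
    print(f"found: {found}" if found else f"missing: {SD_CKPT}")
```

Run it from the repository root; it only checks that the file exists and does not validate its contents.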
We provide pretrained checkpoints at 300, 350, and 400 steps for 8 objects. You can download the sample images and their corresponding pretrained checkpoints, or download the data for an individual object:
Object | Sample images | Checkpoints |
---|---|---|
barn | image | ckpt |
batman | image | ckpt |
clock | image | ckpt |
dog7 | image | ckpt |
monster toy | image | ckpt |
pink sunglasses | image | ckpt |
teddybear | image | ckpt |
wooden pot | image | ckpt |
The datasets were originally collected and provided by Textual Inversion, DreamBooth, and Custom Diffusion. You can find all datasets used for quantitative comparison in our paper.
Before running the inference command, please set:
- REF_IMAGE_PATH: Path of the reference image. It can be any image in the samples, e.g., batman/1.jpg.
- CHECKPOINT_PATH: Path of the checkpoint weight. Its subfolder should be similar to checkpoints/*-399.pt.
- OUTPUT_PATH: Path of the generated images, e.g., outputs/batman.
python scripts/vico_txt2img.py \
--ddim_eta 0.0 --n_samples 4 --n_iter 2 --scale 7.5 --ddim_steps 50 \
--ckpt_path models/ldm/stable-diffusion-v1/sd-v1-4.ckpt \
--image_path REF_IMAGE_PATH \
--ft_path CHECKPOINT_PATH \
--load_step 399 \
--prompt "a photo of * on the beach" \
--outdir OUTPUT_PATH
You can specify load_step (300, 350, 400) and personalize the prompt (the prefix "a photo of" usually gives better results).
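If you want to try several prompts or checkpoints, the inference command can be assembled programmatically. The sketch below only reuses the flags shown in the command above; the function name build_inference_cmd and the example paths are hypothetical, and nothing here is provided by the repository itself. It assumes Python 3.8 or newer for shlex.join.

```python
import shlex


def build_inference_cmd(ref_image_path, checkpoint_path, output_path,
                        load_step=399, prompt="a photo of * on the beach"):
    """Assemble the vico_txt2img.py command from this README as an argument list.

    Hypothetical helper for illustration; the flags and default values are
    copied from the inference command above.
    """
    return [
        "python", "scripts/vico_txt2img.py",
        "--ddim_eta", "0.0", "--n_samples", "4", "--n_iter", "2",
        "--scale", "7.5", "--ddim_steps", "50",
        "--ckpt_path", "models/ldm/stable-diffusion-v1/sd-v1-4.ckpt",
        "--image_path", str(ref_image_path),
        "--ft_path", str(checkpoint_path),
        "--load_step", str(load_step),
        "--prompt", prompt,
        "--outdir", str(output_path),
    ]


if __name__ == "__main__":
    # Example values only; replace CHECKPOINT_PATH with your own path.
    cmd = build_inference_cmd("batman/1.jpg", "CHECKPOINT_PATH", "outputs/batman")
    print(shlex.join(cmd))  # prints a shell-quoted, copy-pasteable command
    # To launch it from Python instead: subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell-quoting problems with the * placeholder in the prompt.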
Before running the training command, please set:
- RUN_NAME: Your run name. It will be used as the name of the log folder.
- GPUS_USED: The GPUs you are using, e.g., "0,1,2,3" (4 RTX 3090 GPUs in my case).
- TRAIN_DATA_ROOT: Path of your training images.
- INIT_WORD: The word used to initialize the representation of your unique object, e.g., "dog" or "toy".
python main.py \
--base configs/stable-diffusion/v1-finetune.yaml -t \
--actual_resume models/ldm/stable-diffusion-v1/sd-v1-4.ckpt \
-n RUN_NAME \
--gpus GPUS_USED \
--data_root TRAIN_DATA_ROOT \
--init_word INIT_WORD
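The training command can be assembled the same way, which is convenient when launching runs for several objects. This is a sketch under the same assumptions: it only reuses the flags shown above, and the function name build_train_cmd and the example values are hypothetical rather than part of the repository.

```python
import shlex


def build_train_cmd(run_name, gpus_used, train_data_root, init_word):
    """Assemble the main.py training command from this README as an argument list.

    Hypothetical helper for illustration; the flags are copied from the
    training command above.
    """
    return [
        "python", "main.py",
        "--base", "configs/stable-diffusion/v1-finetune.yaml", "-t",
        "--actual_resume", "models/ldm/stable-diffusion-v1/sd-v1-4.ckpt",
        "-n", run_name,
        "--gpus", gpus_used,
        "--data_root", str(train_data_root),
        "--init_word", init_word,
    ]


if __name__ == "__main__":
    # Example values only; the run name and data folder are made up.
    cmd = build_train_cmd("my_run", "0,1,2,3", "TRAIN_DATA_ROOT", "toy")
    print(shlex.join(cmd))  # prints a shell-quoted, copy-pasteable command
```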
If you use this code in your research, please consider citing our paper:
@inproceedings{Hao2023ViCo,
title={ViCo: Detail-Preserving Visual Condition for Personalized Text-to-Image Generation},
author={Shaozhe Hao and Kai Han and Shihao Zhao and Kwan-Yee K. Wong},
year={2023}
}
This code repository is based on the great work of Textual Inversion. Thanks!