Hsin-Ying Lee, Hung-Yu Tseng, Hsin-Ying Lee, Ming-Hsuan Yang
CVPR 2024
This is the official implementation of Exploiting Diffusion Prior for Generalizable Dense Prediction.
Our implementation is based on Python 3.10 and CUDA 11.3.
Required
diffusers==0.20.0
pytorch==1.12.1
torchvision==0.13.1
transformers==4.31.0
Optional
accelerate # for training
gradio # for demo
omegaconf # for configuration
xformers # for acceleration
We provide model weights for five tasks to reproduce the results in the paper. These checkpoints are trained with 10K synthesized bedroom images, prompts, and pseudo ground truths.
In addition, for normal and depth prediction, we provide weights trained on more diverse scenes and without prompts, which are better suited to practical use.
Download the weights from this Google Drive and place them in the root directory.
For checkpoints with -notext, set disable_prompts=True.
from PIL import Image
from pipeline import Pipeline
LORA_DIR = 'ckpt/normal-scene100-notext'
disable_prompts = LORA_DIR.endswith('-notext')
ppl = Pipeline(
disable_prompts=disable_prompts,
lora_ckpt=LORA_DIR,
device='cuda',
mixed_precision='fp16',
)
img = Image.open('/path/to/img')
For depth prediction,
output_np_array = ppl(img, inference_step=5, target_mode='F')
Otherwise,
output_pil_img = ppl(img, inference_step=5, target_mode='RGB')
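The outputs can then be saved, for example, like this (a minimal sketch; the file names are placeholders, and writing the depth as an npz with key x simply mirrors the convention infer.py uses below):
import numpy as np

# normals / albedo / shading / segmentation: the pipeline returns a PIL image
output_pil_img.save('normal_pred.png')

# depth: the pipeline returns a float numpy array
np.savez_compressed('depth_pred.npz', x=output_np_array)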
Alternatively, we provide a Gradio demo. You can launch it with
python app.py
and access the app at localhost:7860.
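For reference, a minimal sketch of wrapping the Pipeline in Gradio looks like the following. This is not the actual app.py, which may differ; the checkpoint and settings reuse the usage example above.
import gradio as gr
from PIL import Image
from pipeline import Pipeline

# assumes the normal-prediction checkpoint from the example above
ppl = Pipeline(
    disable_prompts=True,
    lora_ckpt='ckpt/normal-scene100-notext',
    device='cuda',
    mixed_precision='fp16',
)

def predict(img: Image.Image) -> Image.Image:
    # inference_step=5 and target_mode='RGB', as in the usage example above
    return ppl(img, inference_step=5, target_mode='RGB')

demo = gr.Interface(fn=predict, inputs=gr.Image(type='pil'), outputs=gr.Image(type='pil'))
demo.launch()  # serves on localhost:7860 by default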
We conduct the experiments on synthetic images so that we can control and analyze performance across data domains. We first generate prompts from scene keywords, then generate images from the prompts.
To generate prompts,
python tools/gencap.py KEYWORD -n NUMBER_OF_PROMPTS -o OUTPUT_TXT
KEYWORD can be a single word or a text file containing multiple keywords, one per line.
To generate images,
python tools/txt2img.py --from-file PROMPTS_TXT --output OUTPUT_DIR --batch-size BSZ
These two scripts are thin wrappers around huggingface's transformers and diffusers.
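For reference, the image-generation step is roughly equivalent to the following diffusers snippet. This is only a sketch, not the actual tools/txt2img.py; the base model name and file paths are assumptions.
import os
import torch
from diffusers import StableDiffusionPipeline

# hypothetical: prompts generated by tools/gencap.py, one per line
with open('prompts.txt') as f:
    prompts = [line.strip() for line in f if line.strip()]

pipe = StableDiffusionPipeline.from_pretrained(
    'runwayml/stable-diffusion-v1-5', torch_dtype=torch.float16).to('cuda')

os.makedirs('images', exist_ok=True)
for i, prompt in enumerate(prompts):
    image = pipe(prompt).images[0]
    image.save(f'images/{i:05d}.png')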
Then make a meta file to record the images and prompts. Prompts are not necessary if you set disable-prompts (see the training section below).
python tools/makemeta.py --imgs IMAGE_DIR [--captions PROMPTS]
It collects the png and jpg files in IMAGE_DIR, sorts them by file name, and generates a metadata.jsonl in IMAGE_DIR with the same format as huggingface's ImageFolder. If prompts are provided, they should be in the same order as the sorted file names.
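For reference, a sketch of what this step produces (the caption column name follows the common ImageFolder convention; the actual tools/makemeta.py may differ, and the paths are placeholders):
import json, os

image_dir = 'data/bedroom'                       # IMAGE_DIR (placeholder)
prompt_file = 'prompts.txt'                      # optional; omit when training with --disable_prompts
prompts = open(prompt_file).read().splitlines() if os.path.exists(prompt_file) else None

files = sorted(f for f in os.listdir(image_dir) if f.lower().endswith(('.png', '.jpg')))
with open(os.path.join(image_dir, 'metadata.jsonl'), 'w') as fout:
    for i, name in enumerate(files):
        entry = {'file_name': name}
        if prompts:
            entry['text'] = prompts[i]           # ImageFolder's caption column is usually named "text"
        fout.write(json.dumps(entry) + '\n')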
Then we generate pseudo ground truths with the following code bases.
surface normals: 3DCommonCorruptions
depths: ZoeDepth
albedo and shading: PIE-Net
semantic segmentation: EVA-02
For normals, albedo, and shading, clone the corresponding repos, set up their environments, and put getnorm.py and getintr.py in the respective directories.
For depths, getdepth.py can be run on its own.
python tools/get{norm,depth,intr}.py -i INPUT_IMG_DIR -o OUTPUT_DIR
These scripts store the predictions in lmdb by default. The keys are the file names without extensions; albedo and shading outputs get an extra -r (reflectance) or -s (shading) suffix. Use --save-files to save the outputs as files.
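For reference, the stored pseudo ground truths can be read back like this (a minimal sketch; the database path and key are placeholders, and how the values are encoded depends on the script):
import lmdb

env = lmdb.open('/path/to/output_db', readonly=True, lock=False)
with env.begin() as txn:
    # keys are file names without extensions, e.g. '00001' for 00001.png;
    # albedo and shading predictions carry an extra '-r' / '-s' suffix
    normal_or_depth = txn.get('00001'.encode())   # raw bytes
    albedo = txn.get('00001-r'.encode())
    shading = txn.get('00001-s'.encode())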
For semantic segmentation, generate segmentation maps with eva02_L_ade_seg_upernet_sz512 in EVA-02.
Then collect the segmentation maps into an lmdb.
python tools/makedb.py INPUT_DIR OUTPUT_DB
To reproduce the trained models, the following script gives the basic setting for all tasks. The script is adapted from an example provided by huggingface.
DATA_DIR="/path/to/data"
TARGET_DB="/path/to/target"
OUTPUT_DIR="/path/to/output"
accelerate launch --mixed_precision="fp16" train.py \
--train_data_dir=$DATA_DIR \
--train_batch_size=8 \
--max_train_steps=50000 \
--learning_rate=1e-04 \
--lr_scheduler="cosine" \
--lr_warmup_steps=0 \
--output_dir=$OUTPUT_DIR \
--target_db=$TARGET_DB \
--prediction_type="v_prediction"
Additionally, for depths, set --target_mode=F and --target_scale=8.
For depths, albedo, shading, and segmentation, set --random_flip.
For albedo, set --target_extra_key=r.
For shading, set --target_extra_key=s.
To add and train LoRA for only self-attention, set --self_attn_only.
To disable prompts, set --disable_prompts.
To enable xformers, set --enable_xformers_memory_efficient_attention.
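For example, a depth-prediction run without prompts combines the base script with the flags above like this (only an illustration; the variables are the same placeholders as in the base script):
accelerate launch --mixed_precision="fp16" train.py \
--train_data_dir=$DATA_DIR \
--train_batch_size=8 \
--max_train_steps=50000 \
--learning_rate=1e-04 \
--lr_scheduler="cosine" \
--lr_warmup_steps=0 \
--output_dir=$OUTPUT_DIR \
--target_db=$TARGET_DB \
--prediction_type="v_prediction" \
--target_mode=F \
--target_scale=8 \
--random_flip \
--disable_prompts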
To generate predictions, run infer.py with the same options you used for train.py.
DATA_DIR="/path/to/source/images" # optional
PROMPTS="/path/to/prompts.txt" # optional
LORA_DIR="/path/to/train/output"
OUTPUT_DIR="/path/to/output"
python infer.py \
--src $DATA_DIR \
--prompts $PROMPTS \
--lora-ckpt $LORA_DIR \
--output $OUTPUT_DIR \
--config config.yaml \
--batch-size 4
For depths, set --target-mode=F and --target-scale=8. The script generates depth maps and saves them in NumPy compressed npz format under the key x.
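The saved depth maps can then be loaded like this (the file name is a placeholder):
import numpy as np

depth = np.load('/path/to/output/00001.npz')['x']  # 2D float depth map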
Optionally set --target-pred-type, --self-attn-only, and --disable-prompts to match the training settings. If you don't provide --src, it will generate images with the original (no LoRA) model from --prompts. If you don't set --disable-prompts but forget to provide --prompts, it will raise an error.
More settings for the generation process, such as the number of generation steps and guidance scales, are in config.yaml.
Besides, in the paper we construct the samples of previous diffusion steps from the input images and the estimated output predictions, but we empirically found that using the original DDIM, which estimates both input images and output predictions, gives slightly worse in-domain performance but slightly better generalizability. The difference is small, though. The results in the paper were generated by the original DDIM. Set --use-oracle-ddim to use exactly the same generation process as in the paper.
Also note that the words in these options are connected by hyphens (-), not underscores (_).
The evaluation script runs on GPU. For normals,
python test/eval.py PRED GROUND_TRUTH --metrics l1 angular
For depths,
python test/eval.py PRED GROUND_TRUTH --metrics rel delta --ext npz --abs
python test/eval.py PRED GROUND_TRUTH --metrics rmse --ext npz --abs --norm
For albedo and shading,
python test/eval.py PRED GROUND_TRUTH --metrics mse
For segmentation, turn output images into class maps.
python tools/color2cls.py INPUT_DIR OUTPUT_DIR --pal 2 --ext npy --filter
Then calculate mIoU.
python test/miou.py PRED GROUND_TRUTH
The mIoU evaluation is borrowed from MMSegmentation.
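For intuition, per-class IoU over integer class maps boils down to the following simplified numpy sketch (not the MMSegmentation implementation, which handles ignored labels and class reduction more carefully):
import numpy as np

def miou(pred: np.ndarray, gt: np.ndarray, num_classes: int) -> float:
    """Mean IoU over classes that appear in the prediction or ground truth."""
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union > 0:
            ious.append(inter / union)
    return float(np.mean(ious))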
This repo contains code from diffusers and MMSegmentation.
@InProceedings{lee2024dmp,
author = {Lee, Hsin-Ying and Tseng, Hung-Yu and Lee, Hsin-Ying and Yang, Ming-Hsuan},
title = {Exploiting Diffusion Prior for Generalizable Dense Prediction},
booktitle = {Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
year = {2024},
}