Hsin-Ying Lee, Hung-Yu Tseng, Hsin-Ying Lee, Ming-Hsuan Yang
CVPR 2024
This is the official implementation of Exploiting Diffusion Prior for Generalizable Dense Prediction.
Our implementation is based on Python 3.10 and CUDA 11.3.
Required
diffusers==0.20.0
pytorch==1.12.1
torchvision==0.13.1
transformers==4.31.0
Optional
accelerate # for training
gradio # for demo
omegaconf # for configuration
xformers # for acceleration
We provide model weights for five tasks to reproduce the results in the paper. These checkpoints are trained with 10K synthesized bedroom images, prompts, and pseudo ground truths.
In addition, for normal and depth prediction, we provide weights trained on more diverse scenes and without prompts, which are better suited to practical use.
Download the weights from this Google Drive and place them in the root directory.
For checkpoints with -notext, set disable_prompts=True.
from PIL import Image
from pipeline import Pipeline
LORA_DIR = 'ckpt/normal-scene100-notext'
disable_prompts = LORA_DIR.endswith('-notext')
ppl = Pipeline(
disable_prompts=disable_prompts,
lora_ckpt=LORA_DIR,
device='cuda',
mixed_precision='fp16',
)
img = Image.open('/path/to/img')
For depth prediction,
output_np_array = ppl(img, inference_step=5, target_mode='F')
Otherwise,
output_pil_img = ppl(img, inference_step=5, target_mode='RGB')
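The outputs can then be saved, for example, like this (a minimal sketch; the file names are placeholders, and writing the depth as an npz with key x simply mirrors the convention infer.py uses below):
import numpy as np

# normals / albedo / shading / segmentation: the pipeline returns a PIL image
output_pil_img.save('normal_pred.png')

# depth: the pipeline returns a float numpy array
np.savez_compressed('depth_pred.npz', x=output_np_array)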
Alternatively, we provide a Gradio demo. You can launch it with
python app.py
and access the app at localhost:7860.
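For reference, a minimal sketch of wrapping the Pipeline in Gradio looks like the following. This is not the actual app.py, which may differ; the checkpoint and settings reuse the usage example above.
import gradio as gr
from PIL import Image
from pipeline import Pipeline

# assumes the normal-prediction checkpoint from the example above
ppl = Pipeline(
    disable_prompts=True,
    lora_ckpt='ckpt/normal-scene100-notext',
    device='cuda',
    mixed_precision='fp16',
)

def predict(img: Image.Image) -> Image.Image:
    # inference_step=5 and target_mode='RGB', as in the usage example above
    return ppl(img, inference_step=5, target_mode='RGB')

demo = gr.Interface(fn=predict, inputs=gr.Image(type='pil'), outputs=gr.Image(type='pil'))
demo.launch()  # serves on localhost:7860 by default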
We conduct the experiments on synthetic images so that we can control and analyze performance across data domains. We first generate prompts from scene keywords, then generate images from the prompts.
To generate prompts,
python tools/gencap.py KEYWORD -n NUMBER_OF_PROMPTS -o OUTPUT_TXT
KEYWORD can be a single word or a text file containing multiple keywords, one per line.
To generate images,
python tools/txt2img.py --from-file PROMPTS_TXT --output OUTPUT_DIR --batch-size BSZ
These two scripts are thin wrappers around huggingface's transformers and diffusers.
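For reference, the image-generation step is roughly equivalent to the following diffusers snippet. This is only a sketch, not the actual tools/txt2img.py; the base model name and file paths are assumptions.
import os
import torch
from diffusers import StableDiffusionPipeline

# hypothetical: prompts generated by tools/gencap.py, one per line
with open('prompts.txt') as f:
    prompts = [line.strip() for line in f if line.strip()]

pipe = StableDiffusionPipeline.from_pretrained(
    'runwayml/stable-diffusion-v1-5', torch_dtype=torch.float16).to('cuda')

os.makedirs('images', exist_ok=True)
for i, prompt in enumerate(prompts):
    image = pipe(prompt).images[0]
    image.save(f'images/{i:05d}.png')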
Then make a meta file to record the images and prompts. Prompts are not necessary if you set disable-prompts (see the training section below).
python tools/makemeta.py --imgs IMAGE_DIR [--captions PROMPTS]
It collects the png and jpg files in IMAGE_DIR, sorts them by file name, and generates a metadata.jsonl in IMAGE_DIR with the same format as huggingface's ImageFolder. If prompts are provided, they should be in the same order as the sorted file names.
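For reference, a sketch of what this step produces (the caption column name follows the common ImageFolder convention; the actual tools/makemeta.py may differ, and the paths are placeholders):
import json, os

image_dir = 'data/bedroom'                       # IMAGE_DIR (placeholder)
prompt_file = 'prompts.txt'                      # optional; omit when training with --disable_prompts
prompts = open(prompt_file).read().splitlines() if os.path.exists(prompt_file) else None

files = sorted(f for f in os.listdir(image_dir) if f.lower().endswith(('.png', '.jpg')))
with open(os.path.join(image_dir, 'metadata.jsonl'), 'w') as fout:
    for i, name in enumerate(files):
        entry = {'file_name': name}
        if prompts:
            entry['text'] = prompts[i]           # ImageFolder's caption column is usually named "text"
        fout.write(json.dumps(entry) + '\n')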
Then we generate pseudo ground truths with the following code bases.
surface normals: 3DCommonCorruptions
depths: ZoeDepth
albedo and shading: PIE-Net
semantic segmentation: EVA-02
For normals, albedo, and shading, clone the corresponding repos, set up their environments, and put getnorm.py and getintr.py in the respective directories.
For depths, getdepth.py can be run on its own.
python tools/get{norm,depth,intr}.py -i INPUT_IMG_DIR -o OUTPUT_DIR
These scripts store the predictions in lmdb by default. The keys are the file names without extensions; albedo and shading outputs get an extra -r (reflectance) or -s (shading) suffix. Use --save-files to save the outputs as files.
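For reference, the stored pseudo ground truths can be read back like this (a minimal sketch; the database path and key are placeholders, and how the values are encoded depends on the script):
import lmdb

env = lmdb.open('/path/to/output_db', readonly=True, lock=False)
with env.begin() as txn:
    # keys are file names without extensions, e.g. '00001' for 00001.png;
    # albedo and shading predictions carry an extra '-r' / '-s' suffix
    normal_or_depth = txn.get('00001'.encode())   # raw bytes
    albedo = txn.get('00001-r'.encode())
    shading = txn.get('00001-s'.encode())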
For semantic segmentation, generate segmentation maps with eva02_L_ade_seg_upernet_sz512 in EVA-02.
Then collect the segmentation maps into an lmdb.
python tools/makedb.py INPUT_DIR OUTPUT_DB
To reproduce the trained models, the following script gives the basic setting for all tasks. The script is adapted from an example provided by huggingface.
DATA_DIR="/path/to/data"
TARGET_DB="/path/to/target"
OUTPUT_DIR="/path/to/output"
accelerate launch --mixed_precision="fp16" train.py \
--train_data_dir=$DATA_DIR \
--train_batch_size=8 \
--max_train_steps=50000 \
--learning_rate=1e-04 \
--lr_scheduler="cosine" \
--lr_warmup_steps=0 \
--output_dir=$OUTPUT_DIR \
--target_db=$TARGET_DB \
--prediction_type="v_prediction"
Additionally, for depths, set --target_mode=F and --target_scale=8.
For depths, albedo, shading, and segmentation, set --random_flip.
For albedo, set --target_extra_key=r.
For shading, set --target_extra_key=s.
To add and train LoRA for only self-attention, set --self_attn_only.
To disable prompts, set --disable_prompts.
To enable xformers, set --enable_xformers_memory_efficient_attention.
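For example, a depth-prediction run without prompts combines the base script with the flags above like this (only an illustration; the variables are the same placeholders as in the base script):
accelerate launch --mixed_precision="fp16" train.py \
--train_data_dir=$DATA_DIR \
--train_batch_size=8 \
--max_train_steps=50000 \
--learning_rate=1e-04 \
--lr_scheduler="cosine" \
--lr_warmup_steps=0 \
--output_dir=$OUTPUT_DIR \
--target_db=$TARGET_DB \
--prediction_type="v_prediction" \
--target_mode=F \
--target_scale=8 \
--random_flip \
--disable_prompts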
To generate predictions, run infer.py with the same options you used for train.py.
DATA_DIR="/path/to/source/images" # optional
PROMPTS="/path/to/prompts.txt" # optional
LORA_DIR="/path/to/train/output"
OUTPUT_DIR="/path/to/output"
python infer.py \
--src $DATA_DIR \
--prompts $PROMPTS \
--lora-ckpt $LORA_DIR \
--output $OUTPUT_DIR \
--config config.yaml \
--batch-size 4
For depths, set --target-mode=F and --target-scale=8. The script generates depth maps and saves them in NumPy compressed npz format under the key x.
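The saved depth maps can then be loaded like this (the file name is a placeholder):
import numpy as np

depth = np.load('/path/to/output/00001.npz')['x']  # 2D float depth map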
Optionally set --target-pred-type, --self-attn-only, and --disable-prompts to match the training settings. If you don't provide --src, it will generate images with the original (no LoRA) model from --prompts. If you don't set --disable-prompts but forget to provide --prompts, it will raise an error.
More settings for the generation process, such as the number of generation steps and guidance scales, are in config.yaml.
Besides, in the paper we construct the samples of previous diffusion steps from the input images and the estimated output predictions, but we empirically found that using the original DDIM, which estimates both input images and output predictions, gives slightly worse in-domain performance but slightly better generalizability. The difference is small, though. The results in the paper were generated by the original DDIM. Set --use-oracle-ddim to use exactly the same generation process as in the paper.
Also note that the words in these options are connected by hyphens (-), not underscores (_).
The evaluation script runs on GPU. For normals,
python test/eval.py PRED GROUND_TRUTH --metrics l1 angular
For depths,
python test/eval.py PRED GROUND_TRUTH --metrics rel delta --ext npz --abs
python test/eval.py PRED GROUND_TRUTH --metrics rmse --ext npz --abs --norm
For albedo and shading,
python test/eval.py PRED GROUND_TRUTH --metrics mse
For segmentation, turn output images into class maps.
python tools/color2cls.py INPUT_DIR OUTPUT_DIR --pal 2 --ext npy --filter
Then calculate mIoU.
python test/miou.py PRED GROUND_TRUTH
The mIoU evaluation is borrowed from MMSegmentation.
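For intuition, per-class IoU over integer class maps boils down to the following simplified numpy sketch (not the MMSegmentation implementation, which handles ignored labels and class reduction more carefully):
import numpy as np

def miou(pred: np.ndarray, gt: np.ndarray, num_classes: int) -> float:
    """Mean IoU over classes that appear in the prediction or ground truth."""
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union > 0:
            ious.append(inter / union)
    return float(np.mean(ious))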
This repo contains code from diffusers and MMSegmentation.
@InProceedings{lee2024dmp,
author = {Lee, Hsin-Ying and Tseng, Hung-Yu and Lee, Hsin-Ying and Yang, Ming-Hsuan},
title = {Exploiting Diffusion Prior for Generalizable Dense Prediction},
booktitle = {Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
year = {2024},
}