Jiun Tian Hoe, Xudong Jiang, Chee Seng Chan, Yap Peng Tan, Weipeng Hu
Project Page | paper | arXiv | WebUI | Demo | Video | Diffuser | Colab
Model | Interaction Controllability (Tiny) | Interaction Controllability (Large) | FID | KID
---|---|---|---|---
v1.0 | 29.53 | 31.56 | 18.69 | 0.00676
v1.1 | 30.20 | 31.96 | 17.90 | 0.00635
v1.2 | 30.73 | 33.10 | 17.32 | 0.00585
Interaction Controllability is measured using the FGAHOI detection score. In this table, we report the Full subset in the Default setting with the Swin-Tiny and Swin-Large backbones. More details on the protocol are in the paper.
We provide three checkpoints with different training strategies.

Version | Dataset | SD | Download
---|---|---|---
v1.0 | HICO-DET | v1.4 | HF Hub
v1.1 | HICO-DET | v1.5 | HF Hub
v1.2 | HICO-DET + VisualGenome | v1.5 | HF Hub
Note that the experimental results in our paper refer to v1.0.
We developed an extension for AUTOMATIC1111's Stable Diffusion WebUI that allows InteractDiffusion to be used with existing SD models. Check out the plugin at sd-webui-interactdiffusion. Note that it is still in alpha.
Some examples generated with InteractDiffusion, together with other DreamBooth and LoRA models.
from diffusers import DiffusionPipeline
import torch

# Load the v1.2 checkpoint in half precision; the pipeline code ships with
# the model repository, hence trust_remote_code=True.
pipeline = DiffusionPipeline.from_pretrained(
    "interactdiffusion/diffusers-v1-2",
    trust_remote_code=True,
    variant="fp16",
    torch_dtype=torch.float16,
)
pipeline = pipeline.to("cuda")

# One list entry per interaction: subject, object, and action phrases,
# plus one box each for the subject and the object.
images = pipeline(
    prompt="a person is feeding a cat",
    interactdiffusion_subject_phrases=["person"],
    interactdiffusion_object_phrases=["cat"],
    interactdiffusion_action_phrases=["feeding"],
    interactdiffusion_subject_boxes=[[0.0332, 0.1660, 0.3359, 0.7305]],
    interactdiffusion_object_boxes=[[0.2891, 0.4766, 0.6680, 0.7930]],
    interactdiffusion_scheduled_sampling_beta=1,
    output_type="pil",
    num_inference_steps=50,
).images

images[0].save('out.jpg')
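The box values in the example above look like pixel coordinates on a 512x512 canvas divided by the image size and rounded to four decimals. The following is a minimal sketch of that conversion, assuming boxes are given as `[x0, y0, x1, y1]` with the origin at the top-left corner; `normalize_box` is a hypothetical helper and not part of the InteractDiffusion pipeline.

```python
def normalize_box(box_px, width, height, ndigits=4):
    """Convert a pixel box [x0, y0, x1, y1] to coordinates in [0, 1].

    Assumption: the pipeline expects corner coordinates normalized by the
    image width and height. This helper is illustrative only.
    """
    x0, y0, x1, y1 = box_px
    if not (0 <= x0 < x1 <= width and 0 <= y0 < y1 <= height):
        raise ValueError(f"box {box_px} does not fit a {width}x{height} image")
    return [
        round(x0 / width, ndigits),
        round(y0 / height, ndigits),
        round(x1 / width, ndigits),
        round(y1 / height, ndigits),
    ]


# Pixel boxes chosen so that they reproduce the values used in the example.
subject_box = normalize_box([17, 85, 172, 374], 512, 512)
object_box = normalize_box([148, 244, 342, 406], 512, 512)
```

The resulting lists can then be passed as single entries of `interactdiffusion_subject_boxes` and `interactdiffusion_object_boxes`.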
Change ckpt.pth in inference_batch.py to the selected checkpoint.
Run inference with InteractDiffusion to synthesize the HICO-DET test set from the ground-truth annotations.
python inference_batch.py --batch_size 1 --folder generated_output --seed 489 --scheduled-sampling 1.0 --half
Set up FGAHOI at ../FGAHOI. See the FGAHOI repo for how to set up FGAHOI and the HICO-DET dataset in data/hico_20160224_det.
Prepare the generated outputs for evaluation on FGAHOI. See id_prepare_inference.ipynb.
Evaluate on FGAHOI.
python main.py --backbone swin_tiny --dataset_file hico --resume weights/FGAHOI_Tiny.pth --num_verb_classes 117 --num_obj_classes 80 --output_dir logs --merge --hierarchical_merge --task_merge --eval --hoi_path data/id_generated_output --pretrain_model_path "" --output_dir logs/id-generated-output-t
Evaluate FID and KID. For a fair comparison, we recommend resizing the HICO-DET dataset to 512x512 before evaluating image quality. We use torch-fidelity.
fidelity --gpu 0 --fid --isc --kid --input2 ~/data/hico_det_test_resize --input1 ~/FGAHOI/data/data/id_generated_output/images/test2015
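The resizing step recommended above could be sketched as follows. This is an assumption-laden example rather than the script used for the paper: it relies on Pillow, stretches each image to 512x512 without preserving aspect ratio, and uses bicubic interpolation. `resize_folder` and the folder layout are hypothetical.

```python
from pathlib import Path

from PIL import Image

IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def resize_folder(src_dir, dst_dir, size=512):
    """Resize every image in src_dir to size x size and write it to dst_dir.

    Assumptions: a flat folder of images, plain stretching (no cropping or
    padding), and bicubic resampling. Returns the number of images written.
    """
    src = Path(src_dir).expanduser()
    dst = Path(dst_dir).expanduser()
    dst.mkdir(parents=True, exist_ok=True)

    count = 0
    for path in sorted(src.iterdir()):
        if path.suffix.lower() not in IMAGE_SUFFIXES:
            continue  # skip annotation files and other non-image entries
        with Image.open(path) as img:
            resized = img.convert("RGB").resize((size, size), Image.BICUBIC)
            resized.save(dst / path.name)
        count += 1
    return count
```

Pointing `dst_dir` at the folder passed to `--input2` (here `~/data/hico_det_test_resize`) would line up with the command above.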
This should provide a brief overview of how the evaluation process works.
Run the following command:
CUDA_VISIBLE_DEVICES=0,1 torchrun --nproc_per_node=2 main.py --yaml_file configs/hoi_hico_text.yaml --ckpt <existing_gligen_checkpoint> --name test --batch_size=4 --gradient_accumulation_step 2 --total_iters 500000 --amp true --disable_inference_in_training true --official_ckpt_name <existing SD v1.4/v1.5 checkpoint>
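When changing the number of GPUs, it can help to keep the effective batch size constant. The sketch below shows the arithmetic, assuming `--batch_size` is the per-process batch size and that gradients are accumulated over `--gradient_accumulation_step` micro-batches before each optimizer step; `effective_batch_size` is a hypothetical helper, not part of the training code.

```python
def effective_batch_size(num_processes, batch_size, grad_accum_steps):
    """Samples contributing to one optimizer step.

    Assumption: batch_size is per process (one process per GPU under
    torchrun) and gradients are accumulated across grad_accum_steps.
    """
    return num_processes * batch_size * grad_accum_steps


# Values taken from the command above: 2 processes, batch size 4, 2 steps.
print(effective_batch_size(num_processes=2, batch_size=4, grad_accum_steps=2))
```

Under these assumptions, training on a single GPU with `--gradient_accumulation_step 4` would give the same effective batch size.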
@InProceedings{Hoe_2024_CVPR,
author = {Hoe, Jiun Tian and Jiang, Xudong and Chan, Chee Seng and Tan, Yap-Peng and Hu, Weipeng},
title = {InteractDiffusion: Interaction Control in Text-to-Image Diffusion Models},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
month = {June},
year = {2024},
pages = {6180-6189}
}
This work is developed based on the codebase of GLIGEN and LDM.