janghyuncho / DECOLA

Code release for "Language-conditioned Detection Transformer"

Language-conditioned Detection Transformer

Jang Hyun Cho and Philipp Krähenbühl
CVPR 2024 ([pdf][supp])

What is DECOLA?

We design a new open-vocabulary detection framework that adjusts the inner mechanism of the object detector to the concepts it reasons over. This language-conditioned detector (DECOLA) trains as easily as classical detectors but generalizes much better to novel concepts. DECOLA trains in three steps: (1) learning to condition on a set of concepts, (2) pseudo-labeling image-level data to scale up the training data, and (3) learning a general-purpose detector for downstream open-vocabulary detection. We show strong zero-shot performance on open-vocabulary and standard LVIS benchmarks. [Full abstract]

TL;DR: We design a special detector for pseudo-labeling and scale up open-vocabulary detection through self-training.

Please feel free to reach out with any questions or for discussion!

📧 Jang Hyun Cho [email]

🔥 News 🔥



See installation instructions.


We provide a demo based on the detectron2 demo interface.

DECOLA Phase 1: Language-conditioned detection.

First, please download the appropriate model checkpoint. Then you can run the demo as follows:

python demo.py --config-file configs/DECOLA_PHASE1_L_CLIP_SwinB_4x.yaml --input figs/input/pizza.jpg --output figs/output/pizza.jpg --vocabulary custom --custom_vocabulary cola,piza,fork,knif,table --confidence-threshold 0.3 --language-condition --opts MODEL.WEIGHTS weights/DECOLA_PHASE1_L_CLIP_SwinB_4x.pth 

The model above is DECOLA Phase 1 with a Swin-B backbone (config), trained only on the LVIS dataset. If everything is set up properly, the output image should look like the one below:

Note that cola is not in the LVIS vocabulary, and that piza and knif are intentional typos. Similarly,

python demo.py --config-file configs/DECOLA_PHASE1_L_CLIP_SwinB_4x.yaml --input figs/input/cola.jpg --output figs/output/cola.jpg --vocabulary custom --custom_vocabulary cola,cat,mentos,table --confidence-threshold 0.3 --language-condition --opts MODEL.WEIGHTS weights/DECOLA_PHASE1_L_CLIP_SwinB_4x.pth 

Here DECOLA successfully predicts mentos and cola, which are again outside the LVIS vocabulary.
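To try several images or vocabularies, the Phase 1 commands above can be assembled programmatically. The following is only a convenience sketch and is not part of the repository: `build_demo_command` and `run_demo` are hypothetical helper names, and the sketch uses only the flags that appear in the commands above.

```python
import subprocess


def build_demo_command(config, weights, image_in, image_out, vocabulary,
                       threshold=0.3, language_condition=True):
    """Assemble the demo.py argument list (hypothetical helper, not in the repo)."""
    cmd = [
        "python", "demo.py",
        "--config-file", config,
        "--input", image_in,
        "--output", image_out,
        "--vocabulary", "custom",
        # The commands above pass the vocabulary as one comma-separated string.
        "--custom_vocabulary", ",".join(vocabulary),
        "--confidence-threshold", str(threshold),
    ]
    if language_condition:
        # Flag used by the Phase 1 (language-conditioned) commands above.
        cmd.append("--language-condition")
    cmd += ["--opts", "MODEL.WEIGHTS", weights]
    return cmd


def run_demo(cmd):
    """Run the assembled command; call this from the repository root."""
    subprocess.run(cmd, check=True)


cmd = build_demo_command(
    config="configs/DECOLA_PHASE1_L_CLIP_SwinB_4x.yaml",
    weights="weights/DECOLA_PHASE1_L_CLIP_SwinB_4x.pth",
    image_in="figs/input/cola.jpg",
    image_out="figs/output/cola.jpg",
    vocabulary=["cola", "cat", "mentos", "table"],
)
print(" ".join(cmd))
```

Passing the arguments as a list avoids shell quoting issues when a concept name contains spaces or special characters.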

DECOLA Phase 2: General-purpose detection.

General-purpose detection with Phase 2 of DECOLA is also available with both a custom vocabulary

python demo.py --config-file configs/DECOLA_PHASE2_LI_CLIP_SwinB_4x_ft4x.yaml --input figs/input/desk.jpg --output figs/output/desk1.jpg --vocabulary custom --custom_vocabulary water_bottle,wallet,webcam,mug,headphone,drawer,keyboard,laptop,straw,mouse,paper,plastic_bag --confidence-threshold 0.2 --opts MODEL.WEIGHTS weights/DECOLA_PHASE2_LI_CLIP_SwinB_4x_ft4x.pth 

and a pre-defined vocabulary (e.g., LVIS).

python demo.py --config-file configs/DECOLA_PHASE2_LI_CLIP_SwinB_4x_ft4x.yaml --input figs/input/desk.jpg --output figs/output/desk2.jpg --vocabulary lvis --confidence-threshold 0.2 --opts MODEL.WEIGHTS weights/DECOLA_PHASE2_LI_CLIP_SwinB_4x_ft4x.pth 

Integrating Segment Anything Model

We combine DECOLA's language-conditioned, open-vocabulary detection with the Segment Anything Model (SAM). DECOLA's box outputs prompt SAM to generate high-quality, class-aware instance segmentation. Simply install SAM and add the --use-sam flag:

python demo.py --config-file configs/DECOLA_PHASE2_LI_CLIP_SwinB_4x_ft4x.yaml --input figs/input/desk.jpg --output figs/output_sam/desk2.jpg --vocabulary lvis --confidence-threshold 0.2 --use-sam --opts MODEL.WEIGHTS weights/DECOLA_PHASE2_LI_CLIP_SwinB_4x_ft4x.pth 

Image credit: David Fouhey.
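The box-to-mask prompting described above can be sketched with SAM's public `SamPredictor` interface. This is an illustration under stated assumptions, not the code behind `--use-sam`: the helper names `select_boxes` and `boxes_to_masks`, the `vit_h` model type, and the sample detections are all made up for the example, and boxes are assumed to be in XYXY pixel coordinates.

```python
def select_boxes(boxes, scores, threshold=0.2):
    """Keep (box, score) pairs whose score is at least `threshold`."""
    return [(box, score) for box, score in zip(boxes, scores) if score >= threshold]


def boxes_to_masks(image_rgb, boxes, sam_checkpoint, model_type="vit_h"):
    """Prompt SAM with one detector box at a time; return one mask per box.

    image_rgb is an HxWx3 uint8 RGB array and sam_checkpoint is the path
    to a downloaded SAM checkpoint (both supplied by the caller).
    """
    # Imported lazily so the filtering helper works without SAM installed.
    import numpy as np
    from segment_anything import SamPredictor, sam_model_registry

    sam = sam_model_registry[model_type](checkpoint=sam_checkpoint)
    predictor = SamPredictor(sam)
    predictor.set_image(image_rgb)
    masks = []
    for box in boxes:
        mask, _, _ = predictor.predict(
            box=np.asarray(box, dtype=np.float32),  # XYXY box prompt
            multimask_output=False,                 # a single mask per box
        )
        masks.append(mask[0])  # (H, W) boolean mask
    return masks


# Made-up detections, filtered with the same 0.2 threshold as the command above.
detections = select_boxes(
    boxes=[[10, 20, 110, 220], [0, 0, 50, 50], [30, 30, 90, 120]],
    scores=[0.91, 0.05, 0.25],
)
kept_boxes = [box for box, _ in detections]
print(len(kept_boxes), "boxes would be sent to SAM")
```

Each kept box keeps the class label assigned by the detector, which is what makes the resulting masks class-aware.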

Training DECOLA

Please prepare the datasets first, then follow the training scripts to reproduce our results.

Testing DECOLA

Checkpoints are available for all of our models as well as the baselines.

Here are the highlighted results:

Open-vocabulary LVIS with Deformable DETR

| name | backbone | box AP_novel | box mAP |
|---|---|---|---|
| baseline | ResNet-50 | 9.4 | 32.2 |
| + self-train | ResNet-50 | 23.2 | 36.2 |
| DECOLA (ours) | ResNet-50 | 27.6 | 38.3 |
| baseline | Swin-B | 16.2 | 41.1 |
| + self-train | Swin-B | 30.8 | 42.3 |
| DECOLA (ours) | Swin-B | 35.7 | 46.3 |
| baseline | Swin-L | 21.9 | 49.6 |
| + self-train | Swin-L | 36.5 | 51.8 |
| DECOLA (ours) | Swin-L | 46.9 | 55.2 |
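As a quick reading aid for the table above, the following sketch computes how much box AP_novel DECOLA gains over each reference row per backbone. The numbers are copied from the table; the `novel_gain` helper is a hypothetical name introduced only for this example.

```python
# box AP_novel copied from the open-vocabulary LVIS (Deformable DETR) table above.
AP_NOVEL = {
    "ResNet-50": {"baseline": 9.4, "+ self-train": 23.2, "DECOLA": 27.6},
    "Swin-B": {"baseline": 16.2, "+ self-train": 30.8, "DECOLA": 35.7},
    "Swin-L": {"baseline": 21.9, "+ self-train": 36.5, "DECOLA": 46.9},
}


def novel_gain(backbone, reference):
    """AP_novel of DECOLA minus AP_novel of `reference`, rounded to one decimal."""
    row = AP_NOVEL[backbone]
    return round(row["DECOLA"] - row[reference], 1)


for backbone in AP_NOVEL:
    over_base = novel_gain(backbone, "baseline")
    over_self = novel_gain(backbone, "+ self-train")
    print(f"{backbone}: {over_base:+.1f} over baseline, {over_self:+.1f} over self-training")
```

On these numbers, the gain over the baseline is 18.2, 19.5, and 25.0 points of AP_novel for ResNet-50, Swin-B, and Swin-L respectively, and the gain over self-training is 4.4, 4.9, and 10.4 points.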

Direct zero-shot transfer to LVIS minival

| name | backbone | data | AP_r | AP_c | AP_f | mAP |
|---|---|---|---|---|---|---|
| DECOLA | Swin-T | O365, IN21K | 32.8 | 32.0 | 31.8 | 32.0 |
| DECOLA | Swin-L | O365, OID, IN21K | 41.5 | 38.0 | 34.9 | 36.8 |

Direct zero-shot transfer to LVIS v1.0

| name | backbone | data | AP_r | AP_c | AP_f | mAP |
|---|---|---|---|---|---|---|
| DECOLA | Swin-T | O365, IN21K | 27.2 | 24.9 | 28.0 | 26.6 |
| DECOLA | Swin-L | O365, OID, IN21K | 32.9 | 29.1 | 30.3 | 30.2 |

Open-vocabulary LVIS with CenterNet2

| name | backbone | box AP_novel | box mAP | mask AP_novel | mask mAP |
|---|---|---|---|---|---|
| DECOLA | ResNet-50 | 29.5 | 37.7 | 27.0 | 33.7 |
| DECOLA | Swin-B | 38.4 | 46.7 | 35.3 | 42.0 |

Standard LVIS with Deformable DETR

| name | backbone | box AP_rare | box mAP |
|---|---|---|---|
| baseline | ResNet-50 | 26.3 | 35.6 |
| + self-train | ResNet-50 | 30.0 | 36.6 |
| DECOLA (ours) | ResNet-50 | 35.9 | 39.4 |
| baseline | Swin-B | 38.3 | 44.5 |
| + self-train | Swin-B | 42.0 | 45.2 |
| DECOLA (ours) | Swin-B | 47.4 | 48.3 |
| baseline | Swin-L | 49.3 | 54.4 |
| + self-train | Swin-L | 48.7 | 53.4 |
| DECOLA (ours) | Swin-L | 54.9 | 56.4 |

Standard LVIS with CenterNet2

| name | backbone | box AP_rare | box mAP | mask AP_rare | mask mAP |
|---|---|---|---|---|---|
| DECOLA (ours) | ResNet-50 | 35.6 | 38.6 | 32.1 | 34.4 |
| DECOLA (ours) | Swin-B | 47.6 | 48.5 | 43.7 | 43.6 |

Analyzing DECOLA

Here we provide code for analyzing our model as well as the baselines.


The majority of DECOLA is licensed under the Apache 2.0 license. However, this work builds largely on Detic, Deformable DETR, and Detectron2. We also provide optional integration with the Segment Anything Model. Please refer to their original licenses for more details.


If you find this project useful for your research, please cite our paper using the following BibTeX entry.

    author    = {Cho, Jang Hyun and Kr\"ahenb\"uhl, Philipp},
    title     = {Language-conditioned Detection Transformer},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2024},
    pages     = {16593-16603}