Rohit K Bharadwaj, Muzammal Naseer, Salman Khan, Fahad Khan
Official code for our paper "Enhancing Novel Object Detection via Cooperative Foundational Models"
Abstract: In this work, we address the challenging and emergent problem of novel object detection (NOD), focusing on the accurate detection of both known and novel object categories during inference. Traditional object detection algorithms are inherently closed-set, limiting their capability to handle NOD. We present a novel approach to transform existing closed-set detectors into open-set detectors. This transformation is achieved by leveraging the complementary strengths of pre-trained foundational models, specifically CLIP and SAM, through our cooperative mechanism. Furthermore, by integrating this mechanism with state-of-the-art open-set detectors such as GDINO, we establish new benchmarks in object detection performance. Our method achieves 17.42 mAP in novel object detection and 42.08 mAP for known objects on the challenging LVIS dataset. Adapting our approach to the COCO OVD split, we surpass the current state-of-the-art by a margin of 7.2 AP50 for novel classes.
We have used `python=3.8.15` and `torch=1.10.1` for all the code in this repository. It is recommended to follow the steps below and set up your conda environment in the same way to replicate the results reported in this paper and repository.
```
git clone git@github.com:rohit901/cooperative-foundational-models.git
```

or

```
git clone https://github.com/rohit901/cooperative-foundational-models.git
```

Then set up the environment:

```
cd cooperative-foundational-models
conda env create -f environment.yml
conda activate coop_foundation_models
python -m pip install detectron2 -f https://dl.fbaipublicfiles.com/detectron2/wheels/cu113/torch1.10/index.html
```
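To verify the environment, here is a quick sanity check in Python (the expected versions follow from the steps above):

```python
import torch
import detectron2

# Versions installed by the steps above; a mismatch usually means a wrong wheel/index URL.
print("torch:", torch.__version__)            # expected: 1.10.1
print("CUDA available:", torch.cuda.is_available())
print("detectron2:", detectron2.__version__)
```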
To download and set up the required datasets used in this work, please follow these steps:
1. Download the COCO 2017 dataset: 2017 Train images, 2017 Val images, 2017 Test images, and their annotation files (2017 Train/Val annotations).
2. Download the COCO OVD split annotation files: `ovd_instances_train2017_base.json` and `ovd_instances_val2017_basetarget.json`.
3. Download the `lvis_val_subset` dataset from LVIS-Val-Subset; specifically, download `lvis_v1_val_subset.json`.
Detectron2 requires the datasets to be arranged in a specific folder structure. For that, it uses the environment variable `DETECTRON2_DATASETS`, which should be set to the path of the directory containing all the different datasets. The file structure under `DETECTRON2_DATASETS` should be as follows:
```
coco/
  annotations/
    instances_train2017.json
    instances_val2017.json
    ovd_instances_train2017_base.json
    ovd_instances_val2017_basetarget.json
    ..other coco annotation json files (optional)..
  train2017/
  val2017/
  test2017/
lvis/
  lvis_v1_val.json
  lvis_v1_train.json
  lvis_v1_val_subset.json
```
The above file structure can also be seen at this OneDrive link: link. Thus, the value of `DETECTRON2_DATASETS` (or `detectron2_dir` in our code files) should be the absolute path to the datasets directory that follows the above structure.
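As a minimal sketch (the path below is a placeholder for your own setup), you can set the variable and spot-check the expected layout from Python:

```python
import os
from pathlib import Path

# Placeholder: replace with the absolute path to your datasets directory.
datasets_root = Path("/absolute/path/to/datasets")
os.environ["DETECTRON2_DATASETS"] = str(datasets_root)

# Spot-check a few files/folders from the structure above.
for rel in ["coco/annotations/instances_val2017.json",
            "coco/val2017",
            "lvis/lvis_v1_val_subset.json"]:
    assert (datasets_root / rel).exists(), f"missing: {rel}"
print("dataset layout looks OK")
```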
All the pre-trained model weights can be downloaded from this link: model weights. The folder contains the following model weights:
- `GDINO_weights.pth`: set `gdino_checkpoint` in the `params.json` file to point to this file.
- `SAM_weights.pth`: set `sam_checkpoint` in the `params.json` file to point to this file.
- `maskrcnn_v2`: set `rcnn_weight_dir` in the `scripts/novel_object_detection/params.json` file to point to this folder.
- Mask-RCNN training checkpoint: set `CHECKPOINT_PATH` in `scripts/open_vocab_detection/train_mask_rcnn/train.batch` to point to this file.
- `MaskRCNN_COCO_OVD`: set `rcnn_weight_dir` in the `scripts/open_vocab_detection/evaluate_method/params.json` file to point to this folder.

Our main novel object detection results on LVIS are summarized below:

| Method | Mask-RCNN | GDINO | VLM | Novel AP | Known AP | All AP |
|---|---|---|---|---|---|---|
| K-Means | - | - | - | 0.20 | 17.77 | 1.55 |
| Weng et al. | - | - | - | 0.27 | 17.85 | 1.62 |
| ORCA | - | - | - | 0.49 | 20.57 | 2.03 |
| UNO | - | - | - | 0.61 | 21.09 | 2.18 |
| RNCDL | V1 | - | - | 5.42 | 25.00 | 6.92 |
| GDINO | - | ✔ | - | 13.47 | 37.13 | 15.30 |
| Ours | V2 | ✔ | SigLIP | 17.42 | 42.08 | 19.33 |
Table 1: Comparison of object detection performance using mAP on the lvis_val dataset.
To replicate our results from the above table (i.e. Table 1 in the main paper):

1. Modify the `scripts/novel_object_detection/params.json` file (a Python sketch of these edits follows step 2):
   - `detectron2_dir`: set it following the instructions in Datasets.
   - `sam_checkpoint`: set the path to the downloaded file `SAM_weights.pth`.
   - `gdino_checkpoint`: set the path to the downloaded file `GDINO_weights.pth`.
   - `rcnn_weight_dir`: set the path to the downloaded folder `maskrcnn_v2` [NOTE: DO NOT put a trailing slash].
2. Run the following script from the main project directory:

   ```
   python scripts/novel_object_detection/main.py
   ```
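For reference, here is a minimal Python sketch of the step 1 edits (all paths are placeholders for your own setup; the keys are the ones named above):

```python
import json

# Placeholder paths: point these at your datasets directory and downloaded weights.
updates = {
    "detectron2_dir": "/absolute/path/to/datasets",
    "sam_checkpoint": "/absolute/path/to/weights/SAM_weights.pth",
    "gdino_checkpoint": "/absolute/path/to/weights/GDINO_weights.pth",
    "rcnn_weight_dir": "/absolute/path/to/weights/maskrcnn_v2",  # no trailing slash
}

cfg_path = "scripts/novel_object_detection/params.json"
with open(cfg_path) as f:
    params = json.load(f)
params.update(updates)  # leave all other keys in the file untouched
with open(cfg_path, "w") as f:
    json.dump(params, f, indent=2)
```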
The above script periodically saves its prediction outputs in the `outputs` directory, which is automatically created at the project level (i.e. `cooperative-foundational-models/outputs`). After the script finishes, the results are printed to the console. Further, the final combined predictions for all 19,809 images in the LVIS val dataset are saved as `instances_predictions.pth`, which can be used with `scripts/novel_object_detection/evaluate_results_from_predictions.py` to compute the final results.
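If you want to inspect the saved predictions yourself, here is a small sketch (assuming the file is a torch-serialized list of per-image prediction records, as produced by detectron2-style evaluators; the exact schema depends on the evaluator configuration):

```python
import torch

# Load the combined predictions saved by scripts/novel_object_detection/main.py.
preds = torch.load("outputs/instances_predictions.pth")
print(type(preds), len(preds))  # expect one record per LVIS val image (19809)
print(preds[0])                 # inspect the schema of a single record
```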
NOTE: Using the code in this repository, we were able to obtain a slightly better overall result with our method than the one reported in the paper:

| Method | Known AP | Novel AP | All AP |
|---|---|---|---|
| Ours (Paper) | 42.08 | 17.42 | 19.33 |
| Ours (GitHub) | 45.43 | 17.25 | 19.43 |
To detect the LVIS class vocabulary (1203 classes) on your own custom images:

1. Set up the environment, datasets, model weights, and `params.json` as described above.
2. Run the following, replacing `custom_image.jpg` with the path to your own image:

   ```
   python scripts/novel_object_detection/inference_single_image.py --image_path custom_image.jpg
   ```

By default, the script visualizes the bounding boxes of the top-5 highest-scoring detections. You can change the top-k visualization parameter by modifying the script; alternatively, you can visualize the outputs based on a confidence-score threshold, as sketched below.
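For example, a hypothetical threshold-based filter (the `scores` and `boxes` tensors below are illustrative stand-ins for the script's per-image outputs):

```python
import torch

# Illustrative stand-ins for the script's per-image outputs.
scores = torch.tensor([0.91, 0.62, 0.40, 0.12])
boxes = torch.rand(4, 4)  # one (x1, y1, x2, y2) box per detection

# Keep detections above a confidence threshold instead of a fixed top-k.
keep = scores >= 0.5
print(boxes[keep], scores[keep])
```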
| Method | Backbone | Use Extra Training Set | Novel AP50 |
|---|---|---|---|
| OVR-CNN | RN50 | ✔ | 22.8 |
| ViLD | ViT-B/32 | ✘ | 27.6 |
| Detic | RN50 | ✔ | 27.8 |
| OV-DETR | ViT-B/32 | ✘ | 29.4 |
| BARON | RN50 | ✘ | 34.0 |
| Rasheed et al. | RN50 | ✔ | 36.6 |
| CORA | RN50x4 | ✘ | 41.7 |
| BARON | RN50 | ✔ | 42.7 |
| CORA+ | RN50x4 | ✔ | 43.1 |
| **Ours*** | RN101 + SwinT | ✘ | 50.3 |
Table 2: Results on COCO OVD benchmark. *Our approach with GDINO, SigLIP, and Mask-RCNN trained on COCO OVD split.
To replicate our results from the above table (i.e. Table 2 in the main paper):

1. Obtain the Mask-RCNN model weights trained on the COCO OVD dataset split. Either download the pre-trained `MaskRCNN_COCO_OVD` folder from Model Weights, or train the model yourself by setting `DETECTRON2_DATASETS` and `CHECKPOINT_PATH` in `scripts/open_vocab_detection/train_mask_rcnn/train.batch` and running:

   ```
   bash scripts/open_vocab_detection/train_mask_rcnn/train.batch
   ```

2. Set the `detectron2_dir`, `sam_checkpoint`, `gdino_checkpoint`, and `rcnn_weight_dir` values in `scripts/open_vocab_detection/evaluate_method/params.json` accordingly. For `rcnn_weight_dir`, set the path to the downloaded folder `MaskRCNN_COCO_OVD`, without a trailing slash.
3. Run the following script from the main project directory:

   ```
   python scripts/open_vocab_detection/evaluate_method/main.py
   ```
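As a quick sanity check before launching the evaluation, here is a sketch that inspects the config (the key names are the ones listed in step 2; paths are whatever you set there):

```python
import json

# Inspect the evaluate_method config before launching the evaluation.
with open("scripts/open_vocab_detection/evaluate_method/params.json") as f:
    params = json.load(f)

# rcnn_weight_dir must point at the MaskRCNN_COCO_OVD folder, without a trailing slash.
assert not params["rcnn_weight_dir"].endswith("/"), "remove the trailing slash"
for key in ("detectron2_dir", "sam_checkpoint", "gdino_checkpoint", "rcnn_weight_dir"):
    print(f"{key} -> {params[key]}")
```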
After executing the above script, the results will be displayed in the console. Ensure you follow the installation and setup steps described in Datasets and Model Weights.
Qualitative comparison of detections from RNCDL, GDINO, RCNN_CLIP, and Ours (visualization images omitted here).
To see additional and higher-resolution visualizations, please visit the project website.

Should you have any questions, please create an issue in this repository or contact rohit.bharadwaj@mbzuai.ac.ae.
We thank the authors of GDINO, SAM, CLIP, and RNCDL for releasing their code.
If you found our work helpful, please consider starring the repository ⭐⭐⭐ and citing our work as follows:
```bibtex
@misc{bharadwaj2023enhancing,
  title={Enhancing Novel Object Detection via Cooperative Foundational Models},
  author={Rohit Bharadwaj and Muzammal Naseer and Salman Khan and Fahad Shahbaz Khan},
  year={2023},
  eprint={2311.12068},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}
```