wanghao9610 / OV-DINO

Official implementation of OV-DINO: Unified Open-Vocabulary Detection with Language-Aware Selective Fusion
https://wanghao9610.github.io/OV-DINO
Apache License 2.0

🦖 OV-DINO

Unified Open-Vocabulary Detection with Language-Aware Selective Fusion

[Hao Wang](https://github.com/wanghao9610)1,2, [Pengzhen Ren](https://scholar.google.com/citations?user=yVxSn70AAAAJ&hl)1, [Zequn Jie](https://scholar.google.com/citations?user=4sKGNB0AAAAJ&hl)3, [Xiao Dong](https://scholar.google.com.sg/citations?user=jXLkbw8AAAAJ&hl)1, [Chengjian Feng](https://fcjian.github.io/)3, [Yinlong Qian](https://scholar.google.com/citations?user=8tPN5CAAAAAJ&hl)3, [Lin Ma](https://forestlinma.com/)3, [Dongmei Jiang](https://scholar.google.com/citations?user=Awsue7sAAAAJ&hl)2, [Yaowei Wang](https://scholar.google.com/citations?user=o_DllmIAAAAJ&hl)2,4, [Xiangyuan Lan](https://scholar.google.com/citations?user=c3iwWRcAAAAJ&hl)2:email:, [Xiaodan Liang](https://scholar.google.com/citations?user=voxznZAAAAAJ&hl)1,2:email:

1 Sun Yat-sen University, 2 Pengcheng Lab, 3 Meituan Inc, 4 HIT, Shenzhen

:email: corresponding author.

[[`Paper`](https://arxiv.org/abs/2407.07844)] [[`HuggingFace`](https://huggingface.co/hao9610/ov-dino-tiny)] [[`Demo`](http://47.115.200.157:7860)] [[`BibTex`](#pushpin-citation)]
[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/ov-dino-unified-open-vocabulary-detection/zero-shot-object-detection-on-mscoco)](https://paperswithcode.com/sota/zero-shot-object-detection-on-mscoco?p=ov-dino-unified-open-vocabulary-detection) [![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/ov-dino-unified-open-vocabulary-detection/zero-shot-object-detection-on-lvis-v1-0)](https://paperswithcode.com/sota/zero-shot-object-detection-on-lvis-v1-0?p=ov-dino-unified-open-vocabulary-detection) [![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/ov-dino-unified-open-vocabulary-detection/zero-shot-object-detection-on-lvis-v1-0-val)](https://paperswithcode.com/sota/zero-shot-object-detection-on-lvis-v1-0-val?p=ov-dino-unified-open-vocabulary-detection)

:fire: Updates

:rocket: Introduction

This project contains the official PyTorch implementation, pre-trained models, fine-tuning code, and inference demo for OV-DINO.

:page_facing_up: Overview

:sparkles: Model Zoo

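| Model | Pre-Train Data | APmv | APr (mv) | APc (mv) | APf (mv) | APval | APr (val) | APc (val) | APf (val) | APcoco | Weights |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| OV-DINO¹ | O365 | 24.4 | 15.5 | 20.3 | 29.7 | 18.7 | 9.3 | 14.5 | 27.4 | 49.5 / 57.5 | CKPT / LOG 🤗 |
| OV-DINO² | O365,GoldG | 39.4 | 32.0 | 38.7 | 41.3 | 32.2 | 26.2 | 30.1 | 37.3 | 50.6 / 58.4 | CKPT 🤗 |
| OV-DINO³ | O365,GoldG,CC1M | 40.1 | 34.5 | 39.5 | 41.5 | 32.9 | 29.1 | 30.4 | 37.4 | 50.2 / 58.2 | CKPT 🤗 |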

NOTE: APmv denotes the zero-shot evaluation results on LVIS MiniVal, APval denotes the zero-shot evaluation results on LVIS Val, and APcoco denotes the (zero-shot / fine-tuned) evaluation results on COCO.

:checkered_flag: Getting Started

1. Project Structure

OV-DINO
├── datas
│   ├── o365
│   │   ├── annotations
│   │   ├── train
│   │   ├── val
│   │   └── test
│   ├── coco
│   │   ├── annotations
│   │   ├── train2017
│   │   └── val2017
│   ├── lvis
│   │   ├── annotations
│   │   ├── train2017
│   │   └── val2017
│   └── custom
│       ├── annotations
│       ├── train
│       └── val
├── docs
├── inits
│   ├── huggingface
│   ├── ovdino
│   ├── sam2
│   └── swin
├── ovdino
│   ├── configs
│   ├── demo
│   ├── detectron2-717ab9
│   ├── detrex
│   ├── projects
│   ├── scripts
│   └── tools
├── wkdrs
│   ├── ...
│
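
Before running evaluation or fine-tuning, it can help to confirm that the data and weight directories from the tree above are in place. The following is a minimal sketch, not part of the OV-DINO repository: the helper name `missing_dirs` and the `EXPECTED_DIRS` list are illustrative assumptions, and the list only mirrors the `datas` and `inits` entries shown in the tree (trim it to the datasets you actually use; `inits/sam2` is only needed for the optional ov-sam setup).

```python
from pathlib import Path

# Sub-directories copied from the project tree, relative to the repo root.
# This list is an illustrative assumption; edit it to match your own setup.
EXPECTED_DIRS = [
    "datas/o365/annotations",
    "datas/o365/train",
    "datas/o365/val",
    "datas/o365/test",
    "datas/coco/annotations",
    "datas/coco/train2017",
    "datas/coco/val2017",
    "datas/lvis/annotations",
    "datas/lvis/train2017",
    "datas/lvis/val2017",
    "inits/huggingface",
    "inits/ovdino",
    "inits/sam2",
    "inits/swin",
]


def missing_dirs(root, expected=EXPECTED_DIRS):
    """Return the expected sub-directories that do not exist under `root`."""
    root = Path(root)
    return [d for d in expected if not (root / d).is_dir()]
```

Calling `missing_dirs(".")` from the repository root returns an empty list when every listed directory exists, and otherwise returns the relative paths that still need to be created or populated.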

2. Installation

# clone this project
git clone https://github.com/wanghao9610/OV-DINO.git
cd OV-DINO
export root_dir=$(realpath ./)
cd $root_dir/ovdino

# Optional: set CUDA_HOME for CUDA 11.6.
# OV-DINO uses CUDA 11.6 by default. If your default CUDA is a different version, export CUDA_HOME manually first.
export CUDA_HOME="your_cuda11.6_path"
export PATH=$CUDA_HOME/bin:$PATH
export LD_LIBRARY_PATH=$CUDA_HOME/lib64:$LD_LIBRARY_PATH
echo -e "cuda version:\n$(nvcc -V)"

# create conda env for ov-dino
conda create -n ovdino -y
conda activate ovdino
conda install pytorch==1.13.1 torchvision==0.14.1 torchaudio==0.13.1 pytorch-cuda=11.6 -c pytorch -c nvidia -y
conda install gcc=9 gxx=9 -c conda-forge -y # Optional: install gcc9
python -m pip install -e detectron2-717ab9
pip install -e ./

# Optional: create a separate conda env for ov-sam, because its dependencies may not be compatible with the ov-dino env.
# ov-sam = ov-dino + sam2
conda create -n ovsam -y
conda activate ovsam
conda install pytorch==2.3.1 torchvision==0.18.1 torchaudio==2.3.1 pytorch-cuda=12.1 -c pytorch -c nvidia -y
# Install sam2 by following the instructions of the sam2 project:
# https://github.com/facebookresearch/segment-anything-2.git
# Download the sam2 checkpoints and put them in inits/sam2.
python -m pip install -e detectron2-717ab9
pip install -e ./
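
The two environments above pin different PyTorch and CUDA versions (1.13.1 with CUDA 11.6 for ov-dino, 2.3.1 with CUDA 12.1 for ov-sam), so a quick check after installation can catch an accidentally mixed setup. The sketch below is an illustration and not part of the repository; the helper names `version_matches`, `check_env`, and `current_env_problems` are hypothetical. It relies only on `torch.__version__` and `torch.version.cuda`, which PyTorch exposes as version strings.

```python
def version_matches(installed, expected):
    """True when `installed` begins with the dotted components of `expected`.

    Local build suffixes such as "+cu116" are ignored.
    """
    installed_parts = installed.split("+")[0].split(".")
    expected_parts = expected.split(".")
    return installed_parts[: len(expected_parts)] == expected_parts


def check_env(torch_version, cuda_version,
              expected_torch="1.13.1", expected_cuda="11.6"):
    """Return a list of human-readable mismatches (empty when all is fine)."""
    problems = []
    if not version_matches(torch_version, expected_torch):
        problems.append(f"torch is {torch_version}, expected {expected_torch}")
    if cuda_version is None:
        problems.append("torch was built without CUDA support")
    elif not version_matches(cuda_version, expected_cuda):
        problems.append(f"CUDA is {cuda_version}, expected {expected_cuda}")
    return problems


def current_env_problems(expected_torch="1.13.1", expected_cuda="11.6"):
    """Check the active environment; requires torch to be importable."""
    import torch  # imported lazily so the helpers above work without torch

    return check_env(torch.__version__, torch.version.cuda,
                     expected_torch, expected_cuda)
```

Inside the `ovdino` env, `current_env_problems()` should return an empty list; inside the `ovsam` env, call `current_env_problems("2.3.1", "12.1")` instead.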

2. Data Preparation

COCO

LVIS

Objects365

Zero-Shot Evaluation on COCO Benchmark

cd $root_dir/ovdino
# Evaluate mean AP on the COCO dataset.
bash scripts/eval.sh \
  projects/ovdino/configs/ovdino_swin_tiny224_bert_base_eval_coco.py \
  ../inits/ovdino/ovdino_swint_og-coco50.6_lvismv39.4_lvis32.2.pth \
  ../wkdrs/eval_ovdino
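
The checkpoint file names used in these commands appear to embed the headline scores from the Model Zoo table: for example, `coco50.6_lvismv39.4_lvis32.2` lines up with the zero-shot APcoco, APmv, and APval of the O365,GoldG model. Assuming that naming pattern holds (it is inferred from the file names, not documented), a small parser can recover the numbers for logging or sanity checks. The function name `parse_ckpt_metrics` is hypothetical.

```python
import re
from pathlib import Path

# "lvismv" is listed before "lvis" so the longer benchmark tag is tried first.
_METRIC_PATTERN = re.compile(r"(coco|lvismv|lvis)(\d+(?:\.\d+)?)")


def parse_ckpt_metrics(checkpoint_path):
    """Extract {benchmark_tag: score} pairs embedded in a checkpoint file name.

    Assumes names shaped like
    'ovdino_swint_og-coco50.6_lvismv39.4_lvis32.2.pth'; returns an empty
    dict when no such tags are present.
    """
    name = Path(checkpoint_path).name
    return {tag: float(score) for tag, score in _METRIC_PATTERN.findall(name)}
```

For the checkpoint in the command above, this yields the `coco`, `lvismv`, and `lvis` scores as floats, which can then be compared against the AP reported by the evaluation run.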

Zero-Shot Evaluation on LVIS Benchmark

cd $root_dir/ovdino
# Evaluation of fixed_ap on LVIS MiniVal dataset.
bash scripts/eval.sh \
  projects/ovdino/configs/ovdino_swin_tiny224_bert_base_eval_lvismv.py \
  ../inits/ovdino/ovdino_swint_ogc-coco50.2_lvismv40.1_lvis32.9.pth \
  ../wkdrs/eval_ovdino

# Evaluation of fixed_ap on the LVIS Val dataset. 
# It will require about 250GB of memory due to the large number of samples in the LVIS Val dataset, so please ensure that your machine has enough memory.
bash scripts/eval.sh \
  projects/ovdino/configs/ovdino_swin_tiny224_bert_base_eval_lvis.py \
  ../inits/ovdino/ovdino_swint_ogc-coco50.2_lvismv40.1_lvis32.9.pth \
  ../wkdrs/eval_ovdino
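
Because the LVIS Val evaluation is stated to need about 250GB of memory, it may be worth checking the machine before launching a long run. The sketch below is an assumption-laden illustration, not part of the repository: it reads total physical RAM through `os.sysconf`, which works on Linux and other POSIX systems that expose `SC_PAGE_SIZE` and `SC_PHYS_PAGES`, it interprets "GB" as 10^9 bytes, and it reports installed rather than currently free memory. The function names are hypothetical.

```python
import os

GB = 10 ** 9  # decimal gigabyte; treating the README's "GB" this way is an assumption


def has_enough_memory(total_bytes, required_gb=250):
    """True when `total_bytes` covers `required_gb` decimal gigabytes."""
    return total_bytes >= required_gb * GB


def total_physical_memory():
    """Total installed RAM in bytes (POSIX only; not the currently free amount)."""
    return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")
```

A launcher script could call `has_enough_memory(total_physical_memory())` and fall back to the LVIS MiniVal evaluation when the result is `False`.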

4. Fine-Tuning

Fine-Tuning on COCO Dataset

cd $root_dir/ovdino
bash scripts/finetune.sh \
  projects/ovdino/configs/ovdino_swin_tiny224_bert_base_ft_coco_24ep.py \
  ../inits/ovdino/ovdino_swint_og-coco50.6_lvismv39.4_lvis32.2.pth

Fine-Tuning on Custom Dataset

Pre-Training on [Objects365, GoldG, CC1M‡] datasets

Coming soon ...

NOTE: We will release all of the pre-training code after our paper is accepted.

:computer: Demo

:white_check_mark: TODO

:blush: Acknowledgements

This project references several excellent open-source repositories (Detectron2, detrex, GLIP, G-DINO, YOLO-World). Thanks for their wonderful work and contributions to the community.

:pushpin: Citation

If you find OV-DINO helpful for your research or applications, please consider giving us a star 🌟 and citing it with the following BibTeX entry.

@article{wang2024ovdino,
  title={OV-DINO: Unified Open-Vocabulary Detection with Language-Aware Selective Fusion}, 
  author={Hao Wang and Pengzhen Ren and Zequn Jie and Xiao Dong and Chengjian Feng and Yinlong Qian and Lin Ma and Dongmei Jiang and Yaowei Wang and Xiangyuan Lan and Xiaodan Liang},
  journal={arXiv preprint arXiv:2407.07844},
  year={2024}
}