An Extendable, Efficient and Effective Transformer-based Object Detector (Extension of ViDT published at ICLR 2022)

Please see the vidt branch if you are interested in the vanilla ViDT model.
This is an extension of ViDT for joint-learning of object detection and instance segmentation.

by Hwanjun Song¹, Deqing Sun², Sanghyuk Chun¹, Varun Jampani², Dongyoon Han¹,
Byeongho Heo¹, Wonjae Kim¹, and Ming-Hsuan Yang²,³

¹ NAVER AI Lab, ² Google Research, ³ University of California, Merced

April 6, 2022: The official code is released!
We obtained a lightweight transformer-based detector that achieves 47.0 AP with only 14M parameters at 41.9 FPS (NVIDIA A100).
See Complete Analysis.
April 19, 2022: The preprint is uploaded. See [here]!

ViDT+ for Joint-learning of Object Detection and Instance Segmentation

Extension to ViDT+

We extend ViDT into ViDT+, which supports joint learning of object detection and instance segmentation in an end-to-end manner. The extension adds three new components: (1) an efficient pyramid feature fusion (EPFF) module, (2) a unified query representation (UQR) module, and (3) two auxiliary losses, an IoU-aware loss and a token-labeling loss. Compared with the vanilla ViDT, ViDT+ provides a significant performance improvement without compromising inference speed. Only 1M parameters are added to the model.

Evaluation

Index: [A. ViT Backbone], [B. Main Results], [C. Complete Analysis]

|--- A. ViT Backbone used for ViDT+
|--- B. Main Results in the ViDT+ Paper
     |--- B.1. ViDT+ compared with the vanilla ViDT for Object Detection
     |--- B.2. ViDT+ compared with other CNN-based methods for Object Detection and Instance Segmentation
|--- C. Complete Component Analysis

A. ViT Backbone used for ViDT+

Backbone and Size	Training Data	Epochs	Resolution	Params	ImageNet Acc.	Checkpoint
`Swin-nano`	ImageNet-1K	300	224	6M	74.9%	Github
`Swin-tiny`	ImageNet-1K	300	224	28M	81.2%	Github
`Swin-small`	ImageNet-1K	300	224	50M	83.2%	Github
`Swin-base`	ImageNet-22K	90	224	88M	86.3%	Github

B. Main Results in the ViDT+ Paper

All the models were re-trained with the final version of the source code, so the values may differ very slightly from those in the paper. Note that a single NVIDIA A100 GPU was used to compute FPS with an input batch size of 1.
Compared with the vanilla version, ViDT+ leverages three additional components or techniques:
(1) An efficient pyramid feature fusion (EPFF) module.
(2) A unified query representation (UQR) module.
(3) Two additional losses of IoU-aware loss and token-labeling loss.
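The FPS values reported in this section are per-image throughput at batch size 1. The benchmarking script itself is not shown in this README, so the following is only a rough sketch of how such a number could be measured. The helper name `measure_fps`, its parameters, and the warm-up and iteration counts are assumptions made for illustration.

```python
import time


def measure_fps(run_once, num_warmup=10, num_iters=50, sync=None):
    """Return how many times per second run_once() completes.

    run_once : zero-argument callable doing one forward pass on one image
               (batch size 1), so calls per second equal frames per second.
    sync     : optional zero-argument callable that blocks until pending
               work has finished (hypothetical hook for asynchronous devices).
    """
    # Warm-up calls are excluded from timing (caches, lazy initialization).
    for _ in range(num_warmup):
        run_once()
    if sync is not None:
        sync()

    start = time.perf_counter()
    for _ in range(num_iters):
        run_once()
    if sync is not None:
        sync()  # make sure all timed work is really done before stopping
    elapsed = time.perf_counter() - start
    return num_iters / elapsed
```

With a PyTorch model on a GPU, `run_once` would typically wrap a forward pass under `torch.no_grad()` and `sync` would be `torch.cuda.synchronize`, because CUDA kernels run asynchronously and the timer would otherwise be read too early.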

B.1. ViDT+ compared with the vanilla ViDT for Object Detection

Method	Backbone	Epochs	AP	AP50	AP75	AP_S	AP_M	AP_L	Params	FPS	Checkpoint / Log
ViDT+	`Swin-nano`	50	45.3	62.3	48.9	27.3	48.2	61.5	16M	37.6	Github / Log
ViDT+	`Swin-tiny`	50	49.7	67.7	54.2	31.6	53.4	65.9	38M	30.4	Github / Log
ViDT+	`Swin-small`	50	51.2	69.5	55.9	33.8	54.5	67.8	61M	20.6	Github / Log
ViDT+	`Swin-base`	50	53.2	71.6	58.3	36.0	57.1	69.2	100M	19.3	Github / Log

Method	Backbone	Epochs	AP	AP50	AP75	AP_S	AP_M	AP_L	Params	FPS	Checkpoint / Log
ViDT	`Swin-nano`	50	40.4	59.9	43.0	23.1	42.8	55.9	15M	40.8	Github / Log
ViDT	`Swin-tiny`	50	44.9	64.7	48.3	27.5	47.9	61.9	37M	33.5	Github / Log
ViDT	`Swin-small`	50	47.4	67.7	51.2	30.4	50.7	64.6	60M	24.7	Github / Log
ViDT	`Swin-base`	50	49.4	69.6	53.4	31.6	52.4	66.8	99M	20.5	Github / Log
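To make the comparison between the two tables easier to read, the per-backbone differences can be computed directly. The snippet below is a small sketch that copies the AP and FPS columns from the tables above; the names `B1_ROWS` and `vidt_plus_deltas` are made up for this example.

```python
# Rows copied from the two B.1 tables:
# (backbone, ViDT AP, ViDT+ AP, ViDT FPS, ViDT+ FPS)
B1_ROWS = [
    ("Swin-nano", 40.4, 45.3, 40.8, 37.6),
    ("Swin-tiny", 44.9, 49.7, 33.5, 30.4),
    ("Swin-small", 47.4, 51.2, 24.7, 20.6),
    ("Swin-base", 49.4, 53.2, 20.5, 19.3),
]


def vidt_plus_deltas(rows):
    """Return {backbone: (AP gain, FPS change)} of ViDT+ relative to ViDT."""
    return {
        name: (round(ap_plus - ap, 1), round(fps_plus - fps, 1))
        for name, ap, ap_plus, fps, fps_plus in rows
    }


if __name__ == "__main__":
    for name, (d_ap, d_fps) in vidt_plus_deltas(B1_ROWS).items():
        print(f"{name}: AP {d_ap:+.1f}, FPS {d_fps:+.1f}")
```

With the numbers above, the AP gain ranges from 3.8 (Swin-small, Swin-base) to 4.9 (Swin-nano), while the FPS change ranges from -1.2 (Swin-base) to -4.1 (Swin-small).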

B.2. ViDT+ compared with other CNN-based methods for Object Detection and Instance Segmentation

For a fair comparison with respect to the number of parameters, ViDT+ uses the Swin-tiny and Swin-small backbones, which have a similar number of parameters to ResNet-50 and ResNet-101, respectively.
ViDT+ achieves a much higher detection (box) AP than the other joint-learning methods, but its segmentation (mask) AP is generally higher only for medium- and large-size objects.

Method	Backbone	Epochs	Box AP	Mask AP	Mask AP_S	Mask AP_M	Mask AP_L
Mask R-CNN	`ResNet-50 + FPN`	36	41.3	37.5	21.1	39.6	48.3
HTC	`ResNet-50 + FPN`	36	44.9	39.7	22.6	42.2	50.6
SOLOv2	`ResNet-50 + FPN`	72	40.4	38.8	16.5	41.7	56.2
QueryInst	`ResNet-50 + FPN`	36	45.6	40.6	23.4	42.5	52.8
SOLQ	`ResNet-50`	50	47.8	39.7	21.5	42.5	53.1
ViDT+	`Swin-tiny`	50	49.7	39.5	21.5	43.4	58.2

Method	Backbone	Epochs	Box AP	Mask AP	Mask AP_S	Mask AP_M	Mask AP_L
Mask R-CNN	`ResNet-101 + FPN`	50	41.3	38.8	21.8	41.4	50.5
HTC	`ResNet-101 + FPN`	50	44.3	40.8	23.0	43.5	58.2
SOLOv2	`ResNet-101 + FPN`	50	42.6	39.7	17.3	42.9	58.2
QueryInst	`ResNet-101 + FPN`	50	48.1	42.8	24.6	45.0	58.2
SOLQ	`ResNet-101`	50	48.7	40.9	22.5	43.8	58.2
ViDT+	`Swin-small`	50	51.2	40.8	22.6	44.3	60.1

C. Complete Component Analysis

We combined the four proposed components (even with distillation with token matching and decoding layer drop) to achieve high accuracy and speed for object detection. For distillation, ViDT (Swin-base) trained for 50 epochs was used for all models.

We combined all the proposed components (even with longer training epochs and decoding layer drop) to achieve high accuracy and speed for object detection. As summarized in the table below, there are eight components for extension: (1) RAM, (2) the neck decoder, (3) the IoU-aware and token labeling losses, (4) the EPFF module, (5) the UQR module, (6) the use of more detection tokens, (7) the use of longer training epochs, and (8) decoding layer drop.

Rows (2), (6), and (8) report the performance of the vanilla ViDT, its extension to ViDT+, and the fully optimized ViDT+, respectively.

	Added	Swin-nano			Swin-tiny			Swin-small
#	Module	AP	Params	FPS	AP	Params	FPS	AP	Params	FPS
(1)	+ RAM	28.7	7M	72.4	36.3	29M	51.8	41.6	52M	33.5
(2)	+ Encoder-free Neck	40.4	15M	40.8	44.8	37M	33.5	47.5	60M	24.7
(3)	+ IoU-aware & Token Label	41.0	15M	40.8	45.9	37M	33.5	48.5	60M	24.7
(4)	+ EPFF Module	42.5	16M	38.0	47.1	38M	30.9	49.3	61M	23.0
(5)	+ UQR Module	43.9	16M	38.0	47.9	38M	30.9	50.1	61M	23.0
(6)	+ 300 [DET] Tokens	45.3	16M	37.6	49.7	38M	30.4	51.2	61M	22.6
(7)	+ 150 Training Epochs	47.6	16M	37.6	51.4	38M	30.4	52.3	61M	22.6
(8)	+ Decoding Layer Drop	47.0	14M	41.9	50.8	36M	33.9	51.8	59M	24.6
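Because each row adds one component on top of the previous row, the contribution of each step is the difference between consecutive AP values. The sketch below computes these differences from the AP columns of the table above; the names `STEPS`, `AP`, and `step_gains` are chosen for this example only.

```python
# Component names for rows (1) to (8) of the table above.
STEPS = [
    "+ RAM",
    "+ Encoder-free Neck",
    "+ IoU-aware & Token Label",
    "+ EPFF Module",
    "+ UQR Module",
    "+ 300 [DET] Tokens",
    "+ 150 Training Epochs",
    "+ Decoding Layer Drop",
]

# AP columns copied from the table above, one list per backbone.
AP = {
    "Swin-nano": [28.7, 40.4, 41.0, 42.5, 43.9, 45.3, 47.6, 47.0],
    "Swin-tiny": [36.3, 44.8, 45.9, 47.1, 47.9, 49.7, 51.4, 50.8],
    "Swin-small": [41.6, 47.5, 48.5, 49.3, 50.1, 51.2, 52.3, 51.8],
}


def step_gains(ap_values):
    """AP change contributed by rows (2)..(8), each relative to the row before."""
    return [round(b - a, 1) for a, b in zip(ap_values, ap_values[1:])]


if __name__ == "__main__":
    for backbone, values in AP.items():
        print(backbone)
        for step, gain in zip(STEPS[1:], step_gains(values)):
            print(f"  {step}: {gain:+.1f} AP")
```

For Swin-nano this gives +11.7, +0.6, +1.5, +1.4, +1.4, +2.3, and -0.6 AP for rows (2) to (8). The last step, decoding layer drop, trades a small amount of AP for fewer parameters and higher FPS.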

The optimized ViDT+ models can be found:
ViDT+ (Swin-nano), ViDT+ (Swin-tiny), and ViDT+ (Swin-small).

Requirements

This codebase has been developed with the setting used in Deformable DETR:
Linux, CUDA>=9.2, GCC>=5.4, Python>=3.7, PyTorch>=1.5.1, and torchvision>=0.6.1.

We recommend using Anaconda to create a conda environment:

conda create -n deformable_detr python=3.7 pip
conda activate deformable_detr
conda install pytorch=1.5.1 torchvision=0.6.1 cudatoolkit=9.2 -c pytorch
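After creating the environment, it can be useful to confirm that the installed versions satisfy the minimums listed above. The following is a minimal sketch of such a check, not part of this codebase; `MINIMUMS`, `parse_version`, and `meets_minimum` are hypothetical helper names, and the parsing only handles plain numeric versions with an optional `+local` suffix.

```python
import platform
import re

# Minimum versions taken from the requirements listed above.
MINIMUMS = {"python": "3.7", "torch": "1.5.1", "torchvision": "0.6.1"}


def parse_version(text):
    """Turn a string such as '1.5.1+cu92' into a tuple such as (1, 5, 1)."""
    core = text.split("+")[0]  # drop a local build suffix like '+cu92'
    return tuple(int(part) for part in re.findall(r"\d+", core)[:3])


def meets_minimum(installed, minimum):
    """True if the installed version is at least the required minimum."""
    return parse_version(installed) >= parse_version(minimum)


if __name__ == "__main__":
    print("python ok:", meets_minimum(platform.python_version(), MINIMUMS["python"]))
    try:
        import torch
        import torchvision
    except ImportError:
        print("torch / torchvision are not installed in this environment")
    else:
        print("torch ok:", meets_minimum(torch.__version__, MINIMUMS["torch"]))
        print("torchvision ok:",
              meets_minimum(torchvision.__version__, MINIMUMS["torchvision"]))
```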

Compiling CUDA operators for deformable attention

cd ./ops
sh ./make.sh
# unit test (should see all checking is True)
python test.py

Other requirements

pip install -r requirements.txt

Training and Evaluation

If you want to test with a single GPU, see colab examples. Thanks to EherSenaw for making this example.
The commands below are for multi-GPU training.

Training for ViDT+

We used the commands below to train the ViDT+ models on a single node with 8 NVIDIA GPUs.

Run this command to train the ViDT+ (Swin-nano) model in the paper:


python -m torch.distributed.launch \
       --nproc_per_node=8 \
       --nnodes=1 \
       --use_env main.py \
       --method vidt \
       --backbone_name swin_nano \
       --epochs 50 \
       --lr 1e-4 \
       --min-lr 1e-7 \
       --batch_size 2 \
       --num_workers 2 \
       --aux_loss True \
       --with_box_refine True \
       --det_token_num 300 \
       --epff True \
       --token_label True \
       --iou_aware True \
       --with_vector True \
       --masks True \
       --coco_path /path/to/coco \
       --output_dir /path/for/output

Run this command to train the ViDT+ (Swin-tiny) model in the paper:


python -m torch.distributed.launch \
       --nproc_per_node=8 \
       --nnodes=1 \
       --use_env main.py \
       --method vidt \
       --backbone_name swin_tiny \
       --epochs 50 \
       --lr 1e-4 \
       --min-lr 1e-7 \
       --batch_size 2 \
       --num_workers 2 \
       --aux_loss True \
       --with_box_refine True \
       --det_token_num 300 \
       --epff True \
       --token_label True \
       --iou_aware True \
       --with_vector True \
       --masks True \
       --coco_path /path/to/coco \
       --output_dir /path/for/output

Run this command to train the ViDT+ (Swin-small) model in the paper:


python -m torch.distributed.launch \
       --nproc_per_node=8 \
       --nnodes=1 \
       --use_env main.py \
       --method vidt \
       --backbone_name swin_small \
       --epochs 50 \
       --lr 1e-4 \
       --min-lr 1e-7 \
       --batch_size 2 \
       --num_workers 2 \
       --aux_loss True \
       --with_box_refine True \
       --det_token_num 300 \
       --epff True \
       --token_label True \
       --iou_aware True \
       --with_vector True \
       --masks True \
       --coco_path /path/to/coco \
       --output_dir /path/for/output

Run this command to train the ViDT+ (Swin-base) model in the paper:


python -m torch.distributed.launch \
       --nproc_per_node=8 \
       --nnodes=1 \
       --use_env main.py \
       --method vidt \
       --backbone_name swin_base_win7_22k \
       --epochs 50 \
       --lr 1e-4 \
       --min-lr 1e-7 \
       --batch_size 2 \
       --num_workers 2 \
       --aux_loss True \
       --with_box_refine True \
       --det_token_num 300 \
       --epff True \
       --token_label True \
       --iou_aware True \
       --with_vector True \
       --masks True \
       --coco_path /path/to/coco \
       --output_dir /path/for/output
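The four ViDT+ training commands above differ only in the value of `--backbone_name`. As a convenience sketch (not part of this codebase), the command line can be assembled programmatically. All flags and values below are copied from the commands above; the function name `build_train_command` and its parameters are hypothetical.

```python
import shlex

# Backbone names exactly as they appear in the training commands above.
BACKBONES = ["swin_nano", "swin_tiny", "swin_small", "swin_base_win7_22k"]


def build_train_command(backbone, coco_path, output_dir, nproc=8):
    """Return the ViDT+ training command as a list of arguments."""
    if backbone not in BACKBONES:
        raise ValueError(f"unknown backbone: {backbone!r}")
    return [
        "python", "-m", "torch.distributed.launch",
        f"--nproc_per_node={nproc}", "--nnodes=1",
        "--use_env", "main.py",
        "--method", "vidt",
        "--backbone_name", backbone,
        "--epochs", "50",
        "--lr", "1e-4",
        "--min-lr", "1e-7",
        "--batch_size", "2",
        "--num_workers", "2",
        "--aux_loss", "True",
        "--with_box_refine", "True",
        "--det_token_num", "300",
        "--epff", "True",
        "--token_label", "True",
        "--iou_aware", "True",
        "--with_vector", "True",
        "--masks", "True",
        "--coco_path", coco_path,
        "--output_dir", output_dir,
    ]


if __name__ == "__main__":
    for name in BACKBONES:
        args = build_train_command(name, "/path/to/coco", "/path/for/output")
        # Quote each argument so the printed line can be pasted into a shell.
        print(" ".join(shlex.quote(a) for a in args))
```

The returned list can also be passed directly to `subprocess.run`, which avoids shell quoting issues with paths that contain spaces.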

Evaluation for ViDT+

Run this command to evaluate the ViDT+ (Swin-nano) model on COCO:


python -m torch.distributed.launch \
       --nproc_per_node=8 \
       --nnodes=1 \
       --use_env main.py \
       --method vidt \
       --backbone_name swin_nano \
       --batch_size 2 \
       --num_workers 2 \
       --aux_loss True \
       --with_box_refine True \
       --det_token_num 300 \
       --epff True \
       --coco_path /path/to/coco \
       --resume /path/to/vidt_nano \
       --pre_trained none \
       --eval True

Run this command to evaluate the ViDT+ (Swin-tiny) model on COCO:


python -m torch.distributed.launch \
       --nproc_per_node=8 \
       --nnodes=1 \
       --use_env main.py \
       --method vidt \
       --backbone_name swin_tiny \
       --batch_size 2 \
       --num_workers 2 \
       --aux_loss True \
       --with_box_refine True \
       --det_token_num 300 \
       --epff True \
       --coco_path /path/to/coco \
       --resume /path/to/vidt_tiny \
       --pre_trained none \
       --eval True

Run this command to evaluate the ViDT+ (Swin-small) model on COCO:


python -m torch.distributed.launch \
       --nproc_per_node=8 \
       --nnodes=1 \
       --use_env main.py \
       --method vidt \
       --backbone_name swin_small \
       --batch_size 2 \
       --num_workers 2 \
       --aux_loss True \
       --with_box_refine True \
       --det_token_num 300 \
       --epff True \
       --coco_path /path/to/coco \
       --resume /path/to/vidt_small \
       --pre_trained none \
       --eval True

Run this command to evaluate the ViDT+ (Swin-base) model on COCO:


python -m torch.distributed.launch \
       --nproc_per_node=8 \
       --nnodes=1 \
       --use_env main.py \
       --method vidt \
       --backbone_name swin_base_win7_22k \
       --batch_size 2 \
       --num_workers 2 \
       --aux_loss True \
       --with_box_refine True \
       --det_token_num 300 \
       --epff True \
       --coco_path /path/to/coco \
       --resume /path/to/vidt_base \
       --pre_trained none \
       --eval True

Training for ViDT

We used the commands below to train the ViDT models on a single node with 8 NVIDIA GPUs.

Run this command to train the ViDT (Swin-nano) model in the paper:


python -m torch.distributed.launch \
       --nproc_per_node=8 \
       --nnodes=1 \
       --use_env main.py \
       --method vidt \
       --backbone_name swin_nano \
       --epochs 50 \
       --lr 1e-4 \
       --min-lr 1e-7 \
       --batch_size 2 \
       --num_workers 2 \
       --aux_loss True \
       --with_box_refine True \
       --det_token_num 100 \
       --coco_path /path/to/coco \
       --output_dir /path/for/output

Run this command to train the ViDT (Swin-tiny) model in the paper:


python -m torch.distributed.launch \
       --nproc_per_node=8 \
       --nnodes=1 \
       --use_env main.py \
       --method vidt \
       --backbone_name swin_tiny \
       --epochs 50 \
       --lr 1e-4 \
       --min-lr 1e-7 \
       --batch_size 2 \
       --num_workers 2 \
       --aux_loss True \
       --with_box_refine True \
       --det_token_num 100 \
       --coco_path /path/to/coco \
       --output_dir /path/for/output

Run this command to train the ViDT (Swin-small) model in the paper:


python -m torch.distributed.launch \
       --nproc_per_node=8 \
       --nnodes=1 \
       --use_env main.py \
       --method vidt \
       --backbone_name swin_small \
       --epochs 50 \
       --lr 1e-4 \
       --min-lr 1e-7 \
       --batch_size 2 \
       --num_workers 2 \
       --aux_loss True \
       --with_box_refine True \
       --det_token_num 100 \
       --coco_path /path/to/coco \
       --output_dir /path/for/output

Run this command to train the ViDT (Swin-base) model in the paper:


python -m torch.distributed.launch \
       --nproc_per_node=8 \
       --nnodes=1 \
       --use_env main.py \
       --method vidt \
       --backbone_name swin_base_win7_22k \
       --epochs 50 \
       --lr 1e-4 \
       --min-lr 1e-7 \
       --batch_size 2 \
       --num_workers 2 \
       --aux_loss True \
       --with_box_refine True \
       --det_token_num 100 \
       --coco_path /path/to/coco \
       --output_dir /path/for/output
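Comparing these ViDT training commands with the ViDT+ training commands earlier in this README shows which options turn the vanilla model into ViDT+. The sketch below computes that difference from the model-specific flags as they are written in the commands; the dictionary and function names are made up for this example, and what each flag does internally is not described here.

```python
# Model-specific flags copied from the ViDT training commands.
VIDT_FLAGS = {
    "--aux_loss": "True",
    "--with_box_refine": "True",
    "--det_token_num": "100",
}

# Model-specific flags copied from the ViDT+ training commands.
VIDT_PLUS_FLAGS = {
    "--aux_loss": "True",
    "--with_box_refine": "True",
    "--det_token_num": "300",
    "--epff": "True",
    "--token_label": "True",
    "--iou_aware": "True",
    "--with_vector": "True",
    "--masks": "True",
}


def flag_diff(base, extended):
    """Return (added, changed): flags only in `extended`, and flags whose value differs."""
    added = {k: v for k, v in extended.items() if k not in base}
    changed = {
        k: (base[k], v)
        for k, v in extended.items()
        if k in base and base[k] != v
    }
    return added, changed


if __name__ == "__main__":
    added, changed = flag_diff(VIDT_FLAGS, VIDT_PLUS_FLAGS)
    print("added in ViDT+:", sorted(added))
    print("changed in ViDT+:", changed)
```

According to the commands, ViDT+ adds `--epff`, `--token_label`, `--iou_aware`, `--with_vector`, and `--masks`, and raises `--det_token_num` from 100 to 300.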

Evaluation for ViDT

Run this command to evaluate the ViDT (Swin-nano) model on COCO:


python -m torch.distributed.launch \
       --nproc_per_node=8 \
       --nnodes=1 \
       --use_env main.py \
       --method vidt \
       --backbone_name swin_nano \
       --batch_size 2 \
       --num_workers 2 \
       --aux_loss True \
       --with_box_refine True \
       --det_token_num 100 \
       --coco_path /path/to/coco \
       --resume /path/to/vidt_nano \
       --pre_trained none \
       --eval True

Run this command to evaluate the ViDT (Swin-tiny) model on COCO:


python -m torch.distributed.launch \
       --nproc_per_node=8 \
       --nnodes=1 \
       --use_env main.py \
       --method vidt \
       --backbone_name swin_tiny \
       --batch_size 2 \
       --num_workers 2 \
       --aux_loss True \
       --with_box_refine True \
       --det_token_num 100 \
       --coco_path /path/to/coco \
       --resume /path/to/vidt_tiny \
       --pre_trained none \
       --eval True

Run this command to evaluate the ViDT (Swin-small) model on COCO:


python -m torch.distributed.launch \
       --nproc_per_node=8 \
       --nnodes=1 \
       --use_env main.py \
       --method vidt \
       --backbone_name swin_small \
       --batch_size 2 \
       --num_workers 2 \
       --aux_loss True \
       --with_box_refine True \
       --det_token_num 100 \
       --coco_path /path/to/coco \
       --resume /path/to/vidt_small \
       --pre_trained none \
       --eval True

Run this command to evaluate the ViDT (Swin-base) model on COCO:


python -m torch.distributed.launch \
       --nproc_per_node=8 \
       --nnodes=1 \
       --use_env main.py \
       --method vidt \
       --backbone_name swin_base_win7_22k \
       --batch_size 2 \
       --num_workers 2 \
       --aux_loss True \
       --with_box_refine True \
       --det_token_num 100 \
       --coco_path /path/to/coco \
       --resume /path/to/vidt_base \
       --pre_trained none \
       --eval True

Citation

Please consider citing our papers if they are useful in your research.

@inproceedings{song2022vidt,
  title={ViDT: An Efficient and Effective Fully Transformer-based Object Detector},
  author={Song, Hwanjun and Sun, Deqing and Chun, Sanghyuk and Jampani, Varun and Han, Dongyoon and Heo, Byeongho and Kim, Wonjae and Yang, Ming-Hsuan},
  booktitle={International Conference on Learning Representations},
  year={2022}
}

@article{song2022vidtplus,
  title={An Extendable, Efficient and Effective Transformer-based Object Detector},
  author={Song, Hwanjun and Sun, Deqing and Chun, Sanghyuk and Jampani, Varun and Han, Dongyoon and Heo, Byeongho and Kim, Wonjae and Yang, Ming-Hsuan},
  journal={arXiv preprint arXiv:2204.07962},
  year={2022}
}

License

Copyright 2021-present NAVER Corp.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
