Please see the vidt branch if you are interested in the vanilla ViDT model. This branch is an extension of ViDT for joint learning of object detection and instance segmentation.
by Hwanjun Song1, Deqing Sun2, Sanghyuk Chun1, Varun Jampani2, Dongyoon Han1,
Byeongho Heo1, Wonjae Kim1, and Ming-Hsuan Yang2,3
1 NAVER AI Lab, 2 Google Research, 3 University of California, Merced
April 6, 2022: The official code is released! We obtained a light-weight transformer-based detector, achieving 47.0 AP with only 14M parameters and 41.9 FPS (NVIDIA A100). See Complete Analysis.
April 19, 2022: The preprint is uploaded; see [here]!
We extend ViDT into ViDT+, which supports joint learning of object detection and instance segmentation in an end-to-end manner. Three new components are leveraged for the extension: (1) an efficient pyramid feature fusion (EPFF) module, (2) a unified query representation (UQR) module, and (3) two auxiliary losses, an IoU-aware loss and token labeling. Compared with the vanilla ViDT, ViDT+ provides a significant performance improvement without compromising inference speed, adding only 1M parameters to the model.
Index: [A. ViT Backbone], [B. Main Results], [C. Complete Analysis]
|--- A. ViT Backbone used for ViDT
|--- B. Main Results in the ViDT+ Paper
|--- B.1. VIDT+ compared with the vanilla ViDT for Object Detection
|--- B.2. VIDT+ compared with other CNN-based methods for Object Detection and Instance Segmentation
|--- C. Complete Component Analysis
Backbone and Size | Training Data | Epochs | Resolution | Params | ImageNet Acc. | Checkpoint |
---|---|---|---|---|---|---|
Swin-nano | ImageNet-1K | 300 | 224 | 6M | 74.9% | Github |
Swin-tiny | ImageNet-1K | 300 | 224 | 28M | 81.2% | Github |
Swin-small | ImageNet-1K | 300 | 224 | 50M | 83.2% | Github |
Swin-base | ImageNet-22K | 90 | 224 | 88M | 86.3% | Github |
All the models were re-trained with the final version of the source code, so the values may differ very slightly from those in the paper. Note that a single NVIDIA A100 GPU was used to compute FPS for inputs of batch size 1.
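As a rough guide, the reported throughputs can be converted to per-image latency at batch size 1. A quick sketch using the A100 numbers stated in this README:

```python
def fps_to_latency_ms(fps: float) -> float:
    """Convert throughput (frames/s, batch size 1) to per-image latency in ms."""
    return 1000.0 / fps

# A100 numbers reported in this README
for name, fps in [("ViDT+ (Swin-nano)", 37.6), ("optimized ViDT+ (Swin-nano)", 41.9)]:
    print(f"{name}: {fps_to_latency_ms(fps):.1f} ms/image")
```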
Compared with the vanilla version, ViDT+ leverages three additional components or techniques:
(1) An efficient pyramid feature fusion (EPFF) module.
(2) A unified query representation (UQR) module.
(3) Two additional losses: an IoU-aware loss and a token-labeling loss.
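For intuition on component (3), an IoU-aware loss supervises a predicted confidence with the actual IoU between each predicted box and its matched ground truth. A minimal sketch of the idea (box format and function names are illustrative, not the repository's API):

```python
import math

def box_iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda box: (box[2] - box[0]) * (box[3] - box[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def iou_aware_bce(pred_iou_score, pred_box, gt_box):
    """Binary cross-entropy between a predicted IoU score and the true IoU."""
    target = box_iou(pred_box, gt_box)
    eps = 1e-7
    p = min(max(pred_iou_score, eps), 1 - eps)
    return -(target * math.log(p) + (1 - target) * math.log(1 - p))
```

In the real model this term is computed per matched query with tensor operations; the sketch only shows the scalar form of the supervision signal.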
Method | Backbone | Epochs | AP | AP50 | AP75 | AP_S | AP_M | AP_L | Params | FPS | Checkpoint / Log |
---|---|---|---|---|---|---|---|---|---|---|---|
ViDT+ | Swin-nano | 50 | 45.3 | 62.3 | 48.9 | 27.3 | 48.2 | 61.5 | 16M | 37.6 | Github / Log |
ViDT+ | Swin-tiny | 50 | 49.7 | 67.7 | 54.2 | 31.6 | 53.4 | 65.9 | 38M | 30.4 | Github / Log |
ViDT+ | Swin-small | 50 | 51.2 | 69.5 | 55.9 | 33.8 | 54.5 | 67.8 | 61M | 20.6 | Github / Log |
ViDT+ | Swin-base | 50 | 53.2 | 71.6 | 58.3 | 36.0 | 57.1 | 69.2 | 100M | 19.3 | Github / Log |
Method | Backbone | Epochs | AP | AP50 | AP75 | AP_S | AP_M | AP_L | Params | FPS | Checkpoint / Log |
---|---|---|---|---|---|---|---|---|---|---|---|
ViDT | Swin-nano | 50 | 40.4 | 59.9 | 43.0 | 23.1 | 42.8 | 55.9 | 15M | 40.8 | Github / Log |
ViDT | Swin-tiny | 50 | 44.9 | 64.7 | 48.3 | 27.5 | 47.9 | 61.9 | 37M | 33.5 | Github / Log |
ViDT | Swin-small | 50 | 47.4 | 67.7 | 51.2 | 30.4 | 50.7 | 64.6 | 60M | 24.7 | Github / Log |
ViDT | Swin-base | 50 | 49.4 | 69.6 | 53.4 | 31.6 | 52.4 | 66.8 | 99M | 20.5 | Github / Log |
For a fair comparison w.r.t. the number of parameters, the Swin-tiny and Swin-small backbones are used for ViDT+, since they have a similar number of parameters to ResNet-50 and ResNet-101, respectively. ViDT+ achieves much higher detection AP than the other joint-learning methods; in general, its segmentation AP surpasses the others only for medium- and large-size objects.
Method | Backbone | Epochs | Box AP | Mask AP | Mask AP_S | Mask AP_M | Mask AP_L |
---|---|---|---|---|---|---|---|
Mask R-CNN | ResNet-50 + FPN | 36 | 41.3 | 37.5 | 21.1 | 39.6 | 48.3 |
HTC | ResNet-50 + FPN | 36 | 44.9 | 39.7 | 22.6 | 42.2 | 50.6 |
SOLOv2 | ResNet-50 + FPN | 72 | 40.4 | 38.8 | 16.5 | 41.7 | 56.2 |
QueryInst | ResNet-50 + FPN | 36 | 45.6 | 40.6 | 23.4 | 42.5 | 52.8 |
SOLQ | ResNet-50 | 50 | 47.8 | 39.7 | 21.5 | 42.5 | 53.1 |
ViDT+ | Swin-tiny | 50 | 49.7 | 39.5 | 21.5 | 43.4 | 58.2 |
Method | Backbone | Epochs | Box AP | Mask AP | Mask AP_S | Mask AP_M | Mask AP_L |
---|---|---|---|---|---|---|---|
Mask R-CNN | ResNet-101 + FPN | 50 | 41.3 | 38.8 | 21.8 | 41.4 | 50.5 |
HTC | ResNet-101 + FPN | 50 | 44.3 | 40.8 | 23.0 | 43.5 | 58.2 |
SOLOv2 | ResNet-101 + FPN | 50 | 42.6 | 39.7 | 17.3 | 42.9 | 58.2 |
QueryInst | ResNet-101 + FPN | 50 | 48.1 | 42.8 | 24.6 | 45.0 | 58.2 |
SOLQ | ResNet-101 | 50 | 48.7 | 40.9 | 22.5 | 43.8 | 58.2 |
ViDT+ | Swin-small | 50 | 51.2 | 40.8 | 22.6 | 44.3 | 60.1 |
We combined all the proposed components (together with longer training epochs and decoding layer drop) to achieve high accuracy and speed for object detection. As summarized in the table below, there are eight components in the extension: (1) RAM, (2) the neck decoder, (3) the IoU-aware and token-labeling losses, (4) the EPFF module, (5) the UQR module, (6) the use of more [DET] tokens, (7) the use of longer training epochs, and (8) decoding layer drop.
Rows (2), (6), and (8) correspond to the performance of the vanilla ViDT, its extension to ViDT+, and the fully optimized ViDT+, respectively.
# | Added Module | AP (nano) | Params | FPS | AP (tiny) | Params | FPS | AP (small) | Params | FPS |
---|---|---|---|---|---|---|---|---|---|---|
(1) | + RAM | 28.7 | 7M | 72.4 | 36.3 | 29M | 51.8 | 41.6 | 52M | 33.5 |
(2) | + Encoder-free Neck | 40.4 | 15M | 40.8 | 44.8 | 37M | 33.5 | 47.5 | 60M | 24.7 |
(3) | + IoU-aware & Token Label | 41.0 | 15M | 40.8 | 45.9 | 37M | 33.5 | 48.5 | 60M | 24.7 |
(4) | + EPFF Module | 42.5 | 16M | 38.0 | 47.1 | 38M | 30.9 | 49.3 | 61M | 23.0 |
(5) | + UQR Module | 43.9 | 16M | 38.0 | 47.9 | 38M | 30.9 | 50.1 | 61M | 23.0 |
(6) | + 300 [DET] Tokens | 45.3 | 16M | 37.6 | 49.7 | 38M | 30.4 | 51.2 | 61M | 22.6 |
(7) | + 150 Training Epochs | 47.6 | 16M | 37.6 | 51.4 | 38M | 30.4 | 52.3 | 61M | 22.6 |
(8) | + Decoding Layer Drop | 47.0 | 14M | 41.9 | 50.8 | 36M | 33.9 | 51.8 | 59M | 24.6 |
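The incremental effect of each component can be read off the table. As a quick sketch, the per-step AP gains for the Swin-nano column are:

```python
# AP after each added component for the Swin-nano backbone,
# copied from the component-analysis table above
steps = [
    ("+ RAM", 28.7),
    ("+ Encoder-free Neck", 40.4),
    ("+ IoU-aware & Token Label", 41.0),
    ("+ EPFF Module", 42.5),
    ("+ UQR Module", 43.9),
    ("+ 300 [DET] Tokens", 45.3),
    ("+ 150 Training Epochs", 47.6),
    ("+ Decoding Layer Drop", 47.0),
]
for (_, prev), (name, ap) in zip(steps, steps[1:]):
    print(f"{name}: {ap - prev:+.1f} AP")
```

Note that the final step (decoding layer drop) trades a small AP decrease for fewer parameters and higher FPS.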
The optimized ViDT+ models can be found here: ViDT+ (Swin-nano), ViDT+ (Swin-tiny), and ViDT+ (Swin-small).
This codebase has been developed with the setting used in Deformable DETR:
Linux, CUDA>=9.2, GCC>=5.4, Python>=3.7, PyTorch>=1.5.1, and torchvision>=0.6.1.
We recommend using Anaconda to create a conda environment:
conda create -n deformable_detr python=3.7 pip
conda activate deformable_detr
conda install pytorch=1.5.1 torchvision=0.6.1 cudatoolkit=9.2 -c pytorch
cd ./ops
sh ./make.sh
# unit test (should see all checking is True)
python test.py
pip install -r requirements.txt
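As a quick sanity check after installation, the installed versions can be compared against the minimums listed above. This is a simple illustrative helper, not part of the repository, and it only handles purely numeric version strings (e.g. `1.5.1`, not `1.5.1+cu92`):

```python
def meets_minimum(installed: str, minimum: str) -> bool:
    """Compare dotted version strings numerically, e.g. '1.10.0' >= '1.5.1'."""
    parse = lambda v: tuple(int(part) for part in v.split("."))
    return parse(installed) >= parse(minimum)

# e.g. compare torch.__version__ / torchvision.__version__ against the minimums
print(meets_minimum("1.5.1", "1.5.1"))   # the minimum itself passes
print(meets_minimum("0.6.0", "0.6.1"))   # an older torchvision does not
```

Comparing version strings lexicographically would get `"1.10.0" < "1.5.1"` wrong, which is why the helper parses them into integer tuples first.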
If you want to test with a single GPU, see the colab examples. Thanks to EherSenaw for making this example.
The commands below are for multi-GPU training. We used them to train ViDT+ models on a single node with 8 NVIDIA GPUs.
ViDT+ (Swin-nano) model in the paper:
python -m torch.distributed.launch \
--nproc_per_node=8 \
--nnodes=1 \
--use_env main.py \
--method vidt \
--backbone_name swin_nano \
--epochs 50 \
--lr 1e-4 \
--min-lr 1e-7 \
--batch_size 2 \
--num_workers 2 \
--aux_loss True \
--with_box_refine True \
--det_token_num 300 \
--epff True \
--token_label True \
--iou_aware True \
--with_vector True \
--masks True \
--coco_path /path/to/coco \
--output_dir /path/for/output
ViDT+ (Swin-tiny) model in the paper:
python -m torch.distributed.launch \
--nproc_per_node=8 \
--nnodes=1 \
--use_env main.py \
--method vidt \
--backbone_name swin_tiny \
--epochs 50 \
--lr 1e-4 \
--min-lr 1e-7 \
--batch_size 2 \
--num_workers 2 \
--aux_loss True \
--with_box_refine True \
--det_token_num 300 \
--epff True \
--token_label True \
--iou_aware True \
--with_vector True \
--masks True \
--coco_path /path/to/coco \
--output_dir /path/for/output
ViDT+ (Swin-small) model in the paper:
python -m torch.distributed.launch \
--nproc_per_node=8 \
--nnodes=1 \
--use_env main.py \
--method vidt \
--backbone_name swin_small \
--epochs 50 \
--lr 1e-4 \
--min-lr 1e-7 \
--batch_size 2 \
--num_workers 2 \
--aux_loss True \
--with_box_refine True \
--det_token_num 300 \
--epff True \
--token_label True \
--iou_aware True \
--with_vector True \
--masks True \
--coco_path /path/to/coco \
--output_dir /path/for/output
ViDT+ (Swin-base) model in the paper:
python -m torch.distributed.launch \
--nproc_per_node=8 \
--nnodes=1 \
--use_env main.py \
--method vidt \
--backbone_name swin_base_win7_22k \
--epochs 50 \
--lr 1e-4 \
--min-lr 1e-7 \
--batch_size 2 \
--num_workers 2 \
--aux_loss True \
--with_box_refine True \
--det_token_num 300 \
--epff True \
--token_label True \
--iou_aware True \
--with_vector True \
--masks True \
--coco_path /path/to/coco \
--output_dir /path/for/output
ViDT+ (Swin-nano) model on COCO:
python -m torch.distributed.launch \
--nproc_per_node=8 \
--nnodes=1 \
--use_env main.py \
--method vidt \
--backbone_name swin_nano \
--batch_size 2 \
--num_workers 2 \
--aux_loss True \
--with_box_refine True \
--det_token_num 300 \
--epff True \
--coco_path /path/to/coco \
--resume /path/to/vidt_nano \
--pre_trained none \
--eval True
ViDT+ (Swin-tiny) model on COCO:
python -m torch.distributed.launch \
--nproc_per_node=8 \
--nnodes=1 \
--use_env main.py \
--method vidt \
--backbone_name swin_tiny \
--batch_size 2 \
--num_workers 2 \
--aux_loss True \
--with_box_refine True \
--det_token_num 300 \
--epff True \
--coco_path /path/to/coco \
--resume /path/to/vidt_tiny \
--pre_trained none \
--eval True
ViDT+ (Swin-small) model on COCO:
python -m torch.distributed.launch \
--nproc_per_node=8 \
--nnodes=1 \
--use_env main.py \
--method vidt \
--backbone_name swin_small \
--batch_size 2 \
--num_workers 2 \
--aux_loss True \
--with_box_refine True \
--det_token_num 300 \
--epff True \
--coco_path /path/to/coco \
--resume /path/to/vidt_small \
--pre_trained none \
--eval True
ViDT+ (Swin-base) model on COCO:
python -m torch.distributed.launch \
--nproc_per_node=8 \
--nnodes=1 \
--use_env main.py \
--method vidt \
--backbone_name swin_base_win7_22k \
--batch_size 2 \
--num_workers 2 \
--aux_loss True \
--with_box_refine True \
--det_token_num 300 \
--epff True \
--coco_path /path/to/coco \
--resume /path/to/vidt_base \
--pre_trained none \
--eval True
We used the commands below to train the vanilla ViDT models on a single node with 8 NVIDIA GPUs.
ViDT (Swin-nano) model in the paper:
python -m torch.distributed.launch \
--nproc_per_node=8 \
--nnodes=1 \
--use_env main.py \
--method vidt \
--backbone_name swin_nano \
--epochs 50 \
--lr 1e-4 \
--min-lr 1e-7 \
--batch_size 2 \
--num_workers 2 \
--aux_loss True \
--with_box_refine True \
--det_token_num 100 \
--coco_path /path/to/coco \
--output_dir /path/for/output
ViDT (Swin-tiny) model in the paper:
python -m torch.distributed.launch \
--nproc_per_node=8 \
--nnodes=1 \
--use_env main.py \
--method vidt \
--backbone_name swin_tiny \
--epochs 50 \
--lr 1e-4 \
--min-lr 1e-7 \
--batch_size 2 \
--num_workers 2 \
--aux_loss True \
--with_box_refine True \
--det_token_num 100 \
--coco_path /path/to/coco \
--output_dir /path/for/output
ViDT (Swin-small) model in the paper:
python -m torch.distributed.launch \
--nproc_per_node=8 \
--nnodes=1 \
--use_env main.py \
--method vidt \
--backbone_name swin_small \
--epochs 50 \
--lr 1e-4 \
--min-lr 1e-7 \
--batch_size 2 \
--num_workers 2 \
--aux_loss True \
--with_box_refine True \
--det_token_num 100 \
--coco_path /path/to/coco \
--output_dir /path/for/output
ViDT (Swin-base) model in the paper:
python -m torch.distributed.launch \
--nproc_per_node=8 \
--nnodes=1 \
--use_env main.py \
--method vidt \
--backbone_name swin_base_win7_22k \
--epochs 50 \
--lr 1e-4 \
--min-lr 1e-7 \
--batch_size 2 \
--num_workers 2 \
--aux_loss True \
--with_box_refine True \
--det_token_num 100 \
--coco_path /path/to/coco \
--output_dir /path/for/output
ViDT (Swin-nano) model on COCO:
python -m torch.distributed.launch \
--nproc_per_node=8 \
--nnodes=1 \
--use_env main.py \
--method vidt \
--backbone_name swin_nano \
--batch_size 2 \
--num_workers 2 \
--aux_loss True \
--with_box_refine True \
--det_token_num 100 \
--coco_path /path/to/coco \
--resume /path/to/vidt_nano \
--pre_trained none \
--eval True
ViDT (Swin-tiny) model on COCO:
python -m torch.distributed.launch \
--nproc_per_node=8 \
--nnodes=1 \
--use_env main.py \
--method vidt \
--backbone_name swin_tiny \
--batch_size 2 \
--num_workers 2 \
--aux_loss True \
--with_box_refine True \
--det_token_num 100 \
--coco_path /path/to/coco \
--resume /path/to/vidt_tiny \
--pre_trained none \
--eval True
ViDT (Swin-small) model on COCO:
python -m torch.distributed.launch \
--nproc_per_node=8 \
--nnodes=1 \
--use_env main.py \
--method vidt \
--backbone_name swin_small \
--batch_size 2 \
--num_workers 2 \
--aux_loss True \
--with_box_refine True \
--det_token_num 100 \
--coco_path /path/to/coco \
--resume /path/to/vidt_small \
--pre_trained none \
--eval True
ViDT (Swin-base) model on COCO:
python -m torch.distributed.launch \
--nproc_per_node=8 \
--nnodes=1 \
--use_env main.py \
--method vidt \
--backbone_name swin_base_win7_22k \
--batch_size 2 \
--num_workers 2 \
--aux_loss True \
--with_box_refine True \
--det_token_num 100 \
--coco_path /path/to/coco \
--resume /path/to/vidt_base \
--pre_trained none \
--eval True
Please consider citing our papers if they are useful in your research.
@inproceedings{song2022vidt,
title={ViDT: An Efficient and Effective Fully Transformer-based Object Detector},
author={Song, Hwanjun and Sun, Deqing and Chun, Sanghyuk and Jampani, Varun and Han, Dongyoon and Heo, Byeongho and Kim, Wonjae and Yang, Ming-Hsuan},
booktitle={International Conference on Learning Representations},
year={2022}
}
@article{song2022vidtplus,
title={An Extendable, Efficient and Effective Transformer-based Object Detector},
author={Song, Hwanjun and Sun, Deqing and Chun, Sanghyuk and Jampani, Varun and Han, Dongyoon and Heo, Byeongho and Kim, Wonjae and Yang, Ming-Hsuan},
journal={arXiv preprint arXiv:2204.07962},
year={2022}
}
Copyright 2021-present NAVER Corp.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.