VarifocalNet: An IoU-aware Dense Object Detector

This repo hosts the code for VarifocalNet, as presented in our CVPR 2021 oral paper (https://arxiv.org/abs/2008.13367):

@inproceedings{zhang2020varifocalnet,
  title={VarifocalNet: An IoU-aware Dense Object Detector},
  author={Zhang, Haoyang and Wang, Ying and Dayoub, Feras and S{\"u}nderhauf, Niko},
  booktitle={CVPR},
  year={2021}
}

Introduction

Accurately ranking the vast number of candidate detections is crucial for dense object detectors to achieve high performance. In this work, we propose to learn IoU-aware classification scores (IACS) that simultaneously represent the object presence confidence and localization accuracy, to produce a more accurate ranking of detections in dense object detectors. In particular, we design a new loss function, named Varifocal Loss (VFL), for training a dense object detector to predict the IACS, and a new efficient star-shaped bounding box feature representation (the features at nine yellow sampling points) for estimating the IACS and refining coarse bounding boxes. Combining these two new components and a bounding box refinement branch, we build a new IoU-aware dense object detector based on the FCOS+ATSS architecture, which we call VarifocalNet, or VFNet for short. Extensive experiments on the MS COCO benchmark show that our VFNet consistently surpasses the strong baseline by ~2.0 AP with different backbones. Our best model, VFNet-X-1200 with Res2Net-101-DCN, reaches a single-model single-scale AP of 55.1 on COCO test-dev, achieving state-of-the-art performance among various object detectors.

Figure: Learning to predict the IoU-aware classification score.
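
For readers who want the Varifocal Loss in concrete form, here is a minimal per-element sketch following the definition in the paper: a positive sample (target q > 0, where q is the IoU between the predicted box and its ground truth) uses binary cross-entropy weighted by q, while a negative sample (q = 0) is down-weighted by alpha * p^gamma. The function name, the scalar interface and the clamping epsilon are illustrative assumptions, and alpha = 0.75, gamma = 2.0 are assumed defaults. This is not the repo's actual implementation, which operates on tensors.

```python
import math


def varifocal_loss(p, q, alpha=0.75, gamma=2.0, eps=1e-12):
    """Per-element Varifocal Loss (illustrative sketch).

    p: predicted IoU-aware classification score, in (0, 1).
    q: target score; the IoU with the ground truth for a positive
       sample, and 0 for a negative sample.
    alpha, gamma: assumed defaults, not read from any config file.
    """
    # Clamp p so that log() never receives 0.
    p = min(max(p, eps), 1.0 - eps)
    if q > 0:
        # Positive sample: binary cross-entropy with soft target q,
        # weighted by q so that high-IoU examples contribute more.
        return -q * (q * math.log(p) + (1.0 - q) * math.log(1.0 - p))
    # Negative sample: focal-style down-weighting by alpha * p**gamma.
    return -alpha * (p ** gamma) * math.log(1.0 - p)


if __name__ == "__main__":
    print(varifocal_loss(0.5, 0.0))   # a negative sample
    print(varifocal_loss(0.6, 0.8))   # a positive sample with IoU 0.8
```

Only the negative branch is scaled by p^gamma, so easy negatives are suppressed while positives keep their full, IoU-weighted contribution.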

Updates

2021.03.05 Our VarifocalNet has been accepted to CVPR 2021 as an oral presentation. Thanks to the reviewers and ACs.
2021.03.04 Update to MMDetection v2.10.0, add more results and training scripts, and update the arXiv paper.
2021.01.09 Add SWA training.
2021.01.07 Update to MMDetection v2.8.0.
2020.12.24 We release a new VFNet-X model that can achieve a single-model single-scale 55.1 AP on COCO test-dev at 4.2 FPS.
2020.12.02 Update to MMDetection v2.7.0.
2020.10.29 VarifocalNet has been merged into the official MMDetection repo. Many thanks to @yhcao6, @RyanXLi and @hellock!
2020.10.29 This repo has been refactored so that users can pull the latest updates from the upstream official MMDetection repo. The previous one can be found in the old branch.

Installation

This VarifocalNet implementation is based on MMDetection, so the installation is the same as for the original MMDetection.
Please check get_started.md for installation. Note that you should change the versions of PyTorch and CUDA to yours when installing mmcv in step 3, and clone this repo instead of MMDetection in step 4.

If you run into problems with pycocotools, please install it by:

pip install "git+https://github.com/open-mmlab/cocoapi.git#subdirectory=pycocotools"

A Quick Demo

Once the installation is done, you can follow the steps below to run a quick demo.

Download the model and put it into one folder under the root directory of this project, say, checkpoints/.
Go to the root directory of this project in a terminal and activate the corresponding virtual environment.

Run

python demo/image_demo.py demo/demo.jpg configs/vfnet/vfnet_r50_fpn_1x_coco.py checkpoints/vfnet_r50_1x_41.6.pth

and you should see an image with detections.

Usage of MMDetection

Please see exist_data_model.md for the basic usage of MMDetection. The MMDetection team also provides a Colab tutorial for beginners.

For troubleshooting, please refer to faq.md.

Results and Models

For your convenience, we provide the following trained models. These models are trained with a mini-batch size of 16 images on 8 Nvidia V100 GPUs (2 images per GPU).

Backbone	Style	DCN	MS train	Lr schd	Inf time (fps)	box AP (val)	box AP (test-dev)	Download
R-50	pytorch	N	N	1x	19.4	41.6	41.6	model | log
R-50	pytorch	N	Y	2x	19.3	44.5	44.8	model | log
R-50	pytorch	Y	Y	2x	16.3	47.8	48.0	model | log
R-101	pytorch	N	N	1x	15.5	43.0	43.6	model | log
R-101	pytorch	N	N	2x	15.6	43.5	43.9	model | log
R-101	pytorch	N	Y	2x	15.6	46.2	46.7	model | log
R-101	pytorch	Y	Y	2x	12.6	49.0	49.2	model | log
X-101-32x4d	pytorch	N	Y	2x	13.1	47.4	47.6	model | log
X-101-32x4d	pytorch	Y	Y	2x	10.1	49.7	50.0	model | log
X-101-64x4d	pytorch	N	Y	2x	9.2	48.2	48.5	model | log
X-101-64x4d	pytorch	Y	Y	2x	6.7	50.4	50.8	model | log
R2-101	pytorch	N	Y	2x	13.0	49.2	49.3	model | log
R2-101	pytorch	Y	Y	2x	10.3	51.1	51.3	model | log

Notes:

The MS-train maximum scale range is 1333x[480:960] (range mode), and the inference scale is kept at 1333x800.
The R2-101 backbone is Res2Net-101.
DCN means using DCNv2 in both backbone and head.
The inference speed is tested with an Nvidia V100 GPU on HPC (log file).
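
To make the 1333x[480:960] range mode concrete, the sketch below shows a Resize step in the MMDetection 2.x config style, together with a small helper that mimics how one training scale is drawn from that range. The helper sample_range_scale is a hypothetical illustration, not an MMDetection function, and the dict is an assumed fragment rather than a copy of this repo's config.

```python
import random

# Assumed MMDetection 2.x-style pipeline step for multi-scale training.
ms_train_resize = dict(
    type='Resize',
    img_scale=[(1333, 480), (1333, 960)],  # the two endpoints of the range
    multiscale_mode='range',               # sample between the endpoints
    keep_ratio=True,
)


def sample_range_scale(img_scale, rng=random):
    """Hypothetical helper: draw one (long_edge, short_edge) pair.

    Each edge is sampled uniformly (inclusive) between the smallest and
    largest value that edge takes across the given endpoints.
    """
    long_edges = [max(scale) for scale in img_scale]
    short_edges = [min(scale) for scale in img_scale]
    long_edge = rng.randint(min(long_edges), max(long_edges))
    short_edge = rng.randint(min(short_edges), max(short_edges))
    return long_edge, short_edge


if __name__ == "__main__":
    # The long edge stays at 1333; the short edge varies within [480, 960].
    print(sample_range_scale(ms_train_resize['img_scale']))
```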

We also provide the models of RetinaNet, FoveaBox, RepPoints and ATSS trained with the Focal Loss (FL) and our Varifocal Loss (VFL).

Method	Backbone	MS train	Lr schd	box AP (val)	Download
RetinaNet + FL	R-50	N	1x	36.5	model | log
RetinaNet + VFL	R-50	N	1x	37.4	model | log
FoveaBox + FL	R-50	N	1x	36.3	model | log
FoveaBox + VFL	R-50	N	1x	37.2	model | log
RepPoints + FL	R-50	N	1x	38.3	model | log
RepPoints + VFL	R-50	N	1x	39.7	model | log
ATSS + FL	R-50	N	1x	39.3	model | log
ATSS + VFL	R-50	N	1x	40.2	model | log

Notes:

We use 4 P100 GPUs to train these models (except ATSS, which uses 8x2) with a mini-batch size of 16 images (4 images per GPU), as we found that 4x4 training yielded slightly better results than 8x2 training.
You can find the corresponding config files in configs/vfnet.
The use_vfl flag in those config files controls whether the Varifocal Loss is used in training.

VFNet-X

Backbone	DCN	MS train	Training	Inf scale	Inf time (fps)	box AP (val)	box AP (test-dev)	Download
R2-101	Y	Y	41e + SWA 18e	1333x800	8.0	53.4	53.7	model | config
R2-101	Y	Y	41e + SWA 18e	1800x1200	4.2	54.5	55.1

Notes:

We implement some improvements to the original VFNet. This version of VFNet is called VFNet-X and these improvements include:

PAFPN. We replace the FPN with the PAFPNX (minor modifications are made to the original PAFPN), and apply the DCN and group normalization (GN) in it.
More and Wider Conv Layers. We stack 4 convolution layers in the detection head, instead of 3 layers in the original VFNet, and increase the original 256 feature channels to 384 channels.
RandomCrop and Cutout. We employ the random crop and cutout as additional data augmentation methods.
Wider MSTrain Scale Range and Longer Training. We adopt a wider MSTrain scale range, from 750x500 to 2100x1400, and initially train the VFNet-X for 41 epochs.
SWA. We apply Stochastic Weight Averaging (SWA) in training the VFNet-X (for another 18 epochs), which brings a 1.2 AP gain. Please see our work on SWA Object Detection for more details.
Soft-NMS. We apply soft-NMS in inference.

For more detailed information, please see the VFNet-X config file.
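
As a rough illustration of the weight-averaging step behind SWA, the sketch below averages several checkpoints element by element. It assumes one checkpoint is saved per SWA epoch and represents each checkpoint as a plain dict of parameter name to a list of floats. The function name swa_average and this data layout are hypothetical, and real checkpoints hold tensors rather than lists.

```python
def swa_average(checkpoints):
    """Average parameters across checkpoints (illustrative sketch).

    checkpoints: non-empty list of dicts mapping a parameter name to a
    flat list of floats. All dicts are assumed to share the same keys
    and list lengths.
    """
    if not checkpoints:
        raise ValueError("need at least one checkpoint to average")
    count = len(checkpoints)
    averaged = {}
    for name in checkpoints[0]:
        # Group the i-th value of this parameter from every checkpoint.
        columns = zip(*(ckpt[name] for ckpt in checkpoints))
        averaged[name] = [sum(column) / count for column in columns]
    return averaged


if __name__ == "__main__":
    epoch_a = {"head.weight": [1.0, 2.0], "head.bias": [0.0]}
    epoch_b = {"head.weight": [3.0, 4.0], "head.bias": [1.0]}
    print(swa_average([epoch_a, epoch_b]))
```

The averaged weights form a single model, so inference cost is unchanged compared with any one of the input checkpoints.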

Inference

Assuming you have put the COCO dataset into data/coco/ and downloaded the models into checkpoints/, you can now evaluate the models on the COCO val2017 split:

./tools/dist_test.sh configs/vfnet/vfnet_r50_fpn_1x_coco.py checkpoints/vfnet_r50_1x_41.6.pth 8 --eval bbox

Notes:

If you have fewer than 8 GPUs available on your machine, please change 8 to the number of your GPUs.
If you want to evaluate a different model, please change the config file (in configs/vfnet) and the corresponding model weights file.
Test time augmentation is supported for the VarifocalNet, including multi-scale testing and flip testing. If you are interested, please refer to the example config file vfnet_r50_fpn_1x_coco_tta.py. More information about test time augmentation can be found in the official script test_time_aug.py.

Training

The following command line will train vfnet_r50_fpn_1x_coco on 8 GPUs:

./tools/dist_train.sh configs/vfnet/vfnet_r50_fpn_1x_coco.py 8

Notes:

The models will be saved into work_dirs/vfnet_r50_fpn_1x_coco.
To use fewer GPUs, please change 8 to the number of your GPUs. If you want to keep the mini-batch size at 16, you need to change samples_per_gpu and workers_per_gpu accordingly, so that samples_per_gpu x number_of_gpus = 16. In general, workers_per_gpu = samples_per_gpu.
If you use a different mini-batch size, please change the learning rate according to the Linear Scaling Rule, e.g., lr=0.01 for 8 GPUs x 2 img/gpu and lr=0.005 for 4 GPUs x 2 img/gpu.
To train the VarifocalNet with other backbones, please change the config file accordingly.
To train the VarifocalNet on your own dataset, please follow these instructions.
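
The batch-size and learning-rate notes above can be checked with a few lines of arithmetic. The sketch below takes the reference setting from the notes (lr=0.01 for a mini-batch of 16) and applies the Linear Scaling Rule. The helper names are hypothetical and are not part of MMDetection.

```python
BASE_LR = 0.01      # reference learning rate from the notes above
BASE_BATCH = 16     # reference mini-batch size (8 GPUs x 2 img/gpu)


def scaled_lr(num_gpus, samples_per_gpu, base_lr=BASE_LR, base_batch=BASE_BATCH):
    """Linear Scaling Rule: lr grows in proportion to the total batch size."""
    total_batch = num_gpus * samples_per_gpu
    return base_lr * total_batch / base_batch


def samples_per_gpu_for(num_gpus, total_batch=BASE_BATCH):
    """samples_per_gpu needed to keep the total mini-batch size fixed."""
    if num_gpus <= 0 or total_batch % num_gpus != 0:
        raise ValueError("total_batch must be divisible by num_gpus")
    return total_batch // num_gpus


if __name__ == "__main__":
    print(scaled_lr(8, 2))          # 8 GPUs x 2 img/gpu
    print(scaled_lr(4, 2))          # 4 GPUs x 2 img/gpu
    print(samples_per_gpu_for(4))   # keep the batch at 16 on 4 GPUs
```

With 4 GPUs, for example, keeping the mini-batch at 16 means samples_per_gpu = 4 and an unchanged lr of 0.01, whereas staying at 2 images per GPU halves both the batch size and the learning rate.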

Contributing

Any pull requests or issues are welcome.

Citation

Please consider citing our paper in your publications if the project helps your research. The BibTeX reference is as follows:

@inproceedings{zhang2020varifocalnet,
  title={VarifocalNet: An IoU-aware Dense Object Detector},
  author={Zhang, Haoyang and Wang, Ying and Dayoub, Feras and S{\"u}nderhauf, Niko},
  booktitle={CVPR},
  year={2021}
}

Acknowledgment

We would like to thank the MMDetection team for producing this great object detection toolbox!

License

This project is released under the Apache 2.0 license.
