by Qianyu Zhou, Xiangtai Li, Lu He, Yibo Yang, Guangliang Cheng, Yunhai Tong, Lizhuang Ma, Dacheng Tao
(TPAMI 2023) TransVOD: End-to-End Video Object Detection with Spatial-Temporal Transformers.
:bell: We are happy to announce that TransVOD was accepted by IEEE TPAMI.
:bell: We are happy to announce that our method is the first work to achieve 90% mAP on the ImageNet VID dataset.
If you find TransVOD useful in your research, please consider citing:
@article{zhou2022transvod,
author={Zhou, Qianyu and Li, Xiangtai and He, Lu and Yang, Yibo and Cheng, Guangliang and Tong, Yunhai and Ma, Lizhuang and Tao, Dacheng},
journal={IEEE Transactions on Pattern Analysis and Machine Intelligence},
title={TransVOD: End-to-End Video Object Detection With Spatial-Temporal Transformers},
year={2022},
pages={1-16},
doi={10.1109/TPAMI.2022.3223955}}
@inproceedings{he2021end,
title={End-to-End Video Object Detection with Spatial-Temporal Transformers},
author={He, Lu and Zhou, Qianyu and Li, Xiangtai and Niu, Li and Cheng, Guangliang and Li, Xiao and Liu, Wenxuan and Tong, Yunhai and Ma, Lizhuang and Zhang, Liqing},
booktitle={Proceedings of the 29th ACM International Conference on Multimedia},
pages={1507--1516},
year={2021}
}
Our proposed method, TransVOD Lite, achieves the best trade-off between speed and accuracy across different backbones. SwinB, SwinS and SwinT denote Swin Base, Small and Tiny.
Note:
The codebase is built on top of Deformable DETR and TransVOD.
Linux, CUDA>=9.2, GCC>=5.4
Python>=3.7
We recommend using Anaconda to create a conda environment:
conda create -n TransVOD python=3.7 pip
Then, activate the environment:
conda activate TransVOD
PyTorch>=1.5.1, torchvision>=0.6.1 (follow the official PyTorch installation instructions)
For example, if your CUDA version is 9.2, you can install PyTorch and torchvision as follows:
conda install pytorch=1.5.1 torchvision=0.6.1 cudatoolkit=9.2 -c pytorch
Other requirements
pip install -r requirements.txt
Build MultiScaleDeformableAttention
cd ./models/ops
sh ./make.sh
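After the build step, it can help to confirm that the environment meets the version requirements listed above. The following is a minimal sketch, not part of the repository: the helper names `parse_version`, `meets_minimum` and `check_environment` are hypothetical, and the assumption that the compiled extension is importable as `MultiScaleDeformableAttention` is based only on the name of the build step.

```python
import importlib
import importlib.util
import re
import sys


def parse_version(text):
    """Turn '1.5.1+cu92' into (1, 5, 1), ignoring any trailing suffix."""
    match = re.match(r"(\d+(?:\.\d+)*)", text)
    if match is None:
        raise ValueError(f"cannot parse version: {text!r}")
    return tuple(int(part) for part in match.group(1).split("."))


def meets_minimum(installed, minimum):
    """Return True if the installed version is at least the minimum."""
    return parse_version(installed) >= parse_version(minimum)


def check_environment():
    """Collect human-readable problems; an empty list means all checks passed."""
    problems = []
    if sys.version_info < (3, 7):
        problems.append("Python >= 3.7 is required")
    # Minimum versions taken from the requirements above.
    for name, minimum in (("torch", "1.5.1"), ("torchvision", "0.6.1")):
        if importlib.util.find_spec(name) is None:
            problems.append(f"{name} is not installed")
            continue
        module = importlib.import_module(name)
        if not meets_minimum(module.__version__, minimum):
            problems.append(f"{name} {module.__version__} is older than {minimum}")
    # Assumption: the extension built by make.sh is importable under this name.
    if importlib.util.find_spec("MultiScaleDeformableAttention") is None:
        problems.append("MultiScaleDeformableAttention is not importable")
    return problems


if __name__ == "__main__":
    for problem in check_environment():
        print("WARNING:", problem)
```

Run it inside the activated `TransVOD` conda environment; no output means none of the checks failed. The script does not check CUDA or GCC versions.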
Below, we provide checkpoints, training logs and inference logs of TransVOD Lite for different backbones.
Download link: Baidu Netdisk (password: 26xc)
code_root/
└── data/
    └── vid/
        ├── Data
        │   ├── VID/
        │   └── DET/
        └── annotations/
            ├── imagenet_vid_train.json
            ├── imagenet_vid_train_joint_30.json
            └── imagenet_vid_val.json
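Before starting a long training run, the dataset layout above can be checked with a few lines of Python. This is a minimal sketch rather than a script shipped with the repository: `EXPECTED_PATHS` and `missing_paths` are hypothetical names, and the check only tests that each entry exists, not that its contents are valid.

```python
from pathlib import Path

# Relative paths taken from the directory tree above.
EXPECTED_PATHS = (
    "data/vid/Data/VID",
    "data/vid/Data/DET",
    "data/vid/annotations/imagenet_vid_train.json",
    "data/vid/annotations/imagenet_vid_train_joint_30.json",
    "data/vid/annotations/imagenet_vid_val.json",
)


def missing_paths(code_root):
    """Return the expected entries that do not exist under code_root."""
    root = Path(code_root)
    return [rel for rel in EXPECTED_PATHS if not (root / rel).exists()]


if __name__ == "__main__":
    import sys

    # Use the first command-line argument as code_root, else the current directory.
    root = sys.argv[1] if len(sys.argv) > 1 else "."
    missing = missing_paths(root)
    for rel in missing:
        print("missing:", rel)
    print("layout OK" if not missing else f"{len(missing)} entries missing")
```

Pass the path to `code_root` as the first argument, or run the script from inside `code_root` with no arguments.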
We use Swin Transformer as the network backbone. To train TransVOD with Swin-Base as the backbone, run:
GPUS_PER_NODE=8 ./tools/run_dist_launch.sh $1 swinb $2 configs/swinb_train_single.sh
GPUS_PER_NODE=8 ./tools/run_dist_launch.sh $1 swinb $2 configs/swinb_train_multi.sh
If you are using a Slurm cluster, you can run the following command to train on 1 node with 8 GPUs:
GPUS_PER_NODE=8 ./tools/run_dist_slurm.sh <partition> swinb 8 configs/swinb_train_multi.sh
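If the launches are driven from Python (for example, to queue the single-frame and multi-frame configs one after the other), the non-Slurm commands above can be assembled as follows. This is a sketch under stated assumptions: `build_launch` is a hypothetical helper, and `first_arg` and `third_arg` stand in for the `$1` and `$2` placeholders, whose meaning is not specified here, so they are passed through unchanged.

```python
import os


def build_launch(first_arg, third_arg, config, tag="swinb", gpus_per_node=8):
    """Return (env, argv) mirroring the run_dist_launch.sh commands above."""
    # GPUS_PER_NODE is set as an environment variable, as in the shell commands.
    env = dict(os.environ, GPUS_PER_NODE=str(gpus_per_node))
    argv = ["./tools/run_dist_launch.sh", str(first_arg), tag, str(third_arg), config]
    return env, argv


if __name__ == "__main__":
    for config in ("configs/swinb_train_single.sh", "configs/swinb_train_multi.sh"):
        env, argv = build_launch("$1", "$2", config)
        # Print instead of executing; replace "$1" and "$2" with real values first.
        print(f"GPUS_PER_NODE={env['GPUS_PER_NODE']}", " ".join(argv))
```

To actually start a run, pass the result to `subprocess.run(argv, env=env, check=True)` from the repository root.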
You can get the config file and pretrained model of TransVOD (the link is in the "Checkpoint" section), then put the pretrained model into the corresponding folder.
code_root/
└── exps/
    └── our_models/
        ├── COCO_pretrained_model
        ├── exps_single
        └── exps_multi
Then run the following command to evaluate it on the ImageNet VID validation set:
GPUS_PER_NODE=8 ./tools/run_dist_launch.sh $1 eval_swinb $2 configs/swinb_eval_multi.sh
This project is based on the following open-source projects. We thank their authors for making the source code publicly available.
This project is released under the Apache License 2.0, while some specific features in this repository are under other licenses. Please check LICENSES.md carefully if you are using our code for commercial purposes.