Temporally Efficient Vision Transformer for Video Instance Segmentation (CVPR 2022, Oral)

Paper: https://arxiv.org/abs/2204.08412
License: MIT

by Shusheng Yang^1,3, Xinggang Wang^1 :email:, Yu Li^4, Yuxin Fang^1, Jiemin Fang^1,2, Wenyu Liu^1, Xun Zhao^3, Ying Shan^3.

^1 School of EIC, HUST, ^2 AIA, HUST, ^3 ARC Lab, Tencent PCG, ^4 IDEA.

(:email:) corresponding author.



Overall architecture (figure).

Models and Main Results

Name | AP | AP@50 | AP@75 | AR@1 | AR@10 | Params | model | submission
--- | --- | --- | --- | --- | --- | --- | --- | ---
TeViT_MsgShifT | 46.3 | 70.6 | 50.9 | 45.2 | 54.3 | 161.83 M | link | link
TeViT_MsgShifT_MST | 46.9 | 70.1 | 52.9 | 45.0 | 53.4 | 161.83 M | link | link

Name | AP | AP@50 | AP@75 | AR@1 | AR@10 | Params | model | submission
--- | --- | --- | --- | --- | --- | --- | --- | ---
TeViT_R50 | 42.1 | 67.8 | 44.8 | 41.3 | 49.9 | 172.3 M | link | link
TeViT_Swin-L_MST | 56.8 | 80.6 | 63.1 | 52.0 | 63.3 | 343.86 M | link | link

Installation

Prerequisites

Prepare

git clone https://github.com/hustvl/TeViT.git
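
TeViT is built on mmdetection (see the Acknowledgement section), so setting up the environment typically follows the standard mmdetection workflow. The sketch below is illustrative only: the package versions, the requirements.txt file, and the editable install are assumptions about the repository layout, not steps confirmed by this README.

# Environment setup sketch for an mmdetection-based project; versions are illustrative.
conda create -n tevit python=3.8 -y
conda activate tevit
pip install torch torchvision        # choose the build that matches your CUDA version
pip install mmcv-full                # core dependency of mmdetection-based code
cd TeViT
pip install -r requirements.txt      # assumes the repository ships a requirements file
pip install -v -e .                  # assumes an editable install is supported, as in mmdetection forks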

Inference

python tools/test_vis.py configs/tevit/tevit_msgshift.py $PATH_TO_CHECKPOINT

After inference, the predicted results are stored in results.json; submit this file to the evaluation server to obtain the final performance.
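
The YouTube-VIS evaluation servers generally accept a zip archive with results.json at its root; the packaging step below assumes that format, so check the server's instructions before uploading.

# Package the predictions for upload (assumes the server expects results.json at the archive root)
zip submission.zip results.json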

Training
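
This README does not spell out the training commands. Since the codebase follows mmdetection, training is most likely launched through the usual entry points; the commands below are a sketch under that assumption (script names and the GPU count are not confirmed here).

# Single-GPU training sketch (assumes the standard mmdetection tools/train.py entry point)
python tools/train.py configs/tevit/tevit_msgshift.py
# Multi-GPU sketch (assumes tools/dist_train.sh is present; 8 GPUs is illustrative)
bash tools/dist_train.sh configs/tevit/tevit_msgshift.py 8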

Acknowledgement :heart:

This code is mainly based on mmdetection and QueryInst. Thanks for their awesome work and great contributions to the computer vision community!

Citation

If you find our paper and code useful in your research, please consider giving it a star :star: and a citation :pencil::

@inproceedings{yang2022tevit,
  title     = {Temporally Efficient Vision Transformer for Video Instance Segmentation},
  author    = {Yang, Shusheng and Wang, Xinggang and Li, Yu and Fang, Yuxin and Fang, Jiemin and Liu, Wenyu and Zhao, Xun and Shan, Ying},
  booktitle = {Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR)},
  year      = {2022}
}