
VITT

Introduction

This repo is the codebase of the ViTT: Vision Transformer Tracker model. ViTT uses a Transformer as the backbone network to build a multi-task learning model that detects objects and extracts appearance embeddings simultaneously in a single network. Our work demonstrates the effectiveness of Transformer-based networks in complex computer vision tasks and paves the way for the application of pure Transformers in multi-object tracking (MOT).
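
The model itself is defined by the cfg passed on the command line below; purely as an illustration of the shared-backbone, two-head design described above, here is a minimal PyTorch sketch (module sizes, head layouts, and names are assumptions, not the repo's actual code):

import torch
import torch.nn as nn

class ViTTSketch(nn.Module):
    """Illustrative only: one shared Transformer trunk feeding a detection
    head and an appearance-embedding head, as the introduction describes."""
    def __init__(self, feat_dim=256, num_classes=1, emb_dim=128):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=4)   # Transformer backbone stand-in
        self.det_head = nn.Linear(feat_dim, num_classes + 4)         # class scores + box (cx, cy, w, h)
        self.emb_head = nn.Linear(feat_dim, emb_dim)                 # appearance embedding for association

    def forward(self, tokens):                    # tokens: (batch, num_patches, feat_dim)
        feats = self.backbone(tokens)             # shared features for both tasks
        det = self.det_head(feats)                # detection outputs per token
        emb = nn.functional.normalize(self.emb_head(feats), dim=-1)  # unit-norm embeddings
        return det, emb

det, emb = ViTTSketch()(torch.randn(2, 100, 256))  # det: (2, 100, 5), emb: (2, 100, 128)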

Requirements
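
The authoritative dependency list is in the repo; as a rough, assumed baseline, a PyTorch tracking codebase of this kind typically needs something like:

pip install torch torchvision opencv-python numpy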

Test on the MOT16 Challenge

python track.py --cfg /path/to/model/cfg --weights /path/to/model/weights

Training instruction

We use 8x Nvidia Titan Xp GPUs to train the model, with a batch size of 32. You can adjust the batch size (and the learning rate along with it) according to how many GPUs you have. You can also train with a smaller image size, which gives faster inference, but note that the image size should be a multiple of 32 (the down-sampling rate).
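
A common rule of thumb for the batch-size/learning-rate adjustment is linear scaling, and the multiple-of-32 constraint can be enforced by rounding; both are sketched below (the base values are assumptions, not the repo's defaults):

BASE_BATCH, BASE_LR = 32, 1e-3   # hypothetical reference values

def scaled_lr(batch_size, base_batch=BASE_BATCH, base_lr=BASE_LR):
    """Linear learning-rate scaling rule (a common convention, not repo-specified)."""
    return base_lr * batch_size / base_batch

def round_to_stride(size, stride=32):
    """Snap an image dimension to the nearest multiple of the down-sampling rate."""
    return max(stride, round(size / stride) * stride)

print(scaled_lr(16))         # 0.0005: half the batch, half the learning rate
print(round_to_stride(600))  # 608: nearest multiple of 32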

Train with custom datasets

Adding custom datasets is simple: all you need to do is organize your annotation files in the same format as our training sets.
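
The format itself is not reproduced on this page; purely for illustration, joint detection-and-embedding trackers in this family often use one text file per image, with one object per line holding normalized box coordinates plus a track identity (this layout is an assumption, check the repo's training sets for the real one):

# img000001.txt: class, identity, x_center, y_center, width, height (all normalized)
0 1 0.512 0.430 0.085 0.210
0 2 0.374 0.447 0.079 0.198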

Citation

If you find this repo useful in your project or research, please consider citing it:

@article{ViTT,
  title={ViTT: Vision Transformer Tracker},
  author={Xiaoning Zhu and Yannan Jia and Sun Jian and Zhang Pu},
  year={2019}
}