This repo is the codebase of the ViTT: Vision Transformer Tracker model. ViTT uses a Transformer as the backbone network to build a multi-task learning model that detects objects and extracts appearance embeddings simultaneously in a single network. Our work demonstrates the effectiveness of Transformer-based networks in complex computer vision tasks and paves the way for the application of pure Transformers in MOT.
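To make the multi-task idea concrete, here is a minimal PyTorch sketch of a Transformer backbone feeding a detection head and an appearance-embedding head in parallel. This is an illustrative toy, not the actual ViTT architecture; all layer sizes and names are assumptions:

```python
import torch
import torch.nn as nn

class MultiTaskTracker(nn.Module):
    """Toy sketch: one shared Transformer backbone, two task heads
    (detection + appearance embedding), as in the multi-task setup
    described above. Dimensions are illustrative, not ViTT's."""

    def __init__(self, dim=256, num_classes=1, emb_dim=128, num_layers=6):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=num_layers)
        # Detection head: per-token class scores plus 4 box coordinates.
        self.det_head = nn.Linear(dim, num_classes + 4)
        # Embedding head: per-token appearance descriptor for association.
        self.emb_head = nn.Linear(dim, emb_dim)

    def forward(self, tokens):  # tokens: (B, N, dim) patch embeddings
        feats = self.backbone(tokens)
        det = self.det_head(feats)  # (B, N, num_classes + 4)
        emb = nn.functional.normalize(self.emb_head(feats), dim=-1)
        return det, emb

model = MultiTaskTracker()
det, emb = model(torch.randn(2, 196, 256))  # e.g. 14x14 patch grid
```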
pip install motmetrics
pip install cython_bbox
python track.py --cfg ./path/to/model/cfg --weights /path/to/model/weights
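The motmetrics package installed above can be used to score tracker output against ground truth. Below is a minimal, self-contained sketch of its standard API; the boxes, IDs, and the max_iou threshold are illustrative, and this is not a script shipped with this repo:

```python
import numpy as np
import motmetrics as mm

# Accumulate frame-by-frame matches between ground truth and tracker output.
acc = mm.MOTAccumulator(auto_id=True)

# One frame: two GT identities, two hypotheses, IoU-based distance matrix.
gt_boxes = np.array([[10, 10, 20, 40], [60, 10, 20, 40]])   # (x, y, w, h)
hyp_boxes = np.array([[12, 11, 20, 40], [61, 12, 20, 40]])
dists = mm.distances.iou_matrix(gt_boxes, hyp_boxes, max_iou=0.5)
acc.update([1, 2], [1, 2], dists)  # GT ids, hypothesis ids, distances

mh = mm.metrics.create()
summary = mh.compute(acc, metrics=["mota", "motp", "num_switches"], name="demo")
print(mm.io.render_summary(summary))
```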
In cfg/ccmcpe.json, configure the training/validation combinations. A dataset is represented by an image list; see data/*.train for an example. Then start training:
python train.py --cfg ./path/to/model/cfg
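For orientation, a minimal sketch of what such a configuration could look like, assuming the JSON maps dataset names to image-list files under a shared root; the keys and paths here are illustrative placeholders, not values from the repo:

```json
{
    "root": "/path/to/datasets",
    "train": {
        "mot17": "./data/mot17.train",
        "custom": "./data/custom.train"
    },
    "test": {
        "mot17": "./data/mot17.val"
    }
}
```

Each referenced .train file is just an image list, one frame path per line, as noted above.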
We use 8x Nvidia Titan Xp GPUs to train the model, with a batch size of 32. You can adjust the batch size (and the learning rate with it) according to how many GPUs you have. You can also train with a smaller image size, which yields faster inference, but note that the image size should be a multiple of 32 (the down-sampling rate).
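As a concrete example of that adjustment, the snippet below applies the common linear learning-rate scaling heuristic and snaps an image dimension to a multiple of 32. The linear rule and the reference learning rate are assumptions for illustration, not values specified by this repo:

```python
REF_BATCH, REF_LR = 32, 1e-2  # README's batch size; the LR value is illustrative

def scaled_lr(batch_size, ref_batch=REF_BATCH, ref_lr=REF_LR):
    # Linear scaling heuristic: LR grows/shrinks proportionally with batch size.
    return ref_lr * batch_size / ref_batch

def snap_to_stride(size, stride=32):
    # Round an image dimension down to the nearest multiple of the stride.
    return (size // stride) * stride

print(scaled_lr(8))          # 0.0025 for a single-GPU batch of 8
print(snap_to_stride(1000))  # 992
```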
Adding custom datasets is quite simple: all you need to do is organize your annotation files in the same format as in our training sets.
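Concretely, the requirement visible above is the image list; a hypothetical data/custom.train could simply enumerate one frame path per line (paths are placeholders):

```
/path/to/datasets/custom/images/00001.jpg
/path/to/datasets/custom/images/00002.jpg
```

Then register the new list under the train section of cfg/ccmcpe.json, as in the sketch above.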
If you find this repo useful in your project or research, please consider citing it:
@article{ViTT,
title={ViTT: Vision Transformer Tracker},
author={Zhu, Xiaoning and Jia, Yannan and Sun, Jian and Zhang, Pu},
journal={Sensors},
year={2021}
}