This repository is an official implementation of the paper MOTR: End-to-End Multiple-Object Tracking with TRansformer.
TL;DR. MOTR is a fully end-to-end multiple-object tracking framework based on the Transformer. It directly outputs the tracks within video sequences without any association procedure.
Abstract. The key challenge in the multiple-object tracking task is temporal modeling of the objects under track. Existing tracking-by-detection methods adopt simple heuristics, such as spatial or appearance similarity. Such methods, in spite of their commonality, are overly simple and lack the ability to learn temporal variations from data in an end-to-end manner. In this paper, we present MOTR, a fully end-to-end multiple-object tracking framework. It learns to model the long-range temporal variation of the objects. It performs temporal association implicitly and avoids previous explicit heuristics. Built upon DETR, MOTR introduces the concept of "track query". Each track query models the entire track of an object. It is transferred and updated frame-by-frame to perform iterative predictions in a seamless manner. Tracklet-aware label assignment is proposed for one-to-one assignment between track queries and object tracks. A temporal aggregation network, together with a collective average loss, is further proposed to enhance the long-range temporal relation. Experimental results show that MOTR achieves competitive performance and can serve as a strong Transformer-based baseline for future research.
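The frame-by-frame hand-off of track queries described in the abstract can be illustrated with a toy loop. This is only a sketch of the idea, not the repository's implementation: the names `track_video` and `toy_decoder`, the list-of-floats query type, and the single score threshold are all assumptions made for illustration, and the decoder is a stub standing in for the real Transformer.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical stand-ins: a "query" is a plain list of floats, and the decoder
# is any callable returning one (updated_query, score) pair per input query.
Query = List[float]
Decoder = Callable[[int, List[Query]], List[Tuple[Query, float]]]


def track_video(num_frames: int, detect_queries: List[Query],
                decoder: Decoder, score_thresh: float = 0.5) -> List[List[int]]:
    """Return the active track ids per frame.

    Identity is carried by the query itself, so no matching step between
    detections and existing tracks is needed.
    """
    tracks: Dict[int, Query] = {}  # track id -> current track query
    next_id = 0
    history: List[List[int]] = []
    for frame in range(num_frames):
        track_ids = list(tracks)
        # Track queries from the previous frame come first, then detect queries.
        queries = [tracks[i] for i in track_ids] + detect_queries
        outputs = decoder(frame, queries)
        kept: Dict[int, Query] = {}
        # An existing track keeps its id while its score stays high enough.
        for tid, (query, score) in zip(track_ids, outputs[:len(track_ids)]):
            if score >= score_thresh:
                kept[tid] = query
        # A confident detect query starts a new track with a fresh id.
        for query, score in outputs[len(track_ids):]:
            if score >= score_thresh:
                kept[next_id] = query
                next_id += 1
        tracks = kept
        history.append(sorted(tracks))
    return history


def toy_decoder(frame: int, queries: List[Query]) -> List[Tuple[Query, float]]:
    """Scripted scores: an object appears, persists, then is replaced."""
    scripted = {0: [0.9], 1: [0.9, 0.1], 2: [0.2, 0.8]}
    return list(zip(queries, scripted[frame]))


if __name__ == "__main__":
    print(track_video(3, [[0.0]], toy_decoder))
```

The single threshold used for both starting and ending a track is a simplification chosen to keep the sketch short.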
Method | Dataset | Train Data | HOTA | DetA | AssA | MOTA | IDF1 | IDS | URL |
---|---|---|---|---|---|---|---|---|---|
MOTR | MOT17 | MOT17+CrowdHuman Val | 57.8 | 60.3 | 55.7 | 73.4 | 68.6 | 2439 | model |

Method | Dataset | Train Data | HOTA | DetA | AssA | MOTA | IDF1 | URL |
---|---|---|---|---|---|---|---|---|
MOTR | DanceTrack | DanceTrack | 54.2 | 73.5 | 40.2 | 79.7 | 51.5 | model |

Method | Dataset | Train Data | MOTA | IDF1 | IDS | URL |
---|---|---|---|---|---|---|
MOTR | BDD100K | BDD100K | 32.0 | 43.5 | 3493 | model |
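As a reading aid for the HOTA, DetA and AssA columns: at each localization threshold HOTA is the geometric mean of DetA and AssA, and the reported score averages over thresholds. The geometric mean of the reported DetA and AssA should therefore land close to the reported HOTA, though not exactly on it. The sketch below checks this for the two rows that list all three values; the helper name `hota_estimate` is introduced here for illustration only.

```python
import math


def hota_estimate(det_a: float, ass_a: float) -> float:
    """Geometric mean of DetA and AssA; an approximation of the averaged HOTA."""
    return math.sqrt(det_a * ass_a)


# (reported HOTA, DetA, AssA) copied from the tables above.
reported = {
    "MOT17": (57.8, 60.3, 55.7),
    "DanceTrack": (54.2, 73.5, 40.2),
}

for name, (hota, det_a, ass_a) in reported.items():
    estimate = hota_estimate(det_a, ass_a)
    print(f"{name}: reported HOTA {hota}, sqrt(DetA * AssA) = {estimate:.2f}")
```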
Note:
The codebase is built on top of Deformable DETR.
Linux, CUDA>=9.2, GCC>=5.4
Python>=3.7
We recommend using Anaconda to create a conda environment:
conda create -n deformable_detr python=3.7 pip
Then, activate the environment:
conda activate deformable_detr
PyTorch>=1.5.1, torchvision>=0.6.1 (following instructions here)
For example, if your CUDA version is 9.2, you can install PyTorch and torchvision as follows:
conda install pytorch=1.5.1 torchvision=0.6.1 cudatoolkit=9.2 -c pytorch
Other requirements
pip install -r requirements.txt
Build MultiScaleDeformableAttention
cd ./models/ops
sh ./make.sh
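Before building the ops, it can help to confirm that the environment meets the minimum versions listed above. The following is a minimal sketch, not part of the repository: the helper names `parse_version`, `meets_minimum` and `check_environment` are made up for this example, and the version parsing is deliberately simple (it ignores local suffixes such as `+cu92`).

```python
import importlib
import sys
from typing import Dict, Tuple

# Minimum versions taken from the requirements listed above.
MINIMUMS = {"torch": "1.5.1", "torchvision": "0.6.1"}


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn '1.5.1+cu92' into (1, 5, 1); stop at the first non-numeric part."""
    parts = []
    for piece in text.split("+")[0].split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def meets_minimum(found: str, minimum: str) -> bool:
    """Compare two dotted version strings numerically."""
    return parse_version(found) >= parse_version(minimum)


def check_environment() -> Dict[str, bool]:
    """Report which requirements are satisfied in the current interpreter."""
    report = {"python": sys.version_info >= (3, 7)}
    for name, minimum in MINIMUMS.items():
        try:
            module = importlib.import_module(name)
        except (ImportError, OSError):
            # Package missing, or its native libraries failed to load.
            report[name] = False
            continue
        report[name] = meets_minimum(module.__version__, minimum)
    return report


if __name__ == "__main__":
    for name, ok in check_environment().items():
        print(f"{name}: {'ok' if ok else 'missing or too old'}")
```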
.
├── crowdhuman
│   ├── images
│   └── labels_with_ids
├── MOT15
│   ├── images
│   ├── labels_with_ids
│   ├── test
│   └── train
├── MOT17
│   ├── images
│   └── labels_with_ids
├── DanceTrack
│   ├── train
│   └── test
└── bdd100k
    ├── images
    │   └── track
    │       ├── train
    │       └── val
    └── labels
        └── track
            ├── train
            └── val
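A quick way to check a local copy against this layout is to list the expected directories that are missing. The sketch below is not part of the repository. It assumes that `track/train` and `track/val` sit inside both `bdd100k/images` and `bdd100k/labels`, and the names `EXPECTED_DIRS` and `missing_dirs` are introduced for illustration only.

```python
import sys
from pathlib import Path
from typing import List, Sequence

# Directories from the tree above, relative to the data root.
# The nesting under bdd100k is an assumption about the intended layout.
EXPECTED_DIRS = [
    "crowdhuman/images",
    "crowdhuman/labels_with_ids",
    "MOT15/images",
    "MOT15/labels_with_ids",
    "MOT15/test",
    "MOT15/train",
    "MOT17/images",
    "MOT17/labels_with_ids",
    "DanceTrack/train",
    "DanceTrack/test",
    "bdd100k/images/track/train",
    "bdd100k/images/track/val",
    "bdd100k/labels/track/train",
    "bdd100k/labels/track/val",
]


def missing_dirs(data_root: str, expected: Sequence[str] = EXPECTED_DIRS) -> List[str]:
    """Return the expected sub-directories that do not exist under data_root."""
    root = Path(data_root)
    return [rel for rel in expected if not (root / rel).is_dir()]


if __name__ == "__main__":
    # Usage: python check_layout.py /path/to/data  (script name is hypothetical)
    root_arg = sys.argv[1] if len(sys.argv) > 1 else "."
    for rel in missing_dirs(root_arg):
        print(f"missing: {rel}")
```

Only the datasets you actually train or evaluate on need to be present, so a non-empty result is not necessarily an error.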
cd datasets/data_path
python3 generate_bdd100k_mot.py
cd ../../
You can download the COCO-pretrained weights from Deformable DETR. Then train MOTR on 8 GPUs as follows:
sh configs/r50_motr_train.sh
You can download the pretrained MOTR model (the link is in the "Main Results" section), then run the following command to evaluate it on the MOT15 training set:
sh configs/r50_motr_eval.sh
To visualize the results as a demo video, enable vis=True in eval.py, for example:
det.detect(vis=True)
You can download the pretrained MOTR model (the link is in the "Main Results" section), then run the following command to evaluate it on the MOT17 test set (for submission to the server):
sh configs/r50_motr_submit.sh
For the BDD100K dataset, please refer to motr_bdd100k.
We also provide a demo interface that allows quick processing of a given video.
EXP_DIR=exps/e2e_motr_r50_joint
python3 demo.py \
--meta_arch motr \
--dataset_file e2e_joint \
--epoch 200 \
--with_box_refine \
--lr_drop 100 \
--lr 2e-4 \
--lr_backbone 2e-5 \
--pretrained ${EXP_DIR}/motr_final.pth \
--output_dir ${EXP_DIR} \
--batch_size 1 \
--sample_mode 'random_interval' \
--sample_interval 10 \
--sampler_steps 50 90 120 \
--sampler_lengths 2 3 4 5 \
--update_query_pos \
--merger_dropout 0 \
--dropout 0 \
--random_drop 0.1 \
--fp_ratio 0.3 \
--query_interaction_layer 'QIM' \
--extra_track_attn \
--resume ${EXP_DIR}/motr_final.pth \
--input_video figs/demo.avi
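To run the demo on several videos without retyping the flags, the command above can be assembled in Python. This is a convenience sketch rather than something the repository ships: `build_demo_command` is a hypothetical helper, it simply mirrors the flags shown above, and it assumes it is run from the repository root so that `demo.py` resolves.

```python
import subprocess
import sys
from typing import List


def build_demo_command(video: str, exp_dir: str = "exps/e2e_motr_r50_joint") -> List[str]:
    """Assemble the demo.py invocation shown above for one input video."""
    checkpoint = f"{exp_dir}/motr_final.pth"
    return [
        sys.executable, "demo.py",
        "--meta_arch", "motr",
        "--dataset_file", "e2e_joint",
        "--epoch", "200",
        "--with_box_refine",
        "--lr_drop", "100",
        "--lr", "2e-4",
        "--lr_backbone", "2e-5",
        "--pretrained", checkpoint,
        "--output_dir", exp_dir,
        "--batch_size", "1",
        "--sample_mode", "random_interval",
        "--sample_interval", "10",
        "--sampler_steps", "50", "90", "120",
        "--sampler_lengths", "2", "3", "4", "5",
        "--update_query_pos",
        "--merger_dropout", "0",
        "--dropout", "0",
        "--random_drop", "0.1",
        "--fp_ratio", "0.3",
        "--query_interaction_layer", "QIM",
        "--extra_track_attn",
        "--resume", checkpoint,
        "--input_video", video,
    ]


if __name__ == "__main__":
    # Each command-line argument is treated as a path to one video.
    for video_path in sys.argv[1:]:
        subprocess.run(build_demo_command(video_path), check=True)
```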
If you find MOTR useful in your research, please consider citing:
@inproceedings{zeng2021motr,
title={MOTR: End-to-End Multiple-Object Tracking with TRansformer},
author={Zeng, Fangao and Dong, Bin and Zhang, Yuang and Wang, Tiancai and Zhang, Xiangyu and Wei, Yichen},
booktitle={European Conference on Computer Vision (ECCV)},
year={2022}
}