ByteTrack: Multi-Object Tracking by Associating Every Detection Box
Yifu Zhang, Peize Sun, Yi Jiang, Dongdong Yu, Fucheng Weng, Zehuan Yuan, Ping Luo, Wenyu Liu, Xinggang Wang
Google Colab Demo | Huggingface Demo | YouTube Tutorial | Original Paper: ByteTrack |
---|---|---|---|
arXiv 2110.06864 |
Multi-object tracking (MOT) aims at estimating bounding boxes and identities of objects in videos. Most methods obtain identities by associating detection boxes whose scores are higher than a threshold. The objects with low detection scores, e.g. occluded objects, are simply thrown away, which brings non-negligible true object missing and fragmented trajectories. To solve this problem, we present a simple, effective and generic association method, tracking by associating every detection box instead of only the high score ones. For the low score detection boxes, we utilize their similarities with tracklets to recover true objects and filter out the background detections. When applied to 9 different state-of-the-art trackers, our method achieves consistent improvement on IDF1 scores ranging from 1 to 10 points. To put forwards the state-of-the-art performance of MOT, we design a simple and strong tracker, named ByteTrack. For the first time, we achieve 80.3 MOTA, 77.3 IDF1 and 63.1 HOTA on the test set of MOT17 with 30 FPS running speed on a single V100 GPU.
Dataset | MOTA | IDF1 | HOTA | MT | ML | FP | FN | IDs | FPS |
---|---|---|---|---|---|---|---|---|---|
MOT17 | 80.3 | 77.3 | 63.1 | 53.2% | 14.5% | 25491 | 83721 | 2196 | 29.6 |
MOT20 | 77.8 | 75.2 | 61.3 | 69.2% | 9.5% | 26249 | 87594 | 1223 | 13.7 |
Step1. Install ByteTrack.
git clone https://github.com/ifzhang/ByteTrack.git
cd ByteTrack
pip3 install -r requirements.txt
python3 setup.py develop
Step2. Install pycocotools.
pip3 install cython; pip3 install 'git+https://github.com/cocodataset/cocoapi.git#subdirectory=PythonAPI'
Step3. Others
pip3 install cython_bbox
docker build -t bytetrack:latest .
# Startup sample
mkdir -p pretrained && \
mkdir -p YOLOX_outputs && \
xhost +local: && \
docker run --gpus all -it --rm \
-v $PWD/pretrained:/workspace/ByteTrack/pretrained \
-v $PWD/datasets:/workspace/ByteTrack/datasets \
-v $PWD/YOLOX_outputs:/workspace/ByteTrack/YOLOX_outputs \
-v /tmp/.X11-unix/:/tmp/.X11-unix:rw \
--device /dev/video0:/dev/video0:mwr \
--net=host \
-e XDG_RUNTIME_DIR=$XDG_RUNTIME_DIR \
-e DISPLAY=$DISPLAY \
--privileged \
bytetrack:latest
Download MOT17, MOT20, CrowdHuman, Cityperson, ETHZ and put them under
datasets
|——————mot
| └——————train
| └——————test
└——————crowdhuman
| └——————Crowdhuman_train
| └——————Crowdhuman_val
| └——————annotation_train.odgt
| └——————annotation_val.odgt
└——————MOT20
| └——————train
| └——————test
└——————Cityscapes
| └——————images
| └——————labels_with_ids
└——————ETHZ
└——————eth01
└——————...
└——————eth07
Then, you need to turn the datasets to COCO format and mix different training data:
cd <ByteTrack_HOME>
python3 tools/convert_mot17_to_coco.py
python3 tools/convert_mot20_to_coco.py
python3 tools/convert_crowdhuman_to_coco.py
python3 tools/convert_cityperson_to_coco.py
python3 tools/convert_ethz_to_coco.py
Before mixing different datasets, you need to follow the operations in mix_xxx.py to create a data folder and link. Finally, you can mix the training data:
cd <ByteTrack_HOME>
python3 tools/mix_data_ablation.py
python3 tools/mix_data_test_mot17.py
python3 tools/mix_data_test_mot20.py
Train on CrowdHuman and MOT17 half train, evaluate on MOT17 half val
Model | MOTA | IDF1 | IDs | FPS |
---|---|---|---|---|
ByteTrack_ablation [google], [baidu(code:eeo8)] | 76.6 | 79.3 | 159 | 29.6 |
Train on CrowdHuman, MOT17, Cityperson and ETHZ, evaluate on MOT17 train.
Model | MOTA | IDF1 | IDs | FPS |
---|---|---|---|---|
bytetrack_x_mot17 [google], [baidu(code:ic0i)] | 90.0 | 83.3 | 422 | 29.6 |
bytetrack_l_mot17 [google], [baidu(code:1cml)] | 88.7 | 80.7 | 460 | 43.7 |
bytetrack_m_mot17 [google], [baidu(code:u3m4)] | 87.0 | 80.1 | 477 | 54.1 |
bytetrack_s_mot17 [google], [baidu(code:qflm)] | 79.2 | 74.3 | 533 | 64.5 |
Model | MOTA | IDF1 | IDs | Params(M) | FLOPs(G) |
---|---|---|---|---|---|
bytetrack_nano_mot17 [google], [baidu(code:1ub8)] | 69.0 | 66.3 | 531 | 0.90 | 3.99 |
bytetrack_tiny_mot17 [google], [baidu(code:cr8i)] | 77.1 | 71.5 | 519 | 5.03 | 24.45 |
Train on CrowdHuman and MOT20, evaluate on MOT20 train.
Model | MOTA | IDF1 | IDs | FPS |
---|---|---|---|---|
bytetrack_x_mot20 [google], [baidu(code:3apd)] | 93.4 | 89.3 | 1057 | 17.5 |
The COCO pretrained YOLOX model can be downloaded from their model zoo. After downloading the pretrained models, you can put them under
cd <ByteTrack_HOME>
python3 tools/train.py -f exps/example/mot/yolox_x_ablation.py -d 8 -b 48 --fp16 -o -c pretrained/yolox_x.pth
cd <ByteTrack_HOME>
python3 tools/train.py -f exps/example/mot/yolox_x_mix_det.py -d 8 -b 48 --fp16 -o -c pretrained/yolox_x.pth
For MOT20, you need to clip the bounding boxes inside the image.
Add clip operation in line 134-135 in data_augment.py, line 122-125 in mosaicdetection.py, line 217-225 in mosaicdetection.py, line 115-118 in boxes.py.
cd <ByteTrack_HOME>
python3 tools/train.py -f exps/example/mot/yolox_x_mix_mot20_ch.py -d 8 -b 48 --fp16 -o -c pretrained/yolox_x.pth
First, you need to prepare your dataset in COCO format. You can refer to MOT-to-COCO or CrowdHuman-to-COCO. Then, you need to create a Exp file for your dataset. You can refer to the CrowdHuman training Exp file. Don't forget to modify get_data_loader() and get_eval_loader in your Exp file. Finally, you can train bytetrack on your dataset by running:
cd <ByteTrack_HOME>
python3 tools/train.py -f exps/example/mot/your_exp_file.py -d 8 -b 48 --fp16 -o -c pretrained/yolox_x.pth
Run ByteTrack:
cd <ByteTrack_HOME>
python3 tools/track.py -f exps/example/mot/yolox_x_ablation.py -c pretrained/bytetrack_ablation.pth.tar -b 1 -d 1 --fp16 --fuse
You can get 76.6 MOTA using our pretrained model.
Run other trackers:
python3 tools/track_sort.py -f exps/example/mot/yolox_x_ablation.py -c pretrained/bytetrack_ablation.pth.tar -b 1 -d 1 --fp16 --fuse
python3 tools/track_deepsort.py -f exps/example/mot/yolox_x_ablation.py -c pretrained/bytetrack_ablation.pth.tar -b 1 -d 1 --fp16 --fuse
python3 tools/track_motdt.py -f exps/example/mot/yolox_x_ablation.py -c pretrained/bytetrack_ablation.pth.tar -b 1 -d 1 --fp16 --fuse
Run ByteTrack:
cd <ByteTrack_HOME>
python3 tools/track.py -f exps/example/mot/yolox_x_mix_det.py -c pretrained/bytetrack_x_mot17.pth.tar -b 1 -d 1 --fp16 --fuse
python3 tools/interpolation.py
Submit the txt files to MOTChallenge website and you can get 79+ MOTA (For 80+ MOTA, you need to carefully tune the test image size and high score detection threshold of each sequence).
We use the input size 1600 x 896 for MOT20-04, MOT20-07 and 1920 x 736 for MOT20-06, MOT20-08. You can edit it in yolox_x_mix_mot20_ch.py
Run ByteTrack:
cd <ByteTrack_HOME>
python3 tools/track.py -f exps/example/mot/yolox_x_mix_mot20_ch.py -c pretrained/bytetrack_x_mot20.pth.tar -b 1 -d 1 --fp16 --fuse --match_thresh 0.7 --mot20
python3 tools/interpolation.py
Submit the txt files to MOTChallenge website and you can get 77+ MOTA (For higher MOTA, you need to carefully tune the test image size and high score detection threshold of each sequence).
See tutorials.
Suppose you have already got the detection results 'dets' (x1, y1, x2, y2, score) from other detectors, you can simply pass the detection results to BYTETracker (you need to first modify some post-processing code according to the format of your detection results in byte_tracker.py):
from yolox.tracker.byte_tracker import BYTETracker
tracker = BYTETracker(args)
for image in images:
dets = detector(image)
online_targets = tracker.update(dets, info_imgs, img_size)
You can get the tracking results in each frame from 'online_targets'. You can refer to mot_evaluators.py to pass the detection results to BYTETracker.
cd <ByteTrack_HOME>
python3 tools/demo_track.py video -f exps/example/mot/yolox_x_mix_det.py -c pretrained/bytetrack_x_mot17.pth.tar --fp16 --fuse --save_result
@article{zhang2022bytetrack,
title={ByteTrack: Multi-Object Tracking by Associating Every Detection Box},
author={Zhang, Yifu and Sun, Peize and Jiang, Yi and Yu, Dongdong and Weng, Fucheng and Yuan, Zehuan and Luo, Ping and Liu, Wenyu and Wang, Xinggang},
booktitle={Proceedings of the European Conference on Computer Vision (ECCV)},
year={2022}
}
A large part of the code is borrowed from YOLOX, FairMOT, TransTrack and JDE-Cpp. Many thanks for their wonderful works.