Question regarding the benchmark

kennyvoo commented 4 months ago

Hello, thank you for the great work. I hope to clarify a few questions.

Is the result from Table IV an average of the following process?

Therefore, this study adopts a 7-fold
cross-validation approach, where one video is designated as
the validation set and the remaining six serve as the training
set in each iteration of the validation process.

Does the result of SFSORT in Table IV include post processing?

If I directly use the provided yolov8n and evaluate on the MOT17 train set, the MOTA difference between SFSORT and ByteTrack is about 10 points. Is this normal, or did I miss a configuration option or step? I'll also check whether I made a mistake somewhere. Attached: bytetrack_trackeval.txt, SFSORT_trackeval.txt

Tracker	HOTA	MOTA	IDF1
SFSORT	57.134	55.619	65.488
Bytetrack	60.498	65.208	69.603

I'm using the default configuration and only updated the frame rate and the frame size for each video.

gitmehrdad commented 4 months ago

Hi, thank you for your questions.

The main goal of validation processes has always been to maximize the HOTA. MOTA is heavily reliant on the quality of detection and is not a fair measure for assessing tracking accuracy. The reported results represent the average of the outcomes observed during the validation stages.

In Table IV, the results after post-processing are reported because otherwise, there would be few items left in the table for comparison.

Based on my research, YOLOX outperforms YOLOv8 for high-accuracy tracking, as evidenced by the findings presented in Table III and Figure 8 of the paper. However, if you require a very fast tracker, particularly for tracking fast-moving objects like a golf ball, I recommend utilizing YOLOv8.
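
On the first point (results averaged over the validation stages), the leave-one-video-out scheme quoted from the paper can be sketched roughly as follows. This is only an illustration, not the validation code used for the paper: the sequence names are the seven MOT17 train videos, while `leave_one_video_out` and `average_metric` are hypothetical helper names.

```python
# The seven MOT17 train sequences; each one serves as the validation video once.
SEQUENCES = [
    "MOT17-02", "MOT17-04", "MOT17-05", "MOT17-09",
    "MOT17-10", "MOT17-11", "MOT17-13",
]


def leave_one_video_out(sequences):
    """Yield (train_videos, validation_video) pairs, one per fold.

    Hypothetical helper: a fold holds out exactly one video for validation
    and keeps the remaining videos for training/tuning.
    """
    for i, validation_video in enumerate(sequences):
        train_videos = sequences[:i] + sequences[i + 1:]
        yield train_videos, validation_video


def average_metric(per_fold_scores):
    """Plain arithmetic mean of a metric (e.g. HOTA) over the folds."""
    return sum(per_fold_scores) / len(per_fold_scores)


folds = list(leave_one_video_out(SEQUENCES))
for train_videos, validation_video in folds:
    print(f"validate on {validation_video}, tune on {len(train_videos)} videos")
```

The per-fold HOTA values would then be passed to `average_metric` to obtain a single reported number.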

kennyvoo commented 4 months ago

Thank you for the prompt reply! This is excellent work! Even without post-processing, it gets very good results after hyperparameter tuning. Could you please advise whether there is any mistake in the following results, or any further improvement I can make?

Following the guide

LTH and MTH2 are hyperparameters influenced
by the behavior of the object detector

both HTH and MTH1 should be decreased
to achieve an adaptive tracker

Therefore, NTH should be increased to
prevent identity switching.

I've changed the following parameters to try to improve HOTA (just trying out a few different values):

	Param	original
high_th	0.82	0.7
match_th_first	0.5	0.6
match_th_second	0.1	0.4
low_th	0.3	0.2
new_track_th	0.7	0.5

If the objective is to increase HOTA and IDF1, reducing NTH seems to bring quite a significant boost to both, but at the expense of more ID switches. So, does it make more sense to keep the ID switches lower and accept a lower HOTA?

Tracker	HOTA	MOTA	IDF1	IDs
Bytetrack	60.498	65.208	69.603	552
SFSORT (default)	57.134	55.619	65.488	463
SFSORT (new + NTH=0.7)	58.36	58.498	66.992	391
SFSORT (new + NTH=0.6)	61.644	66.384	70.69	575
SFSORT (new + NTH=0.5)	61.963	68.427	70.784	714

One additional question: how do you measure the FPS?

In my experiment, measuring only the predict function for both SFSORT and Bytetrack v1, the difference is only around 4x.

gitmehrdad commented 4 months ago

Thank you for the information you shared. The default configuration of SFSORT is aligned with YOLOX. As you mentioned, it's advisable to adjust the hyperparameters when employing a new object detector.

During my studies, I've discovered that the choice of object detector greatly influences tracking accuracy. While reaching a HOTA above 90% on the MOT17 dataset might seem challenging due to its specific assumptions, I've observed HOTA values nearing 95% on videos sourced from other datasets, thanks to meticulous fine-tuning of the object detector. To attain higher accuracies on MOT17 and MOT20, a time-consuming yet highly effective method is to fine-tune cutting-edge object detectors, like YOLOv9, using diverse human image datasets. ByteTrack employed this strategy with YOLOX, which greatly contributed to its success.

The preference between IDs and HOTA depends on the application. In offline tracking scenarios where ID correction can be facilitated through post-processing, higher HOTA is often preferred. Conversely, in tracking crowded scenes, giving higher priority to the IDs metric (fewer identity switches) may be preferred.
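
One way to make this trade-off concrete is to pick the highest-HOTA configuration that stays within an ID-switch budget chosen for the application. The sketch below uses the SFSORT rows from the table posted earlier in this thread; the `Run` type, the `best_by_hota` helper, and the budget values are hypothetical and only illustrate the idea.

```python
from typing import NamedTuple, Optional, Sequence


class Run(NamedTuple):
    name: str
    hota: float
    mota: float
    idf1: float
    id_switches: int


# Figures copied from the results table earlier in this thread.
RUNS = [
    Run("SFSORT (default)", 57.134, 55.619, 65.488, 463),
    Run("SFSORT (new + NTH=0.7)", 58.36, 58.498, 66.992, 391),
    Run("SFSORT (new + NTH=0.6)", 61.644, 66.384, 70.69, 575),
    Run("SFSORT (new + NTH=0.5)", 61.963, 68.427, 70.784, 714),
]


def best_by_hota(runs: Sequence[Run],
                 max_id_switches: Optional[int] = None) -> Optional[Run]:
    """Return the run with the highest HOTA among those within the budget.

    Hypothetical helper: with no budget, every run is eligible; if no run
    satisfies the budget, None is returned.
    """
    eligible = [
        r for r in runs
        if max_id_switches is None or r.id_switches <= max_id_switches
    ]
    return max(eligible, key=lambda r: r.hota) if eligible else None


unconstrained = best_by_hota(RUNS)
crowded_scene = best_by_hota(RUNS, max_id_switches=400)
print(unconstrained.name, "|", crowded_scene.name)
```

With no budget the NTH=0.5 run is selected, while a budget of 400 ID switches selects the NTH=0.7 run.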

When calculating the tracking speed, I measured it from the moment the detections were delivered to the tracker until the IDs were received from the tracker, following the advice provided on the MOTChallenge website. Because background activity from the OS, server, IDE, etc. can affect measurement accuracy, the speed reported in the paper is the average obtained after several repetitions of the experiment. To measure the tracking speed, I used Python's "time" module and "timedelta" (from the "datetime" module).
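
A minimal sketch of that kind of measurement is shown below, assuming the detections are precomputed so that only the tracker call is timed. It is not the script used for the paper: `measure_tracker_fps` and `update_fn` are hypothetical names, and `time.perf_counter` is used here as one reasonable clock choice.

```python
import time
from datetime import timedelta


def measure_tracker_fps(update_fn, detections_per_frame, repetitions=5):
    """Time only the tracker call and average FPS over several passes.

    update_fn is a hypothetical stand-in for the tracker's per-frame update;
    detections_per_frame holds precomputed detections, one entry per frame.
    Returns (mean FPS over repetitions, mean per-frame latency as timedelta).
    """
    if not detections_per_frame or repetitions < 1:
        raise ValueError("need at least one frame and one repetition")

    fps_values = []
    total_seconds = 0.0
    for _ in range(repetitions):
        elapsed = 0.0
        for detections in detections_per_frame:
            start = time.perf_counter()          # detections handed to tracker
            update_fn(detections)
            elapsed += time.perf_counter() - start  # IDs received back
        fps_values.append(len(detections_per_frame) / elapsed)
        total_seconds += elapsed

    mean_fps = sum(fps_values) / len(fps_values)
    frames_timed = repetitions * len(detections_per_frame)
    mean_latency = timedelta(seconds=total_seconds / frames_timed)
    return mean_fps, mean_latency


# Usage with a dummy "tracker" that just sorts its input.
fps, latency = measure_tracker_fps(
    lambda dets: sorted(dets * 200), [[3, 1, 2]] * 20, repetitions=3
)
print(f"{fps:.0f} FPS, {latency.total_seconds() * 1e6:.1f} microseconds per frame")
```

Timing the detector and the tracker together, or timing a single pass only, would give different numbers, which may explain part of the gap between independently measured speed-ups.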

gitmehrdad / SFSORT