timmeinhardt / trackformer

Implementation of "TrackFormer: Multi-Object Tracking with Transformers". [Conference on Computer Vision and Pattern Recognition (CVPR), 2022]
https://arxiv.org/abs/2101.02702
Apache License 2.0

How do you select the initial track queries from the object queries? #65

Closed Tsunehiko closed 1 year ago

Tsunehiko commented 1 year ago

Thank you for the wonderful work. I have read the paper and code, and have a question about track query initialization.

How do you select the initial track queries from the object queries during evaluation? The paper states:

Each valid object detection {b_0^0, b_0^1, ...} with a classification score above σ_object, i.e., output embedding not predicting the background class (crossed), initializes a new track query embedding.

After reading this, I expected object queries with non-zero class labels to be added as new track queries. However, the code appears to keep only those whose label equals 0.

new_det_keep = torch.logical_and(
   new_det_scores > self.detection_obj_score_thresh,
   result['labels'][-self.num_object_queries:] == 0)

I believe the paper's description is correct, but I cannot follow this implementation. Could you please explain what it does? If I have picked out the wrong part of the code, please point me to the right one.

timmeinhardt commented 1 year ago

You got the labels mixed up. Zero is the label for person, and with focal loss we do not have a dedicated background label. Every query outputs binary per-class predictions, and the threshold self.detection_obj_score_thresh decides whether it is considered background or not.
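The scoring scheme described above can be illustrated with a minimal sketch. It assumes each query emits one logit per class, that each logit goes through an independent sigmoid (as is usual with focal loss), and that the query's score is its highest per-class probability. The function names and the example threshold are hypothetical and are not taken from the repository.

```python
import math


def sigmoid(x: float) -> float:
    """Map a logit to an independent per-class probability."""
    return 1.0 / (1.0 + math.exp(-x))


def score_and_label(class_logits):
    """Return (score, label) for one query.

    There is no background entry: the score is the highest per-class
    sigmoid probability and the label is the index of that class.
    """
    probs = [sigmoid(z) for z in class_logits]
    label = max(range(len(probs)), key=probs.__getitem__)
    return probs[label], label


def is_foreground(class_logits, score_thresh):
    """A query counts as background when no class probability exceeds the threshold."""
    score, _ = score_and_label(class_logits)
    return score > score_thresh


# Hypothetical logits for two queries; index 0 plays the role of "person".
confident_person = [2.0, -3.0, -1.0]
nothing_detected = [-4.0, -5.0, -3.0]

print(score_and_label(confident_person))   # label 0 with a high score
print(is_foreground(nothing_detected, 0.4))  # all probabilities are low
```

Under these assumptions, "background" is not a class the network predicts; it is simply the outcome of every class probability falling below the threshold.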

Tsunehiko commented 1 year ago

Thank you for your reply. I understand the background class now, but I don't see why you use result['labels'][-self.num_object_queries:] == 0. Is it to restrict the queries to the person class? (If so, I don't understand why the number of classes is set to a value as large as 20 for MOT17.)

timmeinhardt commented 1 year ago

The project targets pedestrian/person tracking, which is why we only allow those predictions. However, we trained with more than a single output neuron, as this is supposed to stabilize training. Effectively, though, we only use the very first neuron.
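The combined filter from the question can be sketched in plain Python as follows. It mirrors the torch.logical_and expression quoted earlier, but on ordinary lists. The helper name, the PERSON_LABEL constant, and the example values are assumptions made for illustration and do not appear in the repository.

```python
PERSON_LABEL = 0  # assumption: index of the first output neuron, i.e. "person"


def select_new_track_queries(scores, labels, score_thresh):
    """Return indices of object queries that would start a new track.

    A query is kept only if its score exceeds the threshold AND its
    predicted label is the person class, matching the two conditions
    joined by logical_and in the snippet from the question.
    """
    return [
        i
        for i, (score, label) in enumerate(zip(scores, labels))
        if score > score_thresh and label == PERSON_LABEL
    ]


# Hypothetical per-query outputs for four object queries.
scores = [0.90, 0.80, 0.20, 0.95]
labels = [0, 3, 0, 0]

kept = select_new_track_queries(scores, labels, score_thresh=0.4)
print(kept)  # → [0, 3]
```

Here query 1 is dropped despite its high score because its label is not 0, and query 2 is dropped because its score is below the threshold, so only confident person predictions initialize track queries.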

Tsunehiko commented 1 year ago

I understand now. Thank you for your detailed explanation.