timmeinhardt / trackformer

Implementation of "TrackFormer: Multi-Object Tracking with Transformers". [Conference on Computer Vision and Pattern Recognition (CVPR), 2022]
https://arxiv.org/abs/2101.02702
Apache License 2.0

settings of multi-frame and number of classes (20) #50

Open liuqk3 opened 2 years ago

liuqk3 commented 2 years ago

Hi @timmeinhardt , thanks for your great work!

After checking the code, I found that (1) the number of classes is set to 20 even though only the person class is tracked; (2) multi-frame attention is performed, but no discussion of it is provided in the paper or this repo. Here come the questions: why did you set the number of classes to 20? Does the multi-frame attention contribute a lot to the performance gain?

Thanks.

timmeinhardt commented 2 years ago

(1): Computing the focal loss for a single class introduces some noise, which we found to be reduced if we increase the number of classes. The number 20 is a bit arbitrary here. (2): We mention the multi-frame attention in the implementation details. However, since it is not a key element of our contribution or of how track queries work, we did not provide an in-depth discussion. In particular, identity preservation (IDF1 and ID switches) benefits a lot from it.
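To make point (1) concrete, here is a minimal sketch of how a sigmoid focal loss sees the extra class slots. The helper below is a standard sigmoid focal loss, not the repo's exact implementation; with 20 class slots but only the person class ever annotated, the 19 unused slots only ever contribute easy negative terms.

```python
import torch
import torch.nn.functional as F

def sigmoid_focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    # Standard sigmoid focal loss: each class is an independent
    # binary classification problem.
    prob = logits.sigmoid()
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = prob * targets + (1 - prob) * (1 - targets)
    loss = ce * ((1 - p_t) ** gamma)
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * loss).mean()

torch.manual_seed(0)
num_queries, num_classes = 4, 20
logits = torch.randn(num_queries, num_classes)

# Only class 0 ("person") ever appears as a positive target;
# the remaining 19 slots are always negative.
targets = torch.zeros(num_queries, num_classes)
targets[:2, 0] = 1.0  # two queries matched to a person box

loss = sigmoid_focal_loss(logits, targets)
```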

liuqk3 commented 2 years ago

Hi @timmeinhardt ,

Thanks for your great work!

I found that the number of classes is incremented by 1 in the definition of the classifier. https://github.com/timmeinhardt/trackformer/blob/d62d81023dbffb4a1820db39ce527b66df6d7b61/src/trackformer/models/detr.py#L37

In the post-processing, the appended class is removed only at this line, but the loss computation does not remove it. https://github.com/timmeinhardt/trackformer/blob/d62d81023dbffb4a1820db39ce527b66df6d7b61/src/trackformer/models/detr.py#L476

I am curious why you chose this setting. Does it have an influence on the tracking performance?

Another question: the model is optimized with a focal loss, which implies that a sigmoid is used to activate the predicted logits. However, in the post-processing, a softmax is applied to the logits to obtain the prediction scores. Is this a bug, or a deliberate design based on your experimental findings?

timmeinhardt commented 2 years ago

The additional class is added for background prediction in the original DETR formulation, which means it is also included in the loss. However, when running with focal loss, i.e., in the Deformable DETR formulation, we do not add the additional class. See this line, where we subtract from the number of classes for focal loss:

https://github.com/timmeinhardt/trackformer/blob/d62d81023dbffb4a1820db39ce527b66df6d7b61/src/trackformer/models/__init__.py#L34
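The net effect of the two lines linked above can be sketched as follows; the function name and flag are illustrative, not the repo's actual code:

```python
def classifier_out_dim(num_classes: int, focal_loss: bool) -> int:
    """Sketch of the class-count bookkeeping described above."""
    if focal_loss:
        # Deformable-DETR style: per-class sigmoid scores,
        # so no explicit background logit is needed.
        return num_classes
    # Original DETR: softmax over all classes plus one background slot,
    # which also enters the loss.
    return num_classes + 1

dim_focal = classifier_out_dim(20, focal_loss=True)    # 20
dim_softmax = classifier_out_dim(20, focal_loss=False) # 21
```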

Your second question is also related to the difference between DETR and Deformable DETR. When running the latter with focal loss, we do not apply a softmax in the post-processing. See this module:

https://github.com/timmeinhardt/trackformer/blob/d62d81023dbffb4a1820db39ce527b66df6d7b61/src/trackformer/models/deformable_detr.py#L286
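The two post-processing paths can be contrasted in a toy example. The shapes and slicing here are assumptions for illustration (in the real focal-loss path there is no background slot at all); the point is that both transforms preserve the per-query argmax, but the score values they produce differ, which matters for any score thresholding.

```python
import torch

torch.manual_seed(0)
num_queries, num_logits = 5, 21  # 20 classes + 1 background slot
logits = torch.randn(1, num_queries, num_logits)

# Original DETR path: softmax over all logits, then drop the
# trailing background column before taking the per-query maximum.
softmax_scores, softmax_labels = logits.softmax(-1)[..., :-1].max(-1)

# Deformable-DETR / focal-loss path: plain sigmoid, no background slot.
sigmoid_scores, sigmoid_labels = logits[..., :-1].sigmoid().max(-1)

# Both transforms are monotone in the logits, so the predicted labels
# agree, but the score magnitudes (and thus thresholds) do not.
assert bool((softmax_labels == sigmoid_labels).all())
```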