megvii-research / MOTRv2

[CVPR2023] MOTRv2: Bootstrapping End-to-End Multi-Object Tracking by Pretrained Object Detectors

Question about track and object queries #7

Open owen24819 opened 1 year ago

owen24819 commented 1 year ago

Hi, really nice work!

I am curious how the transformer is able to differentiate between the track and object queries. I understand that you use TAN to update the track queries, but this does not process the object queries. How is the network able to figure out that a track query needs to track an object and has priority over the object queries? Does the network learn to produce decoder output embeddings that code for "track queries"?

My second question is why you don't use the attention_mask to prevent information leakage from the general predictions to the noised track queries. I assume that if the noised track queries see the rest of the track queries, it will compromise their ability to track objects, due to the NMS that is performed by the self-attention module in the decoder.

https://github.com/megvii-research/MOTRv2/blob/be49b7336218e470c9ebcd34be54fe7eec702675/models/motr.py#L530

Thanks, Owen

fengxiuyaun commented 1 year ago

  1. Object queries come from scale + learned weight, while track queries come from QIM. The network learns the difference between them.
  2. I think you are right; we need to add attn_mask[:n_dt, n_dt:] = True

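For readers following along, here is a minimal sketch of what the suggested mask does. It is pure Python, not taken from the MOTRv2 code. It assumes the convention used by boolean masks in torch.nn.MultiheadAttention (True means the query row may not attend to that key column), and it treats n_dt simply as the size of the first query group. The helper names build_group_mask and masked_softmax are hypothetical.

```python
import math


def build_group_mask(n_total, n_dt):
    """Boolean self-attention mask; True means the query row may NOT attend to the key column."""
    mask = [[False] * n_total for _ in range(n_total)]
    for q in range(n_dt):
        for k in range(n_dt, n_total):
            # Same effect as attn_mask[:n_dt, n_dt:] = True on a tensor.
            mask[q][k] = True
    return mask


def masked_softmax(scores, mask_row):
    """Softmax over one row of attention scores; blocked keys get zero weight."""
    kept = [s for s, blocked in zip(scores, mask_row) if not blocked]
    peak = max(kept)  # subtract the max for numerical stability
    exps = [0.0 if blocked else math.exp(s - peak)
            for s, blocked in zip(scores, mask_row)]
    total = sum(exps)
    return [e / total for e in exps]


# Illustrative sizes only: 5 queries in total, the first 3 form the first group.
mask = build_group_mask(n_total=5, n_dt=3)

# Query 0 has high raw scores for keys 3 and 4, but the mask removes them.
weights = masked_softmax([0.2, 1.0, -0.5, 3.0, 2.0], mask[0])
```

Note that the mask is one-directional: the first n_dt queries cannot see the remaining ones, while the remaining queries can still attend to everything.
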
owen24819 commented 1 year ago

Thanks for the quick response. I think I understand it now. Is self.yolox_embed.weight the "scale" that is done to the object queries?

Also, I just realized you removed the content queries for the noised track queries and replaced them with self.refine_embed. I assume this was done to let the network know that these are noised track queries, not regular track queries? I figured it would have a tough time tracking objects without the content queries, using just positional information, but you show it works quite well, which is cool.