Open owen24819 opened 1 year ago
Thanks for the quick response. I think I understand it now. Is self.yolox_embed.weight
the "scale" that is applied to the object queries?
Also, I just realized you removed the content queries for the noised track queries and replaced them with self.refine_embed. I assume this was done to let the network know that these are noised track queries rather than regular track queries? I figured it would have a tough time tracking objects from positional information alone, without the content queries, but you show it works quite well, which is cool.
Hi, really nice work!
I am curious how the transformer is able to differentiate between the track and object queries. I understand that you use TAN to update the track queries, but it does not process the object queries. How does the network figure out that a track query needs to keep tracking an object and takes priority over the object queries? Does the network learn to produce decoder output embeddings that encode "this is a track query"?
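To make the question concrete, here is a minimal, hypothetical sketch (not the MOTRv2 source) of what I understand the decoder input to look like: track queries and object queries are just concatenated into one sequence, with nothing structural marking which is which, so any distinction has to be learned from the query content/position itself.

```python
# Hypothetical illustration, not actual MOTRv2 code.
# Track queries carry state from the previous frame; object queries are
# fresh (learned or proposal-seeded). They share one self-attention pass.

track_queries = [[0.9, 0.1], [0.2, 0.8]]               # carried over from frame t-1
object_queries = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]  # new detection queries

# One joint sequence: decoder self-attention sees all queries together,
# so object queries can "defer" to track queries for already-tracked
# objects (the implicit NMS effect of self-attention).
decoder_input = track_queries + object_queries
```

If this picture is right, the differentiation would have to come entirely from the learned embeddings, which is what my question is about.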
My second question is why you don't use the attention mask to prevent information leakage from the regular predictions to the noised track queries. I assume that if the noised track queries can see the rest of the track queries, it will compromise their ability to track objects, due to the NMS performed by the self-attention module in the decoder.
https://github.com/megvii-research/MOTRv2/blob/be49b7336218e470c9ebcd34be54fe7eec702675/models/motr.py#L530
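For reference, here is a rough sketch of the kind of DN-DETR-style mask I had in mind: it blocks attention between the noised group and the regular queries in both directions. This is my own hypothetical construction, not code from the repo; the function name and layout are assumptions.

```python
# Hypothetical sketch of a DN-DETR-style decoder self-attention mask,
# not actual MOTRv2 code. Convention: mask[i][j] == True means query i
# is BLOCKED from attending to query j.

def build_attention_mask(num_noised: int, num_regular: int):
    """Block attention across the noised/regular boundary in both directions."""
    n = num_noised + num_regular
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            # i and j are in different groups exactly when one index is
            # below num_noised and the other is not.
            if (i < num_noised) != (j < num_noised):
                mask[i][j] = True  # block cross-group attention
    return mask

# Noised track queries occupy indices 0..num_noised-1, regular queries the rest.
mask = build_attention_mask(num_noised=2, num_regular=3)
```

With a mask like this, the regular queries cannot peek at the GT-derived noised queries, and the noised queries cannot read the regular predictions, which is the leakage my question refers to.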
Thanks, Owen