BNU-IVC opened this issue 3 weeks ago
Thanks for your contribution. I'm confused about the cross-attention in the local transformer. The softmax seems to be applied along the query dimension, so it has no aggregation effect over the keys. Is this an error, or is there a specific principle behind it?

Hi, this operation, softmax normalization over the query dimension, is inherited from Slot Attention [1]. It introduces competition between the different queries, encouraging different query tokens to take over different feature components and thus potentially discriminate between different visual semantics or concepts. In our architecture, we follow this approach to update the concept-related queries.

[1] Locatello, Francesco, et al. "Object-centric learning with slot attention." Advances in Neural Information Processing Systems 33 (2020): 11525-11538.
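To make the inverted normalization concrete, here is a minimal PyTorch sketch of a query-axis softmax in the style of Slot Attention. The function name, tensor shapes, and the shared key/value features are illustrative assumptions, not the repository's actual code:

```python
import torch
import torch.nn.functional as F

def slot_style_cross_attention(queries, features, eps=1e-8):
    """Cross-attention with softmax over the *query* (slot) axis,
    following Slot Attention (Locatello et al., 2020).

    queries:  (B, Nq, D)  concept/slot queries (hypothetical shapes)
    features: (B, Nk, D)  visual feature tokens (keys == values here)
    """
    d = queries.shape[-1]
    # Scaled dot-product attention logits: (B, Nq, Nk)
    logits = torch.einsum('bqd,bkd->bqk', queries, features) / d ** 0.5

    # Softmax over the QUERY axis (dim=1), not the key axis: for each
    # feature token, the queries compete for it, so the attention mass
    # on a token is split among queries rather than summing to 1 per query.
    attn = F.softmax(logits, dim=1)

    # Renormalize over keys (weighted mean) so each query's update is a
    # convex combination of the feature tokens it won.
    attn = attn / (attn.sum(dim=-1, keepdim=True) + eps)

    # Aggregate features into updated queries: (B, Nq, D)
    updates = torch.einsum('bqk,bkd->bqd', attn, features)
    return updates, attn

# Usage: 8 concept queries attending over 196 patch tokens
q = torch.randn(2, 8, 64)
f = torch.randn(2, 196, 64)
updates, attn = slot_style_cross_attention(q, f)
print(updates.shape)  # torch.Size([2, 8, 64])
```

Note the second normalization over the key axis: Slot Attention follows the query-axis softmax with a weighted mean so that aggregation over the features still happens, just with competitively assigned weights, and each query's update stays well-scaled even when it wins only a few tokens.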