shvdiwnkozbw / Self-supervised-Video-Concept

Code for Static and Dynamic Concepts for Self-supervised Video Representation Learning.

About Cross Attention #3

Open BNU-IVC opened 3 weeks ago

BNU-IVC commented 3 weeks ago
        # attention
        q = q * self.scale  # scale queries before the dot product
        attn_logits = torch.einsum('bnd,bld->bln', q, k)  # (B, L, N)
        attn = self.softmax(attn_logits)  # softmax over the query (N) dimension
        attn = attn + 1e-8  # avoid zeros before the L1 normalization below
        attn = attn / attn.sum(dim=-2, keepdim=True)  # L1-normalize over locations (L)

        # update templates with the attention-weighted values
        templates = torch.einsum('bld,bln->bnd', v, attn) + templates_prev

Thanks for your contribution. I'm confused about the cross attention in the local transformer. It seems that the softmax is applied along the query dimension rather than the key dimension, so it has no aggregation effect over the features. Is this an error, or is there a specific principle behind it?

shvdiwnkozbw commented 3 weeks ago

Hi, the softmax normalization over the query dimension is inherited from slot attention [1]. It introduces competition between the queries, encouraging different query tokens to take over different feature components and thereby discriminate different visual semantics or concepts. In our architecture, we follow this scheme to update the concept-related queries.

[1] Locatello, Francesco, et al. "Object-centric learning with slot attention." Advances in neural information processing systems 33 (2020): 11525-11538.
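
For anyone else reading this thread, below is a minimal, self-contained sketch of the normalization scheme discussed above. It is not the repo's exact module: the function and tensor names are illustrative, and the shapes are inferred from the einsum strings in the quoted snippet.

```python
import torch
import torch.nn.functional as F

def slot_style_cross_attention(q, k, v, templates_prev, eps=1e-8):
    """Slot-attention-style update: softmax over queries, L1 norm over locations.

    q:              (B, N, D) concept/slot queries
    k, v:           (B, L, D) keys and values from L feature locations
    templates_prev: (B, N, D) query states from the previous iteration
    """
    scale = q.shape[-1] ** -0.5
    attn_logits = torch.einsum('bnd,bld->bln', q * scale, k)  # (B, L, N)

    # Softmax over the query (N) axis: each location distributes its attention
    # mass across the queries, so queries compete for features, unlike standard
    # cross attention where the softmax runs over the key/location axis.
    attn = F.softmax(attn_logits, dim=-1)

    # Re-normalize over locations (L1 along the L axis) so each query's
    # weights sum to 1 before aggregating values; eps avoids division by zero.
    attn = attn + eps
    attn = attn / attn.sum(dim=-2, keepdim=True)

    # Weighted mean of the values per query, plus a residual connection.
    return torch.einsum('bld,bln->bnd', v, attn) + templates_prev


# Example shapes (illustrative only):
q = templates_prev = torch.randn(2, 8, 64)   # 8 concept queries, dim 64
k = v = torch.randn(2, 196, 64)              # 196 spatio-temporal tokens
out = slot_style_cross_attention(q, k, v, templates_prev)
print(out.shape)  # torch.Size([2, 8, 64])
```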