LinearAttention2 uses the reduced form Q(KV) and therefore cannot support attn_drop, because there is no explicit attention map. attn_drop drops elements of the attention map QK, which is only available in LinearAttention computed in the (QK)V form. In addition, I think attn_drop is not really needed in linear attention, which already does not produce sharply focused attention.
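A minimal PyTorch sketch of the difference (illustrative names only, kernel feature maps and normalization omitted; this is not the repository's exact code):

```python
import torch

def linear_attention_qk_v(q, k, v, attn_drop=None):
    # (QK)V form: the N x N attention map is materialized,
    # so dropout can be applied to its elements.
    attn = q @ k.transpose(-2, -1)        # (B, H, N, N) explicit map
    if attn_drop is not None:
        attn = attn_drop(attn)            # attn_drop acts on the map
    return attn @ v                       # (B, H, N, D)

def linear_attention_q_kv(q, k, v):
    # Q(KV) reduced form: only a D x D summary of K and V is built;
    # the N x N map never exists, so there is nothing for attn_drop to act on.
    kv = k.transpose(-2, -1) @ v          # (B, H, D, D)
    return q @ kv                         # (B, H, N, D)
```

By associativity the two forms give the same output, but only the first exposes the map that attn_drop needs.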
We do not reconstruct attention maps (QK in Attention modules); we reconstruct features. The idea of loose reconstruction is not about what is input to the decoder. It is about the reconstruction subject (decoder features) and object (encoder features). Instead of letting each decoder layer reconstruct its corresponding encoder layer, as is conventional, we propose averaging the decoder layers to reconstruct the average of the encoder layers. There is no fuse layer (the .fuse function in the code is simply an average).
E.g., previous reconstruction paradigm: D_1 ==> E_1, D_2 ==> E_2, D_3 ==> E_3, D_4 ==> E_4
Ours: (D_1 + D_2 + D_3 + D_4) ==> (E_1 + E_2 + E_3 + E_4)
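A minimal sketch of the two loss paradigms, assuming the per-layer decoder and encoder features have matching shapes (function names and the MSE criterion are illustrative, not the repo's exact code):

```python
import torch
import torch.nn.functional as F

def fuse(feats):
    # "fuse" is simply an average over layers; there is no learnable fusion module.
    return torch.stack(feats, dim=0).mean(dim=0)

def strict_reconstruction_loss(dec_feats, enc_feats):
    # Conventional paradigm: D_i reconstructs E_i, layer by layer.
    return sum(F.mse_loss(d, e) for d, e in zip(dec_feats, enc_feats)) / len(dec_feats)

def loose_reconstruction_loss(dec_feats, enc_feats):
    # Loose paradigm: the averaged decoder features reconstruct
    # the averaged encoder features.
    return F.mse_loss(fuse(dec_feats), fuse(enc_feats))
```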
Can you explain why self.attn_drop is not used in LinearAttention2? Also, could you simplify the concept of loose reconstruction? I understand it as using a fuse layer to combine low-level and high-level attention maps to reconstruct the attention map in a non-local manner, but I don't fully grasp the concept.