lucidrains / performer-pytorch

An implementation of Performer, a linear attention-based transformer, in Pytorch
MIT License

Replacing Attention module of Vision Transformer with SelfAttention Module of Performer? #48

Open PascalHbr opened 3 years ago

PascalHbr commented 3 years ago

Hey, thanks for your great work, I love it! :) A quick question: in your repo for the Vision Transformer (https://github.com/lucidrains/vit-pytorch) there is a module called Attention. Can I simply use the Vision Transformer and replace the Attention module with the SelfAttention module from the Performer?

lucidrains commented 3 years ago

@PascalHbr hey Pascal! indeed you can! there are actually research groups already investigating this type of attention (linear attention) for vision tasks: https://github.com/lucidrains/lambda-networks and https://github.com/lucidrains/global-self-attention-network. that said, I wouldn't try Performer on vision tasks just yet

NZ42 commented 3 years ago

Hey lucidrains, I'm also interested in applying the Performer to vision. Can I ask why you wouldn't try it just yet?

lucidrains commented 3 years ago

@NZ42 actually, I missed the section on ImageNet in the paper. ok, I take it back, maybe it is worth trying!

NZ42 commented 3 years ago

Thank you for the quick reply. In all honesty, I'm interested in substituting the self-attention of vision transformers with FAVOR+. I see that in your other repo you use the Linformer. Do you have any tips on how best to approach this? I'm also looking into substituting it in pretrained models from timm.
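For anyone wanting to see what FAVOR+ does before wiring it into a model, here is a self-contained sketch in plain PyTorch of the positive random-feature approximation to softmax attention from the Performer paper (function names and the feature count `m` are illustrative, not from this repo; the real implementation adds orthogonal projections, feature redrawing, and causal variants):

```python
import torch

def favor_softmax_features(x, proj, eps = 1e-6):
    # positive random features for the softmax kernel:
    # phi(x) = exp(W x - |x|^2 / 2) / sqrt(m), with rows of W ~ N(0, I)
    # x: (batch, seq, d), proj: (m, d)
    x = x * (x.shape[-1] ** -0.25)               # split the 1/sqrt(d) scaling between q and k
    wx = x @ proj.t()                            # (batch, seq, m)
    sq = (x ** 2).sum(dim = -1, keepdim = True) / 2
    return torch.exp(wx - sq) / (proj.shape[0] ** 0.5) + eps

def linear_attention(q, k, v, proj):
    # O(n) attention: phi(q) (phi(k)^T v), row-normalized,
    # instead of materializing the (n x n) softmax matrix
    qp = favor_softmax_features(q, proj)         # (batch, seq, m)
    kp = favor_softmax_features(k, proj)
    context = kp.transpose(-2, -1) @ v           # (batch, m, d_v)
    norm = qp @ kp.sum(dim = -2, keepdim = True).transpose(-2, -1)  # (batch, seq, 1)
    return (qp @ context) / norm

# usage: random q, k, v and a Gaussian projection with m = 256 features
b, n, d, m = 2, 128, 64, 256
q, k, v = (torch.randn(b, n, d) for _ in range(3))
proj = torch.randn(m, d)
out = linear_attention(q, k, v, proj)            # (2, 128, 64)
```

The key point for the substitution question: the memory and compute cost is linear in the sequence length `n` (here, the number of image patches), since only `(m, d)`-sized intermediates are formed.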

lucidrains commented 3 years ago

@NZ42 You just need to plug the Performer instance into the efficient wrapper https://github.com/lucidrains/vit-pytorch#efficient-attention

pzzhang commented 3 years ago

@lucidrains I recently used your implementations of performer (https://github.com/microsoft/vision-longformer/blob/main/src/models/layers/performer.py) and of linformer (https://github.com/microsoft/vision-longformer/blob/main/src/models/layers/linformer.py) to compare different efficient attention mechanisms on image classification and object detection tasks. See the results reported here: https://github.com/microsoft/vision-longformer. Thank you for your excellent open-source code!

@PascalHbr @NZ42 You may be interested in the results, too.