lucidrains / performer-pytorch

An implementation of Performer, a linear attention-based transformer, in Pytorch
MIT License

Replicating nn.MultiheadAttention with multiple Performer SelfAttention modules #91

Open JGittles opened 1 year ago

JGittles commented 1 year ago

As the title says, has anyone tried replacing the multi-head attention in a typical transformer with the SelfAttention module described in this library?

My thought was that I could essentially concatenate the outputs of several single-head self-attention modules to replicate multi-head attention, per the attached diagram from the PyTorch website (rough sketch below).
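For concreteness, here is a minimal sketch of that idea, assuming the `SelfAttention` module behaves as shown in this repo's README (it takes a `(batch, seq_len, dim)` tensor); the wrapper name `ConcatHeadAttention`, the per-head `dim_head` choice, and the final projection are mine, not something the library prescribes:

```python
import torch
from torch import nn
from performer_pytorch import SelfAttention

class ConcatHeadAttention(nn.Module):
    # Illustrative only: run several single-head Performer SelfAttention modules
    # in parallel, concatenate their outputs, then project back to the model
    # dimension, mirroring the "concat heads, then linear" step in the diagram.
    def __init__(self, dim, num_heads):
        super().__init__()
        assert dim % num_heads == 0, 'dim must be divisible by num_heads'
        self.heads = nn.ModuleList([
            SelfAttention(dim = dim, heads = 1, dim_head = dim // num_heads)
            for _ in range(num_heads)
        ])
        self.to_out = nn.Linear(dim * num_heads, dim)

    def forward(self, x):
        # x: (batch, seq_len, dim) -> (batch, seq_len, dim)
        out = torch.cat([head(x) for head in self.heads], dim = -1)
        return self.to_out(out)

x = torch.randn(2, 1024, 512)
attn = ConcatHeadAttention(dim = 512, num_heads = 8)
print(attn(x).shape)  # torch.Size([2, 1024, 512])
```

Though, if I'm reading the README correctly, `SelfAttention` already takes a `heads` argument and handles the per-head split and concatenation internally, so maybe a single `SelfAttention(dim = 512, heads = 8)` is already the multi-head equivalent I'm after?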

I'm relatively new to transformers as a whole, so hopefully this question makes sense.

For reference, given the interest in a previous post, I've been exploring how effective Performer is within DETR (https://github.com/facebookresearch/detr); a rough adapter sketch for that follows below.
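Since DETR's encoder self-attention layers call nn.MultiheadAttention with (query, key, value) in the default (seq_len, batch, embed_dim) layout, my current plan is a small adapter roughly like the following; the name PerformerSelfAttnAdapter is mine, masks are simply ignored here, and the decoder's cross-attention isn't covered:

```python
import torch
from torch import nn
from performer_pytorch import SelfAttention

class PerformerSelfAttnAdapter(nn.Module):
    # Hypothetical adapter: exposes an nn.MultiheadAttention-style call for the
    # self-attention case only (query, key and value are the same tensor).
    def __init__(self, embed_dim, num_heads):
        super().__init__()
        self.attn = SelfAttention(dim = embed_dim, heads = num_heads)

    def forward(self, query, key, value, **kwargs):
        # nn.MultiheadAttention defaults to (seq_len, batch, embed_dim), while
        # SelfAttention expects (batch, seq_len, dim), so transpose in and out.
        # attn_mask / key_padding_mask in kwargs are ignored in this sketch.
        x = query.transpose(0, 1)
        out = self.attn(x).transpose(0, 1)
        return out, None  # Performer never materializes an attention matrix

q = torch.randn(100, 2, 256)  # (seq_len, batch, embed_dim), DETR-style layout
adapter = PerformerSelfAttnAdapter(embed_dim = 256, num_heads = 8)
out, _ = adapter(q, q, q)
print(out.shape)  # torch.Size([100, 2, 256])
```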

Thanks!