As the title says, has anyone tried replacing the multi-head attention in a typical transformer with the self-attention described in this library?
My thought was that I could essentially concatenate multiple self-attention outputs to replicate multi-head attention, per the attached diagram from the torch website.
I'm relatively new to transformers as a whole, so hopefully this question makes sense.
For reference, given the interest in a previous post, I've been exploring Performer's effectiveness with DETR (https://github.com/facebookresearch/detr).
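To make the idea concrete, here's a rough sketch of what I mean in plain PyTorch. The `SingleHeadAttention` class is just a stand-in for one attention block from the library (names and signatures here are placeholders, not the library's actual API):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SingleHeadAttention(nn.Module):
    # Placeholder for a single self-attention block (e.g. the library's attention with one head).
    def __init__(self, dim, head_dim):
        super().__init__()
        self.to_q = nn.Linear(dim, head_dim, bias=False)
        self.to_k = nn.Linear(dim, head_dim, bias=False)
        self.to_v = nn.Linear(dim, head_dim, bias=False)

    def forward(self, x):
        q, k, v = self.to_q(x), self.to_k(x), self.to_v(x)
        scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)
        return F.softmax(scores, dim=-1) @ v

class ConcatHeads(nn.Module):
    # Run several single-head attentions in parallel and concat their outputs,
    # as in the multi-head attention diagram from the torch docs.
    def __init__(self, dim, num_heads):
        super().__init__()
        assert dim % num_heads == 0
        head_dim = dim // num_heads
        self.heads = nn.ModuleList(SingleHeadAttention(dim, head_dim) for _ in range(num_heads))
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, x):
        return self.out_proj(torch.cat([h(x) for h in self.heads], dim=-1))

x = torch.randn(2, 10, 64)                     # (batch, seq_len, dim)
out = ConcatHeads(dim=64, num_heads=8)(x)
print(out.shape)                               # torch.Size([2, 10, 64])
```

The question is basically whether dropping the library's attention in for `SingleHeadAttention` (or just using its own heads argument, if it has one) is a sensible way to do this.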
thanks!