As the title says, has anyone tried replacing the multi-head attention in a typical transformer with the self-attention described in this library?
My thought was that I could essentially concatenate multiple self-attention outputs to replicate multi-head attention, per the attached diagram from the torch website.
I'm relatively new to transformers as a whole, so hopefully this question makes sense.
For reference, given the interest in a previous post, I've been exploring Performer's effectiveness with DETR (https://github.com/facebookresearch/detr).
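To make the idea concrete, here's a rough sketch of what I mean in plain PyTorch. The `SingleHeadAttention` class is just a stand-in for one attention block from the library (names and signatures here are placeholders, not the library's actual API):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SingleHeadAttention(nn.Module):
    # Placeholder for a single self-attention block (e.g. the library's attention with one head).
    def __init__(self, dim, head_dim):
        super().__init__()
        self.to_q = nn.Linear(dim, head_dim, bias=False)
        self.to_k = nn.Linear(dim, head_dim, bias=False)
        self.to_v = nn.Linear(dim, head_dim, bias=False)

    def forward(self, x):
        q, k, v = self.to_q(x), self.to_k(x), self.to_v(x)
        scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)
        return F.softmax(scores, dim=-1) @ v

class ConcatHeads(nn.Module):
    # Run several single-head attentions in parallel and concat their outputs,
    # as in the multi-head attention diagram from the torch docs.
    def __init__(self, dim, num_heads):
        super().__init__()
        assert dim % num_heads == 0
        head_dim = dim // num_heads
        self.heads = nn.ModuleList(SingleHeadAttention(dim, head_dim) for _ in range(num_heads))
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, x):
        return self.out_proj(torch.cat([h(x) for h in self.heads], dim=-1))

x = torch.randn(2, 10, 64)                     # (batch, seq_len, dim)
out = ConcatHeads(dim=64, num_heads=8)(x)
print(out.shape)                               # torch.Size([2, 10, 64])
```

The question is basically whether dropping the library's attention in for `SingleHeadAttention` (or just using its own heads argument, if it has one) is a sensible way to do this.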
thanks!