Psarpei closed this issue 11 months ago
@Psarpei i'm just applying some recent attention research i believe in
incidentally, this attention formulation was also used in the successful alphafold2
Thanks for your fast reply! I will check the paper out :) So you believe it's always better to include the gating in MHA and MHCA as well?
@Psarpei i like to either include head-wise gating, or a few memory key / values, if full memory / register tokens cannot be used
all of these engineering choices are addressing one of the underlying issues in attention
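For concreteness, a minimal sketch of the second option, a few learned memory key / value pairs prepended to standard multi-head attention. The class name, shapes, and the `num_mem_kv` default below are illustrative assumptions, not the repository's actual code:

```python
import torch
from torch import nn
from einops import rearrange, repeat


class AttentionWithMemoryKV(nn.Module):
    def __init__(self, dim, heads=8, dim_head=64, num_mem_kv=4):
        super().__init__()
        self.heads = heads
        inner_dim = heads * dim_head
        self.scale = dim_head ** -0.5

        self.to_qkv = nn.Linear(dim, inner_dim * 3, bias=False)
        # learned memory key / value slots, shared across the batch
        self.mem_k = nn.Parameter(torch.randn(heads, num_mem_kv, dim_head))
        self.mem_v = nn.Parameter(torch.randn(heads, num_mem_kv, dim_head))
        self.to_out = nn.Linear(inner_dim, dim, bias=False)

    def forward(self, x):
        b = x.shape[0]
        q, k, v = self.to_qkv(x).chunk(3, dim=-1)
        q, k, v = (rearrange(t, 'b n (h d) -> b h n d', h=self.heads) for t in (q, k, v))

        # prepend the learned memory keys / values along the sequence dimension
        mem_k = repeat(self.mem_k, 'h m d -> b h m d', b=b)
        mem_v = repeat(self.mem_v, 'h m d -> b h m d', b=b)
        k = torch.cat((mem_k, k), dim=-2)
        v = torch.cat((mem_v, v), dim=-2)

        # standard scaled dot-product attention over the extended key / value set
        sim = torch.einsum('b h i d, b h j d -> b h i j', q, k) * self.scale
        attn = sim.softmax(dim=-1)
        out = torch.einsum('b h i j, b h j d -> b h i d', attn, v)

        out = rearrange(out, 'b h n d -> b n (h d)')
        return self.to_out(out)
```

Because the memory keys / values are always present, every query has a learned slot it can attend to when nothing in the sequence is relevant, which is the same role full memory / register tokens play.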
Okay, thanks very much! I will definitely check both out and try to understand when each is applicable and helpful :)
I am a bit confused about the `gates` in the Attention module of `bs_roformer.py`. From my understanding, the code in lines 103-105 is not the standard multi-head attention approach, and the paper does not mention using anything else. Therefore, I would remove the parts using `gates`. What did I get wrong? What are the `gates` for, and why are they used? Can you clear this up?
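For reference, head-wise gating of the attention output, the same kind of gated attention used in AlphaFold2, can be sketched roughly as follows. The class and variable names are illustrative, and this is not the exact code from `bs_roformer.py`:

```python
import torch
from torch import nn
from einops import rearrange


class GatedAttention(nn.Module):
    def __init__(self, dim, heads=8, dim_head=64):
        super().__init__()
        self.heads = heads
        inner_dim = heads * dim_head
        self.scale = dim_head ** -0.5

        self.to_qkv = nn.Linear(dim, inner_dim * 3, bias=False)
        # one scalar gate per head and per token, applied to the attention output
        self.to_gates = nn.Linear(dim, heads)
        self.to_out = nn.Linear(inner_dim, dim, bias=False)

    def forward(self, x):
        q, k, v = self.to_qkv(x).chunk(3, dim=-1)
        q, k, v = (rearrange(t, 'b n (h d) -> b h n d', h=self.heads) for t in (q, k, v))

        # standard scaled dot-product attention
        sim = torch.einsum('b h i d, b h j d -> b h i j', q, k) * self.scale
        attn = sim.softmax(dim=-1)
        out = torch.einsum('b h i j, b h j d -> b h i d', attn, v)

        # head-wise gating: a sigmoid gate computed from the input modulates
        # each head's output, token by token
        gates = self.to_gates(x)                              # (b, n, h)
        out = out * rearrange(gates, 'b n h -> b h n 1').sigmoid()

        out = rearrange(out, 'b h n d -> b n (h d)')
        return self.to_out(out)
```

The sigmoid gate gives each head a per-token scalar it can push toward zero, so a head can effectively contribute nothing for tokens where its attention pattern is not useful.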