lucidrains / BS-RoFormer

Implementation of Band Split Roformer, SOTA Attention network for music source separation out of ByteDance AI Labs
MIT License

Gates in Attention module of bs_roformer.py #27

Closed · Psarpei closed this 11 months ago

Psarpei commented 11 months ago

I am a bit confused about the gates in the Attention module of bs_roformer.py. The code in lines 103-105 is:

out = self.attend(q, k, v)
gates = self.to_gates(x)
out = out * rearrange(gates, 'b n h -> b h n 1').sigmoid()

out = rearrange(out, 'b h n d -> b n (h d)')
return self.to_out(out)

From my understanding, this is not the standard multi-head attention approach, and the paper does not mention using anything else. Therefore, I would remove the gating parts, resulting in the following code:

out = self.attend(q, k, v)

out = rearrange(out, 'b h n d -> b n (h d)')
out = self.to_out(out)

What did I get wrong? What are the gates for, and why are they used? Can you clear this up?

lucidrains commented 11 months ago

@Psarpei i'm just applying some recent attention research i believe in

lucidrains commented 11 months ago

incidentally, this attention formulation was also used in the successful alphafold2
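roughly, the pattern is the following (a minimal sketch for illustration; the module and layer names are mine, not the exact ones in bs_roformer.py):

import torch
from torch import nn, einsum
from einops import rearrange

class GatedAttention(nn.Module):
    def __init__(self, dim, heads = 8, dim_head = 64):
        super().__init__()
        self.heads = heads
        self.scale = dim_head ** -0.5
        inner_dim = heads * dim_head

        self.to_qkv = nn.Linear(dim, inner_dim * 3, bias = False)
        # one scalar gate per head and per position, computed from the same input x
        self.to_gates = nn.Linear(dim, heads)
        self.to_out = nn.Linear(inner_dim, dim, bias = False)

    def forward(self, x):
        q, k, v = self.to_qkv(x).chunk(3, dim = -1)
        q, k, v = map(lambda t: rearrange(t, 'b n (h d) -> b h n d', h = self.heads), (q, k, v))

        # standard scaled dot product attention
        sim = einsum('b h i d, b h j d -> b h i j', q, k) * self.scale
        attn = sim.softmax(dim = -1)
        out = einsum('b h i j, b h j d -> b h i d', attn, v)

        # each head's output is scaled by a sigmoid gate in [0, 1],
        # so the network can dampen or shut off a head per position
        gates = self.to_gates(x)
        out = out * rearrange(gates, 'b n h -> b h n 1').sigmoid()

        out = rearrange(out, 'b h n d -> b n (h d)')
        return self.to_out(out)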

Psarpei commented 11 months ago

Thanks for your fast reply! I will check the paper out :) So you believe it's always better to include the gates in MHA, and in MHCA as well?

lucidrains commented 11 months ago

@Psarpei i like to either include head-wise gating, or a few memory key / values, if full memory / register tokens cannot be used

all of these engineering choices are addressing one of the underlying issues in attention
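for reference, the memory key / value idea in a minimal sketch (the class name and the num_mem_kv size are illustrative, not the exact code in this repo): a few learned key / value pairs are concatenated in front of the sequence's keys and values, so every query always has a default place to put its attention

import torch
from torch import nn, einsum
from einops import rearrange, repeat

class AttentionWithMemoryKV(nn.Module):
    def __init__(self, dim, heads = 8, dim_head = 64, num_mem_kv = 4):
        super().__init__()
        self.heads = heads
        self.scale = dim_head ** -0.5
        inner_dim = heads * dim_head

        self.to_qkv = nn.Linear(dim, inner_dim * 3, bias = False)
        # a few learned memory key / value vectors, shared across the batch
        self.mem_kv = nn.Parameter(torch.randn(2, heads, num_mem_kv, dim_head))
        self.to_out = nn.Linear(inner_dim, dim, bias = False)

    def forward(self, x):
        b = x.shape[0]
        q, k, v = self.to_qkv(x).chunk(3, dim = -1)
        q, k, v = map(lambda t: rearrange(t, 'b n (h d) -> b h n d', h = self.heads), (q, k, v))

        # prepend the learned memory keys / values to the real ones
        mk, mv = map(lambda t: repeat(t, 'h m d -> b h m d', b = b), self.mem_kv)
        k = torch.cat((mk, k), dim = -2)
        v = torch.cat((mv, v), dim = -2)

        sim = einsum('b h i d, b h j d -> b h i j', q, k) * self.scale
        attn = sim.softmax(dim = -1)
        out = einsum('b h i j, b h j d -> b h i d', attn, v)

        out = rearrange(out, 'b h n d -> b n (h d)')
        return self.to_out(out)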

Psarpei commented 11 months ago

Okay, thanks very much! I will definitely check both out and try to understand when each is applicable and helpful :)