lucidrains / BS-RoFormer

Implementation of Band Split Roformer, SOTA Attention network for music source separation out of ByteDance AI Labs
MIT License

Gates in Attention module of bs_roformer.py #27

Closed. Psarpei closed this issue 8 months ago.

Psarpei commented 8 months ago

I am a bit confused about the gates in the Attention module of bs_roformer.py. The code in lines 103-105 is:

out = self.attend(q, k, v)
gates = self.to_gates(x)
out = out * rearrange(gates, 'b n h -> b h n 1').sigmoid()

out = rearrange(out, 'b h n d -> b n (h d)')
return self.to_out(out)

From my understanding, this is not the standard multi-head attention approach, and the paper does not mention using anything else. Therefore, I would remove the parts using the gates, resulting in the following code:

out = self.attend(q, k, v)

out = rearrange(out, 'b h n d -> b n (h d)')
out = self.to_out(out)

What did I get wrong? What are the gates for, and why are they used? Can you clear this up?
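
For context, the gating amounts to a per-head sigmoid gate computed from the layer input and multiplied onto each head's attention output before the heads are merged. A minimal, self-contained sketch of the pattern (assuming PyTorch and einops; module and attribute names here are illustrative, not the repo's exact Attention class):

import torch
from torch import nn
from einops import rearrange

class GatedAttentionSketch(nn.Module):
    def __init__(self, dim, heads = 8, dim_head = 64):
        super().__init__()
        dim_inner = heads * dim_head
        self.heads = heads
        self.scale = dim_head ** -0.5
        self.to_qkv = nn.Linear(dim, dim_inner * 3, bias = False)
        self.to_gates = nn.Linear(dim, heads)   # one scalar gate per head, per position
        self.to_out = nn.Linear(dim_inner, dim, bias = False)

    def forward(self, x):
        q, k, v = self.to_qkv(x).chunk(3, dim = -1)
        q, k, v = map(lambda t: rearrange(t, 'b n (h d) -> b h n d', h = self.heads), (q, k, v))

        # plain scaled dot-product attention
        attn = (q @ k.transpose(-2, -1) * self.scale).softmax(dim = -1)
        out = attn @ v

        # head-wise sigmoid gate: a head can scale its own output down to (near) zero
        gates = self.to_gates(x)
        out = out * rearrange(gates, 'b n h -> b h n 1').sigmoid()

        out = rearrange(out, 'b h n d -> b n (h d)')
        return self.to_out(out)

The only difference from ungated multi-head attention is the two gating lines quoted above.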

lucidrains commented 8 months ago

@Psarpei i'm just applying some recent attention research i believe in

lucidrains commented 8 months ago

incidentally, this attention formulation was also used in the successful alphafold2

Psarpei commented 8 months ago

Thanks for your fast reply! I will check the paper out :) So you believe it's always better to include the gating in MHA and MHCA as well?

lucidrains commented 8 months ago

@Psarpei i like to either include head-wise gating, or a few memory key / values, if full memory / register tokens cannot be used

all of these engineering choices are addressing one of the underlying issues in attention
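
Here, "a few memory key / values" means a small set of learned key/value vectors concatenated to the real keys and values before the softmax, so every head always has a benign slot to place attention probability on instead of being forced to attend to actual tokens. A minimal sketch of that variant (assuming PyTorch and einops; names are illustrative, not the repo's implementation):

import torch
from torch import nn
from einops import rearrange, repeat

class MemoryKVAttentionSketch(nn.Module):
    def __init__(self, dim, heads = 8, dim_head = 64, num_mem_kv = 4):
        super().__init__()
        dim_inner = heads * dim_head
        self.heads = heads
        self.scale = dim_head ** -0.5
        self.to_qkv = nn.Linear(dim, dim_inner * 3, bias = False)
        # learned keys / values, shared across the batch and prepended per head
        self.mem_k = nn.Parameter(torch.randn(heads, num_mem_kv, dim_head))
        self.mem_v = nn.Parameter(torch.randn(heads, num_mem_kv, dim_head))
        self.to_out = nn.Linear(dim_inner, dim, bias = False)

    def forward(self, x):
        b = x.shape[0]
        q, k, v = self.to_qkv(x).chunk(3, dim = -1)
        q, k, v = map(lambda t: rearrange(t, 'b n (h d) -> b h n d', h = self.heads), (q, k, v))

        # prepend the learned memory keys / values along the sequence dimension
        mem_k = repeat(self.mem_k, 'h m d -> b h m d', b = b)
        mem_v = repeat(self.mem_v, 'h m d -> b h m d', b = b)
        k = torch.cat((mem_k, k), dim = -2)
        v = torch.cat((mem_v, v), dim = -2)

        attn = (q @ k.transpose(-2, -1) * self.scale).softmax(dim = -1)
        out = attn @ v

        out = rearrange(out, 'b h n d -> b n (h d)')
        return self.to_out(out)

Full memory / register tokens do the same job at the sequence level, while head-wise gating instead lets each head suppress its output after attention has been computed.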

Psarpei commented 8 months ago

Okay, thanks very much! I will definitely check both out and try to understand when each is applicable and helpful :)