lucidrains / BS-RoFormer

Implementation of Band Split Roformer, SOTA Attention network for music source separation out of ByteDance AI Labs
MIT License

Gates in Attention module of bs_roformer.py #27

Closed · Psarpei closed this 11 months ago

Psarpei commented 11 months ago

I am a bit confused about the gates in the Attention module of bs_roformer.py. The code in lines 103-105 is:

out = self.attend(q, k, v)
gates = self.to_gates(x)
out = out * rearrange(gates, 'b n h -> b h n 1').sigmoid()

out = rearrange(out, 'b h n d -> b n (h d)')
return self.to_out(out)

From my understanding, this is not the standard multi-head attention approach, and the paper does not mention using anything else. Therefore, I would remove the gating parts, resulting in the following code:

out = self.attend(q, k, v)

out = rearrange(out, 'b h n d -> b n (h d)')
out = self.to_out(out)

What did I get wrong? What are the gates for, and why are they used? Can you clear this up?

lucidrains commented 11 months ago

@Psarpei i'm just applying some recent attention research i believe in

lucidrains commented 11 months ago

incidentally, this attention formulation was also used in the successful alphafold2
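roughly, the pattern is the following (a minimal sketch for illustration; the module and layer names are mine, not the exact ones in bs_roformer.py):

import torch
from torch import nn, einsum
from einops import rearrange

class GatedAttention(nn.Module):
    def __init__(self, dim, heads = 8, dim_head = 64):
        super().__init__()
        self.heads = heads
        self.scale = dim_head ** -0.5
        inner_dim = heads * dim_head

        self.to_qkv = nn.Linear(dim, inner_dim * 3, bias = False)
        # one scalar gate per head and per position, computed from the same input x
        self.to_gates = nn.Linear(dim, heads)
        self.to_out = nn.Linear(inner_dim, dim, bias = False)

    def forward(self, x):
        q, k, v = self.to_qkv(x).chunk(3, dim = -1)
        q, k, v = map(lambda t: rearrange(t, 'b n (h d) -> b h n d', h = self.heads), (q, k, v))

        # standard scaled dot product attention
        sim = einsum('b h i d, b h j d -> b h i j', q, k) * self.scale
        attn = sim.softmax(dim = -1)
        out = einsum('b h i j, b h j d -> b h i d', attn, v)

        # each head's output is scaled by a sigmoid gate in [0, 1],
        # so the network can dampen or shut off a head per position
        gates = self.to_gates(x)
        out = out * rearrange(gates, 'b n h -> b h n 1').sigmoid()

        out = rearrange(out, 'b h n d -> b n (h d)')
        return self.to_out(out)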

Psarpei commented 11 months ago

Thanks for your fast reply! I will check the paper out :) So you believe it's always better to include the gates in MHA, and in MHCA as well?

lucidrains commented 11 months ago

@Psarpei i like to either include head-wise gating, or a few memory key / values, if full memory / register tokens cannot be used

all of these engineering choices are addressing one of the underlying issues in attention
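for reference, the memory key / value idea in a minimal sketch (the class name and the num_mem_kv size are illustrative, not the exact code in this repo): a few learned key / value pairs are concatenated in front of the sequence's keys and values, so every query always has a default place to put its attention

import torch
from torch import nn, einsum
from einops import rearrange, repeat

class AttentionWithMemoryKV(nn.Module):
    def __init__(self, dim, heads = 8, dim_head = 64, num_mem_kv = 4):
        super().__init__()
        self.heads = heads
        self.scale = dim_head ** -0.5
        inner_dim = heads * dim_head

        self.to_qkv = nn.Linear(dim, inner_dim * 3, bias = False)
        # a few learned memory key / value vectors, shared across the batch
        self.mem_kv = nn.Parameter(torch.randn(2, heads, num_mem_kv, dim_head))
        self.to_out = nn.Linear(inner_dim, dim, bias = False)

    def forward(self, x):
        b = x.shape[0]
        q, k, v = self.to_qkv(x).chunk(3, dim = -1)
        q, k, v = map(lambda t: rearrange(t, 'b n (h d) -> b h n d', h = self.heads), (q, k, v))

        # prepend the learned memory keys / values to the real ones
        mk, mv = map(lambda t: repeat(t, 'h m d -> b h m d', b = b), self.mem_kv)
        k = torch.cat((mk, k), dim = -2)
        v = torch.cat((mv, v), dim = -2)

        sim = einsum('b h i d, b h j d -> b h i j', q, k) * self.scale
        attn = sim.softmax(dim = -1)
        out = einsum('b h i j, b h j d -> b h i d', attn, v)

        out = rearrange(out, 'b h n d -> b n (h d)')
        return self.to_out(out)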

Psarpei commented 11 months ago

Okay, thanks very much! I will definitely check both out and try to understand when each is applicable and helpful :)