lucidrains / performer-pytorch

An implementation of Performer, a linear attention-based transformer, in Pytorch
MIT License

Extra FF when using cross attention #56

Closed gulnazaki closed 3 years ago

gulnazaki commented 3 years ago

Hello Phil,

I have noticed that when using cross attention, a whole new block (with both an attention layer and a FeedForward layer) is added, whereas only a cross-attention layer should be inserted between the self-attention and the FF layer.

Is there any reason for this?
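
To illustrate what I mean (using generic PyTorch modules as stand-ins for the Performer sublayers, so none of the names below are the library's actual classes, and pre-norms are omitted for brevity):

```python
import torch
from torch import nn

dim = 512

# stand-ins for the actual Performer sublayers
self_attn  = nn.MultiheadAttention(dim, num_heads = 8, batch_first = True)
cross_attn = nn.MultiheadAttention(dim, num_heads = 8, batch_first = True)
ff         = nn.Sequential(nn.Linear(dim, dim * 4), nn.GELU(), nn.Linear(dim * 4, dim))
extra_ff   = nn.Sequential(nn.Linear(dim, dim * 4), nn.GELU(), nn.Linear(dim * 4, dim))

x, context = torch.randn(1, 16, dim), torch.randn(1, 32, dim)

# standard decoder layer: self-attention -> cross-attention -> FF
y = x + self_attn(x, x, x)[0]
y = y + cross_attn(y, context, context)[0]
y = y + ff(y)

# what cross_attend currently produces: the cross attention arrives bundled with its own FF,
# so the layer becomes self-attention -> FF -> cross-attention -> extra FF
z = x + self_attn(x, x, x)[0]
z = z + ff(z)
z = z + cross_attn(z, context, context)[0]
z = z + extra_ff(z)
```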

lucidrains commented 3 years ago

@gulnazaki Good catch! It's actually because of the way I have things set up with reversibility. You are right that an extra feedforward isn't faithful to the original design, but in practice I think it makes little difference. Perhaps it may even improve things: https://arxiv.org/abs/1906.02762 I'll see what I can do tomorrow.
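
For context, the reversible wrapper runs two residual streams and consumes sublayers strictly in pairs, roughly like this sketch (simplified, not the library's exact code):

```python
import torch
from torch import nn

class ReversibleBlock(nn.Module):
    # minimal RevNet-style block: each block needs a *pair* of functions (f, g)
    def __init__(self, f, g):
        super().__init__()
        self.f, self.g = f, g

    def forward(self, x1, x2):
        y1 = x1 + self.f(x2)
        y2 = x2 + self.g(y1)
        return y1, y2

    def inverse(self, y1, y2):
        # inputs can be recomputed from the outputs, so activations need not be stored
        x2 = y2 - self.g(y1)
        x1 = y1 - self.f(x2)
        return x1, x2

# in the self-attention case the pair is (attention, feedforward);
# here two linear layers stand in for those sublayers
block = ReversibleBlock(nn.Linear(512, 512), nn.Linear(512, 512))
x1 = x2 = torch.randn(1, 16, 512)
y1, y2 = block(x1, x2)
```

Since every block needs both an f and a g, the path of least resistance was to give cross attention its own (cross-attention, feedforward) pair, which is where the extra FF comes from.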

gulnazaki commented 3 years ago

Wow, crazy paper, thanks for sharing. I can't argue about the effect of the extra FF sublayer, because I haven't found any paper/architecture using it yet.

I also couldn't find a clean way to patch this (mostly because of the reversibility), so let me know when you look into it :)

I will run a comparison, and there could be some interesting findings!

gulnazaki commented 3 years ago

Also, note that in that paper the cross_attend scheme for a layer is: FF, self-attention, cross-attention, FF, instead of self-attention, FF, cross-attention, FF. And there is a 1/2 factor on the two FFs, with an inner dimension of 2d instead of 4d (though that is mostly for a fair comparison).
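
Roughly (again with stand-in modules and pre-norms omitted), assuming the 1/2 factor is applied to the output of each half-sized FF:

```python
import torch
from torch import nn

dim = 512

# two feedforwards with inner dimension 2 * dim (instead of 4 * dim), each scaled by 1/2
ff_pre  = nn.Sequential(nn.Linear(dim, dim * 2), nn.GELU(), nn.Linear(dim * 2, dim))
ff_post = nn.Sequential(nn.Linear(dim, dim * 2), nn.GELU(), nn.Linear(dim * 2, dim))
self_attn  = nn.MultiheadAttention(dim, num_heads = 8, batch_first = True)
cross_attn = nn.MultiheadAttention(dim, num_heads = 8, batch_first = True)

x, context = torch.randn(1, 16, dim), torch.randn(1, 32, dim)

# FF (x 1/2) -> self-attention -> cross-attention -> FF (x 1/2)
y = x + 0.5 * ff_pre(x)
y = y + self_attn(y, y, y)[0]
y = y + cross_attn(y, context, context)[0]
y = y + 0.5 * ff_post(y)
```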

lucidrains commented 3 years ago

@gulnazaki Yup, they scale their feedforwards by 1/2 in that paper, but we have other papers showing that an excess of feedforwards does no harm and may even bring benefits: https://arxiv.org/abs/2009.04534

I'll rewrite the sequential logic to follow the traditional ordering though, just to avoid confusion :)

lucidrains commented 3 years ago

Another paper to read to make you question the dogma a bit more: https://arxiv.org/abs/1911.03864

gulnazaki commented 3 years ago

Don't worry, I just thought you hadn't noticed.

Interesting stuff though, it is a jungle out there

gulnazaki commented 3 years ago

I implemented the original architecture for the sequential (non-reversible) path on my fork (it also supports adding two cross-attention sublayers for multi-source setups), if you want to take a look.

https://github.com/gulnazaki/performer-pytorch/commit/4346b648ed6a2aaee28a2e4d977e2a11a0fb5227

So, I am closing this.

lucidrains commented 3 years ago

@gulnazaki amazing! :D