Closed gulnazaki closed 3 years ago
@gulnazaki Good catch! It's actually because of the way I have things set up with reversibility. You are right that an extra feedforward isn't faithful to the original design, but in practice I think it makes little difference. Perhaps it may even improve things: https://arxiv.org/abs/1906.02762 I'll see what I can do tomorrow.
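Roughly what I mean by the reversibility constraint, as a minimal RevNet-style sketch (placeholder sublayers, not the actual modules in this repo):

```python
import torch.nn as nn

class ReversibleBlock(nn.Module):
    # Minimal RevNet-style coupling: every block needs an (f, g) pair of sublayers.
    def __init__(self, f: nn.Module, g: nn.Module):
        super().__init__()
        self.f = f
        self.g = g

    def forward(self, x1, x2):
        y1 = x1 + self.f(x2)  # f slot (e.g. an attention sublayer)
        y2 = x2 + self.g(y1)  # g slot (e.g. a feedforward sublayer)
        return y1, y2

# Since sublayers come in pairs, a lone cross-attention has no natural slot,
# so its partner slot currently gets filled by an extra FeedForward.
```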
Wow, crazy paper, thanks for sharing. I can't argue about the effect of the extra FF sublayer, because I couldn't find any paper/architecture using it yet.
I also couldn't find a clean way to patch this (mostly because of the reversibility), so let me know if you look into it :)
I will do a comparison; there could be some interesting findings!
Also, note that in the paper the cross_attend scheme for a layer is: FF, self-attention, cross-attention, FF, instead of self-attention, FF, cross-attention, FF. And there is a 1/2 factor for the two FFs, with an inner dimension of 2d instead of 4d (that is mostly for a fair comparison, though).
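Sketched out, that layer would look something like this (placeholder attention modules, norms omitted, just to show the ordering, the 1/2 factor and the 2d inner dimension):

```python
import torch.nn as nn

class PaperCrossLayer(nn.Module):
    # Sublayer order from the paper: FF, self-attention, cross-attention, FF,
    # with each FF scaled by 1/2 and using an inner dimension of 2*dim.
    def __init__(self, dim, self_attn, cross_attn):
        super().__init__()
        self.ff_pre = nn.Sequential(nn.Linear(dim, 2 * dim), nn.GELU(), nn.Linear(2 * dim, dim))
        self.self_attn = self_attn
        self.cross_attn = cross_attn
        self.ff_post = nn.Sequential(nn.Linear(dim, 2 * dim), nn.GELU(), nn.Linear(2 * dim, dim))

    def forward(self, x, context):
        x = x + 0.5 * self.ff_pre(x)       # half-step FF
        x = x + self.self_attn(x)
        x = x + self.cross_attn(x, context)
        x = x + 0.5 * self.ff_post(x)      # half-step FF
        return x
```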
@gulnazaki Yup, they scale their feedforwards by 1/2 in that paper, but there are other papers showing that an excess of feedforwards does no harm and may even bring benefits: https://arxiv.org/abs/2009.04534
I'll rewrite the sequential logic to the traditional method though, just to avoid confusion :)
Another paper to read to make you question dogma a bit more: https://arxiv.org/abs/1911.03864
Don't worry, I just thought you hadn't noticed.
Interesting stuff though, it is a jungle out there
I implemented the original architecture for the sequential version on my fork (it also supports adding two cross-attention sublayers for multi-source setups), if you want to take a look:
https://github.com/gulnazaki/performer-pytorch/commit/4346b648ed6a2aaee28a2e4d977e2a11a0fb5227
So, I am closing this
@gulnazaki amazing! :D
Hello Phil,
I have noticed that when using cross attention, a whole new block (an attention plus a FeedForward layer) is added, while only a cross-attention layer should be inserted between the self-attention and the FF layer.
Is there any reason for this?
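To make the difference concrete (module names are just placeholders, not the repo's actual classes):

```python
import torch.nn as nn

def expected_layer(self_attn, cross_attn, ff):
    # what I expected: a single cross-attention sublayer inserted
    # between self-attention and the feedforward
    return nn.ModuleList([self_attn, cross_attn, ff])

def current_layer(self_attn, cross_attn, ff_a, ff_b):
    # what seems to get built now: a whole extra (attention, FF) block
    return nn.ModuleList([self_attn, ff_a, cross_attn, ff_b])
```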