Closed: gulnazaki closed this issue 3 years ago
@gulnazaki Oh hello! You have wandered over from Performer to here lol
I think that would make sense, but the affine transform from the final layer norm should take care of the temperature (unless I am mistaken in my intuition)
I'll run an experiment later today on https://github.com/lucidrains/x-transformers and see if there are any big differences, and make the change if so!
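For reference, a minimal PyTorch sketch (not from either repo) of the intuition above: LayerNorm is invariant to rescaling its input by a positive constant, and its affine parameters can learn any output scale, so a fixed `sqrt(d_model)` temperature on the embeddings can in principle be absorbed by a final pre-logits layer norm.

```python
# Minimal demo: LayerNorm is (up to eps/floating point error) invariant to a
# constant positive rescaling of its input, so a sqrt(d_model) factor on the
# embeddings can be absorbed by the final layer norm's affine parameters.
import math
import torch
import torch.nn as nn

d_model = 512
ln = nn.LayerNorm(d_model)

x = torch.randn(2, 16, d_model)       # (batch, seq, dim)
scaled = x * math.sqrt(d_model)       # same input, scaled by sqrt(d_model)

# outputs match up to a tiny eps-related error
print(torch.allclose(ln(x), ln(scaled), atol=1e-4))  # True
```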
Haha yep, I wanted to do some comparisons.
Actually, that is not Reformer specific, but I raised the issue here because embedding scaling came to mind while I was reading the code. I am not sure either whether it would make a difference, since the original Transformer is Post-LN, but who knows.
Thanks and let me know if it helps for the tied embedding case.
@gulnazaki yea, I've actually added a post-LN norm at the very end of all the layers of attention, for all my repositories, after noticing this was done for GPT-3 (and some other papers)
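A rough sketch of that idea (illustrative only, not the actual x-transformers code): a stack of pre-norm residual blocks followed by one extra LayerNorm at the very end, in the style of GPT-2/GPT-3's final norm. `block_fn` here is a hypothetical placeholder for an attention/feedforward block constructor.

```python
# Sketch: pre-norm residual blocks plus a single LayerNorm after the whole stack,
# applied right before the output projection.
import torch.nn as nn

class PreNormStack(nn.Module):
    def __init__(self, dim, depth, block_fn):
        super().__init__()
        self.blocks = nn.ModuleList([block_fn(dim) for _ in range(depth)])
        self.final_norm = nn.LayerNorm(dim)   # the extra norm at the very end

    def forward(self, x):
        for block in self.blocks:
            x = x + block(x)                  # residual blocks (norms live inside block_fn)
        return self.final_norm(x)             # normalize once more before the logits
```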
yeah will do!
@gulnazaki I just tried it and it wasn't great
I think the only thing left to try is FixNorm,
as presented in this paper: https://arxiv.org/abs/1910.05895. But I doubt there will be any substantial gain, as opposed to longer contexts and scaling up
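For context, a minimal sketch of FixNorm as described in that paper: l2-normalize each word embedding and scale by a single learned scalar `g`, so all embeddings lie on a hypersphere of radius `g`. The class and parameter names below are illustrative, not taken from any repo.

```python
# Sketch of FixNorm (https://arxiv.org/abs/1910.05895): unit-norm word embeddings
# scaled by one learned global scalar g.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FixNormEmbedding(nn.Module):
    def __init__(self, num_tokens, dim, init_g=1.0):
        super().__init__()
        self.emb = nn.Embedding(num_tokens, dim)
        self.g = nn.Parameter(torch.tensor(init_g))   # learned global scale

    def forward(self, token_ids):
        e = self.emb(token_ids)
        return self.g * F.normalize(e, dim=-1)        # l2-normalize, then scale by g
```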
Hi there,
I've seen that in the original Transformer paper, and in many implementations, the weights of the embedding layers are multiplied by the square root of the model dimension when tying the embeddings. I am not sure if this would be beneficial to the Reformer architecture (or other transformer derivatives). Have you experimented with this, or do you have more insight on this matter?
Thanks
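For context, the scaling being asked about looks roughly like this (a minimal sketch of the scheme from "Attention Is All You Need", not Reformer-specific): the token embedding and the output projection share weights, and the embedding output is multiplied by `sqrt(d_model)` on the input side only.

```python
# Sketch of tied embeddings with the sqrt(d_model) input-side scaling.
import math
import torch
import torch.nn as nn

class TiedEmbedding(nn.Module):
    def __init__(self, num_tokens, d_model):
        super().__init__()
        self.d_model = d_model
        self.emb = nn.Embedding(num_tokens, d_model)
        self.to_logits = nn.Linear(d_model, num_tokens, bias=False)
        self.to_logits.weight = self.emb.weight       # weight tying

    def embed(self, token_ids):
        # scale only when embedding input tokens
        return self.emb(token_ids) * math.sqrt(self.d_model)

    def logits(self, hidden):
        # shared weights used as the output projection, no scaling here
        return self.to_logits(hidden)
```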