Closed: gulnazaki closed this issue 3 years ago
@gulnazaki Oh hello! You have wandered over from Performer to here lol
I think that would make sense, but the affine transform from the final layer norm should take care of the temperature (unless I am mistaken in my intuition)
I'll run an experiment later today on https://github.com/lucidrains/x-transformers and see if there are any big differences, and make the change if so!
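For reference, a minimal PyTorch sketch (not from either repo) of the intuition above: LayerNorm is invariant to rescaling its input by a positive constant, and its affine parameters can learn any output scale, so a fixed `sqrt(d_model)` temperature on the embeddings can in principle be absorbed by a final pre-logits layer norm.

```python
# Minimal demo: LayerNorm is (up to eps/floating point error) invariant to a
# constant positive rescaling of its input, so a sqrt(d_model) factor on the
# embeddings can be absorbed by the final layer norm's affine parameters.
import math
import torch
import torch.nn as nn

d_model = 512
ln = nn.LayerNorm(d_model)

x = torch.randn(2, 16, d_model)       # (batch, seq, dim)
scaled = x * math.sqrt(d_model)       # same input, scaled by sqrt(d_model)

# outputs match up to a tiny eps-related error
print(torch.allclose(ln(x), ln(scaled), atol=1e-4))  # True
```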
Haha yep, I wanted to do some comparisons.
Actually, that is not Reformer specific, but I raised the issue here because embedding scaling came to mind while I was reading the code. I am not sure either whether it would make a difference, since the original Transformer is Post-LN, but who knows.
Thanks and let me know if it helps for the tied embedding case.
@gulnazaki yea, I've actually added a post-LN norm at the very end of all the layers of attention, for all my repositories, after noticing this was done for GPT-3 (and some other papers)
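A rough sketch of that idea (illustrative only, not the actual x-transformers code): a stack of pre-norm residual blocks followed by one extra LayerNorm at the very end, in the style of GPT-2/GPT-3's final norm. `block_fn` here is a hypothetical placeholder for an attention/feedforward block constructor.

```python
# Sketch: pre-norm residual blocks plus a single LayerNorm after the whole stack,
# applied right before the output projection.
import torch.nn as nn

class PreNormStack(nn.Module):
    def __init__(self, dim, depth, block_fn):
        super().__init__()
        self.blocks = nn.ModuleList([block_fn(dim) for _ in range(depth)])
        self.final_norm = nn.LayerNorm(dim)   # the extra norm at the very end

    def forward(self, x):
        for block in self.blocks:
            x = x + block(x)                  # residual blocks (norms live inside block_fn)
        return self.final_norm(x)             # normalize once more before the logits
```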
yeah will do!
@gulnazaki I just tried it and it wasn't great
I think the only thing left to try is FixNorm,
as presented in this paper: https://arxiv.org/abs/1910.05895. But I doubt there will be any substantial gain, as opposed to longer contexts and scaling up
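For context, a minimal sketch of FixNorm as described in that paper: l2-normalize each word embedding and scale by a single learned scalar `g`, so all embeddings lie on a hypersphere of radius `g`. The class and parameter names below are illustrative, not taken from any repo.

```python
# Sketch of FixNorm (https://arxiv.org/abs/1910.05895): unit-norm word embeddings
# scaled by one learned global scalar g.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FixNormEmbedding(nn.Module):
    def __init__(self, num_tokens, dim, init_g=1.0):
        super().__init__()
        self.emb = nn.Embedding(num_tokens, dim)
        self.g = nn.Parameter(torch.tensor(init_g))   # learned global scale

    def forward(self, token_ids):
        e = self.emb(token_ids)
        return self.g * F.normalize(e, dim=-1)        # l2-normalize, then scale by g
```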
Hi there,
I've seen that in the original Transformer paper, and in many implementations, the weights of the embedding layers are multiplied by the square root of the model dimension when tying the embeddings. I am not sure if this would be beneficial to the Reformer architecture (or other transformer derivatives). Have you experimented with this, or do you have more insight on this matter?
Thanks
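For context, the scaling being asked about looks roughly like this (a minimal sketch of the scheme from "Attention Is All You Need", not Reformer-specific): the token embedding and the output projection share weights, and the embedding output is multiplied by `sqrt(d_model)` on the input side only.

```python
# Sketch of tied embeddings with the sqrt(d_model) input-side scaling.
import math
import torch
import torch.nn as nn

class TiedEmbedding(nn.Module):
    def __init__(self, num_tokens, d_model):
        super().__init__()
        self.d_model = d_model
        self.emb = nn.Embedding(num_tokens, d_model)
        self.to_logits = nn.Linear(d_model, num_tokens, bias=False)
        self.to_logits.weight = self.emb.weight       # weight tying

    def embed(self, token_ids):
        # scale only when embedding input tokens
        return self.emb(token_ids) * math.sqrt(self.d_model)

    def logits(self, hidden):
        # shared weights used as the output projection, no scaling here
        return self.to_logits(hidden)
```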