Code for the ICML'20 paper "Improving Transformer Optimization Through Better Initialization"
MIT License
89 stars · 11 forks
Does adding layer norm on top of T-Fixup make the model even better, or does T-Fixup make layer norm completely unnecessary (i.e., no performance gain)? #7
I don't quite follow the comparison between T-Fixup and T-Fixup + layer norm in the paper. Hopefully you have some insight into this and can answer. Thanks.
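For context, the point of T-Fixup is to rescale the initialization so that training is stable *without* layer norm (and without warmup). A minimal numpy sketch of the encoder-side scaling rule as described in the paper — the `0.67 * N**-0.25` factor applied on top of Xavier init to the value projection, attention output projection, and feed-forward weights — assuming a single encoder layer with hypothetical weight names (`w_v`, `w_o`, `w_ff1`, `w_ff2`):

```python
import numpy as np

def xavier_uniform(fan_in, fan_out, rng):
    # Standard Xavier/Glorot uniform initialization.
    bound = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-bound, bound, size=(fan_out, fan_in))

def t_fixup_encoder_init(d_model, n_enc_layers, rng):
    """Sketch of T-Fixup init for one encoder layer (no layer norm, no warmup).

    Per the paper: start from Xavier init, then scale the value projection,
    the attention output projection, and the feed-forward weights by
    0.67 * N^{-1/4}, where N is the number of encoder layers.
    (Weight names here are illustrative, not the repo's actual identifiers.)
    """
    scale = 0.67 * n_enc_layers ** (-0.25)
    w_v = scale * xavier_uniform(d_model, d_model, rng)        # value projection
    w_o = scale * xavier_uniform(d_model, d_model, rng)        # attention output
    w_ff1 = scale * xavier_uniform(d_model, 4 * d_model, rng)  # FFN up-projection
    w_ff2 = scale * xavier_uniform(4 * d_model, d_model, rng)  # FFN down-projection
    return w_v, w_o, w_ff1, w_ff2

rng = np.random.default_rng(0)
w_v, w_o, w_ff1, w_ff2 = t_fixup_encoder_init(64, 6, rng)
```

Since the scaled weights start smaller than plain Xavier, each residual branch contributes less at initialization, which is what removes the need for layer norm to keep updates bounded early in training. Whether adding layer norm back on top of this buys anything extra is exactly the question above.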