Code for the ICML'20 paper "Improving Transformer Optimization Through Better Initialization"
MIT License
89 stars · 11 forks
Does adding layer norm on top of T-Fixup make the model even better, or does T-Fixup make layer norm completely unnecessary (i.e., no performance gain)? #7
I don't quite follow the comparison between T-Fixup and T-Fixup + layer norm in the paper. Hopefully you have some insight into this and can answer. Thanks.
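For context, the point of T-Fixup is to rescale the initialization so that training is stable *without* layer norm (and without warmup). A minimal numpy sketch of the encoder-side scaling rule as described in the paper — the `0.67 * N**-0.25` factor applied on top of Xavier init to the value projection, attention output projection, and feed-forward weights — assuming a single encoder layer with hypothetical weight names (`w_v`, `w_o`, `w_ff1`, `w_ff2`):

```python
import numpy as np

def xavier_uniform(fan_in, fan_out, rng):
    # Standard Xavier/Glorot uniform initialization.
    bound = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-bound, bound, size=(fan_out, fan_in))

def t_fixup_encoder_init(d_model, n_enc_layers, rng):
    """Sketch of T-Fixup init for one encoder layer (no layer norm, no warmup).

    Per the paper: start from Xavier init, then scale the value projection,
    the attention output projection, and the feed-forward weights by
    0.67 * N^{-1/4}, where N is the number of encoder layers.
    (Weight names here are illustrative, not the repo's actual identifiers.)
    """
    scale = 0.67 * n_enc_layers ** (-0.25)
    w_v = scale * xavier_uniform(d_model, d_model, rng)        # value projection
    w_o = scale * xavier_uniform(d_model, d_model, rng)        # attention output
    w_ff1 = scale * xavier_uniform(d_model, 4 * d_model, rng)  # FFN up-projection
    w_ff2 = scale * xavier_uniform(4 * d_model, d_model, rng)  # FFN down-projection
    return w_v, w_o, w_ff1, w_ff2

rng = np.random.default_rng(0)
w_v, w_o, w_ff1, w_ff2 = t_fixup_encoder_init(64, 6, rng)
```

Since the scaled weights start smaller than plain Xavier, each residual branch contributes less at initialization, which is what removes the need for layer norm to keep updates bounded early in training. Whether adding layer norm back on top of this buys anything extra is exactly the question above.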