wangqiangneu / MT-PaperReading

Record my paper reading about Machine Translation and other related works.

19-EMNLP-Improving Deep Transformer with Depth-Scaled Initialization and Merged Attention #2


wangqiangneu commented 5 years ago

Overview

Trains deep Transformers. It keeps the post-norm setup and works purely by changing how parameters are initialized. The vanilla Transformer initializes parameters (e.g., a linear layer) as $U(-r, r)$ with $r=\sqrt{\frac{6}{f_{in}+f_{out}}}$; this is replaced by $r'=\frac{r}{\sqrt{l}}$, where $l$ is the index of the current layer. As a result, the output variance of higher layers is reduced, and gradient vanishing becomes less likely.
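A minimal sketch of this depth-scaled initialization, assuming PyTorch linear layers; the helper name `ds_init_linear` is hypothetical and follows the formula above rather than the authors' released code:

```python
import math
import torch.nn as nn

def ds_init_linear(linear: nn.Linear, layer_index: int) -> nn.Linear:
    """Depth-scaled uniform initialization for a linear layer.

    Standard Xavier/Glorot uniform bound: r = sqrt(6 / (f_in + f_out)).
    Depth-scaled variant: r' = r / sqrt(l), where l is the 1-based index
    of the Transformer layer this projection belongs to.
    """
    fan_in, fan_out = linear.in_features, linear.out_features
    r = math.sqrt(6.0 / (fan_in + fan_out))
    r_scaled = r / math.sqrt(layer_index)
    nn.init.uniform_(linear.weight, -r_scaled, r_scaled)
    if linear.bias is not None:
        nn.init.zeros_(linear.bias)
    return linear

# Example: feed-forward projections of a 12-layer encoder; deeper layers
# get a smaller initialization range, shrinking their output variance.
layers = [ds_init_linear(nn.Linear(512, 2048), layer_index=l) for l in range(1, 13)]
```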

Paper information

Summary