wangqiangneu / MT-PaperReading

Record my paper reading about Machine Translation and other related works.

20-Arxiv-ReZero is All You Need: Fast Convergence at Large Depth #55

wangqiangneu opened this issue 4 years ago

wangqiangneu commented 4 years ago

Overview

An improvement to residual connections that allows deeper networks to be trained faster. The method is very simple. Taking the Transformer as an example: remove LayerNorm and compute $y = x + \alpha \cdot F(x)$, where $\alpha$ is initialized to 0. The approach is quite similar to DLCL, which also uses a learnable scalar in each layer. The differences: DLCL emphasizes the connections to all previous layers, so ReZero learns O(L) parameters while DLCL learns O(L^2); moreover, ReZero applies this scaling to every sub-layer of each Transformer layer, with $\alpha$ shared across all sub-layers, whereas DLCL operates only at the layer level, not on the sub-layers.
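A minimal PyTorch sketch of what such a layer could look like (the module name, hyperparameters, and the attention/FFN details are illustrative assumptions; the only points taken from the paper are removing LayerNorm, scaling each sub-layer by a learnable $\alpha$ initialized to 0, and sharing $\alpha$ within the layer):

```python
import torch
import torch.nn as nn

class ReZeroTransformerLayer(nn.Module):
    """Transformer layer with ReZero residuals: no LayerNorm, each sub-layer
    output is scaled by a learnable scalar alpha that starts at 0 and is
    shared by both sub-layers, i.e. y = x + alpha * F(x)."""

    def __init__(self, d_model=512, n_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        # One residual scalar per layer, shared by the attention and FFN sub-layers.
        self.alpha = nn.Parameter(torch.zeros(1))
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, attn_mask=None):
        # Self-attention sub-layer: x <- x + alpha * Attn(x)
        attn_out, _ = self.self_attn(x, x, x, attn_mask=attn_mask)
        x = x + self.alpha * self.dropout(attn_out)
        # Feed-forward sub-layer: x <- x + alpha * FFN(x)
        x = x + self.alpha * self.dropout(self.ffn(x))
        return x
```

Because $\alpha$ starts at 0, each layer is initially the identity mapping, which is what lets very deep stacks start training stably.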

Paper Information

Summary

hukkai commented 4 years ago

I think the idea of this paper can already be found in Fixup initialization, where the author explained why training can diverge when using a scalar scale: https://github.com/hongyi-zhang/Fixup/issues/6#issuecomment-506460413

wangqiangneu commented 4 years ago

> I think the idea of this paper can already be found in Fixup initialization, where the author explained why training can diverge when using a scalar scale: hongyi-zhang/Fixup#6 (comment)

Good point! Maybe using a separate, smaller learning rate for those scalars is the key.
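A small sketch of that suggestion using PyTorch optimizer parameter groups (the split by parameter name and the concrete learning rates are purely illustrative assumptions, not from the paper):

```python
import torch

model = ReZeroTransformerLayer()  # hypothetical module from the sketch above

# Put the residual scalars in their own parameter group with a smaller learning rate.
scalar_params = [p for n, p in model.named_parameters() if n.endswith("alpha")]
other_params = [p for n, p in model.named_parameters() if not n.endswith("alpha")]

optimizer = torch.optim.Adam([
    {"params": other_params, "lr": 1e-3},
    {"params": scalar_params, "lr": 1e-4},  # separate, smaller LR for the scalars
])
```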