I think the idea of the paper can already be found in Fixup initialization, where the author explained why training can diverge when using a scalar scale: https://github.com/hongyi-zhang/Fixup/issues/6#issuecomment-506460413
Good point! Maybe using a separate, smaller learning rate for those scalars is the key.
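If one wanted to try that, a minimal sketch could look like the following. It assumes a PyTorch model whose residual scalars have "alpha" in their parameter names; the function name and learning-rate values are made up for illustration, not taken from the thread or the paper.

```python
import torch

def build_optimizer(model: torch.nn.Module,
                    base_lr: float = 1e-3,
                    scalar_lr: float = 1e-4) -> torch.optim.Optimizer:
    # Split parameters: the residual scalars get their own, smaller learning rate.
    scalar_params = [p for n, p in model.named_parameters() if "alpha" in n]
    other_params = [p for n, p in model.named_parameters() if "alpha" not in n]
    return torch.optim.Adam([
        {"params": other_params, "lr": base_lr},
        {"params": scalar_params, "lr": scalar_lr},  # separate, small lr for the scalars
    ])
```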
Introduction

An improvement to the residual connection that makes it possible to train deeper networks faster. The method is very simple: taking the Transformer as an example, remove layernorm and compute $y = x + \alpha \cdot F(x)$, where $\alpha$ must be initialized to 0. The approach is quite similar to DLCL, which also uses a learnable scalar at every layer. The difference is that DLCL emphasizes the connections to all earlier layers, so ReZero learns O(L) parameters while DLCL learns O(L^2). In addition, ReZero applies this scaling in every sublayer of each Transformer layer, with $\alpha$ shared across the sublayers, whereas DLCL only operates at the layer level, not on sublayers.
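Below is a minimal PyTorch sketch of how such a ReZero Transformer layer could look; it is only an illustration of the note above, not the paper's reference implementation, and the class and parameter names (ReZeroEncoderLayer, alpha, dim_ff) are made up for the example.

```python
import torch
import torch.nn as nn

class ReZeroEncoderLayer(nn.Module):
    """Transformer encoder layer with ReZero residuals: y = x + alpha * F(x).

    LayerNorm is removed and a single learnable scalar alpha, initialized to 0,
    is shared by both sublayers (self-attention and feed-forward) of the layer.
    """
    def __init__(self, d_model: int, nhead: int, dim_ff: int = 2048, dropout: float = 0.1):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, nhead, dropout=dropout, batch_first=True)
        self.ff = nn.Sequential(
            nn.Linear(d_model, dim_ff),
            nn.ReLU(),
            nn.Linear(dim_ff, d_model),
        )
        self.dropout = nn.Dropout(dropout)
        # The ReZero scalar: initialized to zero so the layer starts as the identity map.
        self.alpha = nn.Parameter(torch.zeros(1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Self-attention sublayer: x + alpha * Attn(x)
        attn_out, _ = self.self_attn(x, x, x, need_weights=False)
        x = x + self.alpha * self.dropout(attn_out)
        # Feed-forward sublayer: x + alpha * FFN(x), with the same alpha as above
        x = x + self.alpha * self.dropout(self.ff(x))
        return x
```

Because alpha starts at 0, layer(x) returns x exactly at initialization, so a very deep stack of such layers is well-behaved at the start of training even without layernorm.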
Paper info

Summary