I think the idea of the paper can already be found in Fixup initialization, where the author explained why training can diverge when using a scalar scale: https://github.com/hongyi-zhang/Fixup/issues/6#issuecomment-506460413
Good point! Maybe using a separate, smaller learning rate for those scalars is the key.
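If one wanted to try that, a minimal sketch could look like the following. It assumes a PyTorch model whose residual scalars have "alpha" in their parameter names; the function name and learning-rate values are made up for illustration, not taken from the thread or the paper.

```python
import torch

def build_optimizer(model: torch.nn.Module,
                    base_lr: float = 1e-3,
                    scalar_lr: float = 1e-4) -> torch.optim.Optimizer:
    # Split parameters: the residual scalars get their own, smaller learning rate.
    scalar_params = [p for n, p in model.named_parameters() if "alpha" in n]
    other_params = [p for n, p in model.named_parameters() if "alpha" not in n]
    return torch.optim.Adam([
        {"params": other_params, "lr": base_lr},
        {"params": scalar_params, "lr": scalar_lr},  # separate, small lr for the scalars
    ])
```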
Introduction

An improvement to the residual connection that makes it possible to train deeper networks faster. The method is very simple: taking the Transformer as an example, remove layernorm and compute $y = x + \alpha \cdot F(x)$, where $\alpha$ must be initialized to 0. The approach is quite similar to DLCL, which also uses a learnable scalar at every layer. The difference is that DLCL emphasizes the connections to all earlier layers, so ReZero learns O(L) parameters while DLCL learns O(L^2). In addition, ReZero applies this scaling in every sublayer of each Transformer layer, with $\alpha$ shared across the sublayers, whereas DLCL only operates at the layer level, not on sublayers.
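Below is a minimal PyTorch sketch of how such a ReZero Transformer layer could look; it is only an illustration of the note above, not the paper's reference implementation, and the class and parameter names (ReZeroEncoderLayer, alpha, dim_ff) are made up for the example.

```python
import torch
import torch.nn as nn

class ReZeroEncoderLayer(nn.Module):
    """Transformer encoder layer with ReZero residuals: y = x + alpha * F(x).

    LayerNorm is removed and a single learnable scalar alpha, initialized to 0,
    is shared by both sublayers (self-attention and feed-forward) of the layer.
    """
    def __init__(self, d_model: int, nhead: int, dim_ff: int = 2048, dropout: float = 0.1):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, nhead, dropout=dropout, batch_first=True)
        self.ff = nn.Sequential(
            nn.Linear(d_model, dim_ff),
            nn.ReLU(),
            nn.Linear(dim_ff, d_model),
        )
        self.dropout = nn.Dropout(dropout)
        # The ReZero scalar: initialized to zero so the layer starts as the identity map.
        self.alpha = nn.Parameter(torch.zeros(1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Self-attention sublayer: x + alpha * Attn(x)
        attn_out, _ = self.self_attn(x, x, x, need_weights=False)
        x = x + self.alpha * self.dropout(attn_out)
        # Feed-forward sublayer: x + alpha * FFN(x), with the same alpha as above
        x = x + self.alpha * self.dropout(self.ff(x))
        return x
```

Because alpha starts at 0, layer(x) returns x exactly at initialization, so a very deep stack of such layers is well-behaved at the start of training even without layernorm.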
Paper info

Summary