lucidrains / routing-transformer

Fully featured implementation of Routing Transformer
MIT License

Add ReZero and ScaleNorm support #14

Closed · tomweingarten closed this 3 years ago

tomweingarten commented 3 years ago

Also engaging in some poor PR hygiene by fixing a simple bug with the ff_activation parameter.
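
For reference, a minimal sketch of the two techniques being added, following the formulations in the ScaleNorm and ReZero papers (the module names and wiring in the actual PR may differ):

```python
import torch
from torch import nn

# Illustrative sketch only; not the exact modules used in this repo.

class ScaleNorm(nn.Module):
    """Replace LayerNorm with a single learned scale g applied to the l2-normalized vector."""
    def __init__(self, eps=1e-5):
        super().__init__()
        self.g = nn.Parameter(torch.ones(1))
        self.eps = eps

    def forward(self, x):
        norm = x.norm(dim=-1, keepdim=True).clamp(min=self.eps)
        return self.g * x / norm

class ReZero(nn.Module):
    """Wrap a sublayer so its residual branch is gated by a scalar initialized at zero."""
    def __init__(self, fn):
        super().__init__()
        self.fn = fn
        self.g = nn.Parameter(torch.zeros(1))

    def forward(self, x, **kwargs):
        return x + self.g * self.fn(x, **kwargs)
```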

lucidrains commented 3 years ago

lgtm!

lucidrains commented 3 years ago

@tomweingarten what configuration have you had the most luck with? rezero or scalenorm?

tomweingarten commented 3 years ago

@lucidrains I haven't run any long studies yet, but my initial results show ScaleNorm converging faster.

I haven't seen any cases with either diverging. The Adafactor optimizer seems to work very well at keeping both stable.
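
A minimal sketch of that optimizer setup, assuming the Adafactor implementation from the Hugging Face transformers library (other implementations, e.g. fairseq's, take similar arguments):

```python
import torch
from transformers import Adafactor  # pip install transformers

model = torch.nn.Linear(512, 512)  # stand-in for the actual RoutingTransformerLM

# lr=None lets Adafactor use its own relative step-size schedule; warmup_init
# eases the step size in at the start of training, which is where I see the
# residual gates most at risk of diverging.
optimizer = Adafactor(
    model.parameters(),
    lr=None,
    scale_parameter=True,
    relative_step=True,
    warmup_init=True,
)
```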

lucidrains commented 3 years ago

@tomweingarten Yes, I believe I have noticed the same. Last night it came to me that there is a connection between ScaleNorm and https://arxiv.org/abs/2003.07845 , where they relax the zero-mean constraint.

lucidrains commented 3 years ago

@tomweingarten Very interesting regarding ReZero! I have noticed divergence on bigger datasets (Common Crawl), but I shall try it again given your testimony and see whether some more aggressive gradient clipping can fix that.

tomweingarten commented 3 years ago

Looking forward to hearing how it works for you! I'd also recommend either A) using an optimizer with its own adaptive step sizes, like Adafactor, or B) using a separate learning rate for the residual weights (rough sketch below). Otherwise, even with gradient clipping, you can see divergence caused by momentum accumulating over multiple steps.
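
A rough sketch of option B, assuming the ReZero residual gates can be picked out by parameter name ("rezero" here is a hypothetical naming convention; adjust the filter to match how the residual weights are actually named in your model):

```python
import torch

def build_optimizer(model, base_lr=1e-4, gate_lr=1e-5):
    # Put the ReZero residual gates in their own parameter group with a smaller
    # learning rate; everything else trains at the base rate.
    gate_params, other_params = [], []
    for name, param in model.named_parameters():
        (gate_params if 'rezero' in name.lower() else other_params).append(param)
    return torch.optim.Adam([
        {'params': other_params, 'lr': base_lr},
        {'params': gate_params, 'lr': gate_lr},
    ])
```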

lucidrains commented 3 years ago

@tomweingarten yes, you did allude to this different learning rate in some footnote in the rezero paper, i'll reread it tonight. thanks!