titu1994 / keras-one-cycle

Implementation of One-Cycle Learning rate policy (adapted from Fast.ai lib)
MIT License

Why can momentum only be updated with the SGD optimizer? #1

Closed · 0ruben closed 6 years ago

0ruben commented 6 years ago

Hi titu1994,

I'd like to thank you for this work. I was wondering why the momentum update is restricted to the SGD optimizer. I wanted to try it with Adam or AdamW, for example.

Thanks!

titu1994 commented 6 years ago

Adam has an adaptive momentum parameter computed from its "V" and "V_hat" estimates, so changing the momentum manually hurts performance.
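(Editor's note: the contrast here is that with SGD the momentum is just a scalar hyperparameter sitting on the optimizer, so a callback can overwrite it every batch. Below is a minimal sketch of that idea, not the actual callback from this repo, assuming Keras 2.x where `keras.optimizers.SGD` stores `momentum` as a backend variable.)

```python
from keras import backend as K
from keras.callbacks import Callback


class CyclicMomentumSketch(Callback):
    """Illustrative only: linearly cycles SGD's momentum between two bounds."""

    def __init__(self, min_momentum=0.85, max_momentum=0.95, cycle_steps=1000):
        super(CyclicMomentumSketch, self).__init__()
        self.min_momentum = min_momentum
        self.max_momentum = max_momentum
        self.cycle_steps = cycle_steps
        self.step = 0

    def on_batch_end(self, batch, logs=None):
        self.step += 1
        # Triangular schedule: phase goes 1 -> 0 -> 1 over one cycle.
        phase = abs((self.step % self.cycle_steps) / float(self.cycle_steps) - 0.5) * 2.0
        momentum = self.min_momentum + (self.max_momentum - self.min_momentum) * phase
        # This works because keras.optimizers.SGD keeps `momentum` as a backend
        # variable. Adam exposes no such scalar to overwrite: its moment estimates
        # are maintained internally, per parameter, from the gradients.
        K.set_value(self.model.optimizer.momentum, momentum)


# Usage sketch:
# model.compile(optimizer=SGD(lr=0.01, momentum=0.95), loss='categorical_crossentropy')
# model.fit(x, y, callbacks=[CyclicMomentumSketch()])
```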

0ruben commented 6 years ago

From what I've read, the only issue with using this technique with Adam is related to weight decay, so AdamW should be fine with this method: http://www.fast.ai/2018/07/02/adam-weight-decay/ Maybe I am missing something here...
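(Editor's note, for context on that link: plain Adam folds the L2 penalty into the gradient, so the decay term gets rescaled by the adaptive denominator, whereas AdamW (Loshchilov & Hutter) decouples the weight decay from the gradient-based step:)

$$
\theta_t = \theta_{t-1} - \eta \left( \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} + \lambda\, \theta_{t-1} \right)
$$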

titu1994 commented 6 years ago

If you look at the equations for Adam, it keeps, for each parameter, a moving average of the squared gradient, which sits in the denominator of the update. This is the momentum part of Adam (Adaptive Moment Estimation).
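(Editor's note: for reference, the standard Adam update from Kingma & Ba (2015); $v_t$ is the moving average of the squared gradient referred to above, and it appears in the denominator:)

$$
\begin{aligned}
m_t &= \beta_1 m_{t-1} + (1 - \beta_1)\, g_t \\
v_t &= \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2 \\
\hat{m}_t &= \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t} \\
\theta_t &= \theta_{t-1} - \eta\, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}
\end{aligned}
$$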

Since this momentum is adapted on its own, applying a cyclic schedule on top of it causes divergence in training at worst, and simply horrible results at best.