Closed: 0ruben closed this issue 6 years ago
Adam has an adaptive momentum component that it computes itself from its "v" and "v_hat" moving averages, so changing the momentum manually hurts performance.
From what I've read, the only issue with using this technique with Adam is related to the weight decay; AdamW should be fine with this method: http://www.fast.ai/2018/07/02/adam-weight-decay/ Maybe I am missing something here ...
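For reference, here is a minimal NumPy sketch of where the two variants place the weight decay term (the function name, signature, and defaults are mine, not from this repo or the linked post): classic Adam folds the decay into the gradient, so it gets rescaled by the adaptive denominator, while AdamW applies it directly to the weights.

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
              eps=1e-8, wd=1e-2, decoupled=False):
    """One Adam / AdamW step on a single weight array.

    decoupled=False : classic Adam, the decay is folded into the gradient
                      and therefore rescaled by the adaptive denominator.
    decoupled=True  : AdamW, the decay is applied directly to the weights.
    """
    if not decoupled:
        grad = grad + wd * w                     # L2-style decay inside the gradient
    m = beta1 * m + (1 - beta1) * grad           # first moment (momentum-like)
    v = beta2 * v + (1 - beta2) * grad ** 2      # second moment (squared gradients)
    m_hat = m / (1 - beta1 ** t)                 # bias corrections (t starts at 1)
    v_hat = v / (1 - beta2 ** t)
    step = m_hat / (np.sqrt(v_hat) + eps)
    if decoupled:
        step = step + wd * w                     # AdamW: decay decoupled from v_hat
    return w - lr * step, m, v
```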
If you look at the equations for Adam, it keeps a per-parameter exponential moving average of the squared gradients, which sits in the denominator of the update, alongside a momentum-style moving average of the gradients themselves. This is the adaptive moment part of Adam (Adaptive Moment Estimation).
Since this momentum is adapted on its own, applying the cyclic schedule to it as well causes divergence in training at worst, and simply poor results at best.
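To make that concrete, these are the standard Adam update equations from Kingma & Ba; a cyclic momentum schedule would be overriding a quantity that beta_1 and the bias-corrected moments already adapt on their own:

```latex
% Standard Adam update for a parameter \theta with gradient g_t:
m_t       = \beta_1 m_{t-1} + (1 - \beta_1)\, g_t      % first moment: the momentum-like term
v_t       = \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2    % second moment: squared gradients
\hat{m}_t = m_t / (1 - \beta_1^t) \qquad
\hat{v}_t = v_t / (1 - \beta_2^t)                      % bias corrections
\theta_t  = \theta_{t-1} - \alpha\, \hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon)
```

Unlike SGD, there is no single scalar momentum to overwrite each batch: beta_1 is baked into m_t and its bias correction.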
Hi titu1994,
I'd like to thank you for this work. I was wondering why the momentum update is restricted to the SGD optimizer; I wanted to try it with Adam or AdamW, for example.
Thanks!
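For anyone finding this later, here is a hypothetical sketch of how a cyclic momentum schedule is typically wired up for plain SGD in Keras (illustrative only, not this repo's implementation; the class name and defaults are made up). The callback simply overwrites the optimizer's momentum variable each batch, a knob that Adam does not expose.

```python
import numpy as np
from keras import backend as K
from keras.callbacks import Callback

class CyclicMomentum(Callback):
    """Illustrative cyclic-momentum callback for SGD (not this repo's code).

    Triangularly cycles the optimizer's `momentum` variable between base_m
    and max_m. This only works because SGD stores momentum as a single
    backend variable; Adam's momentum-like behaviour lives in the beta_1
    moving average instead, so there is nothing equivalent to set.
    """

    def __init__(self, base_m=0.85, max_m=0.95, step_size=2000):
        super(CyclicMomentum, self).__init__()
        self.base_m = base_m
        self.max_m = max_m
        self.step_size = step_size
        self.iteration = 0

    def on_batch_begin(self, batch, logs=None):
        # Triangular cycle between base_m and max_m over 2 * step_size batches.
        cycle = np.floor(1 + self.iteration / (2 * self.step_size))
        x = np.abs(self.iteration / self.step_size - 2 * cycle + 1)
        m = self.base_m + (self.max_m - self.base_m) * max(0.0, 1 - x)
        K.set_value(self.model.optimizer.momentum, m)
        self.iteration += 1
```

Usage would be something like `model.fit(..., callbacks=[CyclicMomentum()])` with an `SGD(momentum=0.9)` optimizer.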