EBazarov opened this issue 5 years ago
Hi, good idea. We're aware of AdamW; the core modification is tiny, though the updated version that is 'compatible' with SGDR (annealing) is more involved, see https://github.com/pytorch/pytorch/pull/4429#discussion_r248627341
In the meantime, it is recommended to use AMSGrad instead of Adam everywhere, though don't expect better results overall; it is only a fix for some settings, see https://fdlm.github.io/post/amsgrad/
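For reference, switching is usually a one-line change; here is a minimal sketch assuming a PyTorch version where torch.optim.Adam exposes the amsgrad flag (the model and learning rate are placeholders):

```python
import torch

model = torch.nn.Linear(10, 2)  # placeholder model

# The amsgrad flag switches Adam to the AMSGrad variant of the update,
# which keeps a running maximum of the second-moment estimate.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, amsgrad=True)
```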
As a reminder, SGDR is also implemented; see #377. Since SGDR automatically schedules the learning rate, you may not actually need Adam, though training may take longer on average due to the annealing cycles.
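As an illustration of the SGDR-style schedule (this is not the #377 implementation, just a hedged sketch using PyTorch's CosineAnnealingWarmRestarts with plain SGD; the model, data, and cycle lengths are placeholders):

```python
import torch

model = torch.nn.Linear(10, 2)                  # placeholder model
x, y = torch.randn(32, 10), torch.randn(32, 2)  # placeholder data

optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

# SGDR: cosine annealing with warm restarts. T_0 is the length of the
# first cycle in scheduler steps; T_mult stretches each following cycle.
scheduler = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(
    optimizer, T_0=10, T_mult=2)

for epoch in range(70):
    optimizer.zero_grad()
    loss = torch.nn.functional.mse_loss(model(x), y)
    loss.backward()
    optimizer.step()
    scheduler.step()  # advance the annealing/restart schedule once per epoch
```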
We should use weight decay with Adam (they call it AdamW), not the L2 regularization that classic deep learning libraries implement. As soon as we add momentum, or use a more sophisticated optimizer like Adam, L2 regularization and weight decay are no longer the same thing, whereas they are equivalent for vanilla SGD. A detailed explanation can be found here: https://www.fast.ai/2018/07/02/adam-weight-decay/#adamw
And the paper is here: https://arxiv.org/pdf/1711.05101.pdf
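For what it's worth, recent PyTorch versions expose both behaviours, which makes the difference easy to see; a small sketch assuming a version that ships torch.optim.AdamW (the hyperparameters are placeholders):

```python
import torch

model = torch.nn.Linear(10, 2)  # placeholder model

# L2 regularization: weight_decay is added to the gradient (grad += wd * w),
# so the penalty then passes through Adam's per-parameter adaptive scaling.
adam_l2 = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-2)

# Decoupled weight decay (AdamW): the weights are shrunk directly by
# roughly lr * wd * w, independently of the adaptive gradient statistics.
adamw = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)
```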