jolibrain / deepdetect

Deep Learning API and Server in C++14, with support for Caffe, PyTorch, TensorRT, Dlib, NCNN, Tensorflow, XGBoost and TSNE
https://www.deepdetect.com/

Support for AdamW gradient method #541

Open EBazarov opened 5 years ago

EBazarov commented 5 years ago

We should use weight decay with Adam (the variant they call AdamW), not the L2 regularization that classic deep learning libraries implement. With vanilla SGD the two are equivalent, but as soon as we add momentum, or use a more sophisticated optimizer like Adam, L2 regularization and weight decay become different. A detailed explanation can be found here: https://www.fast.ai/2018/07/02/adam-weight-decay/#adamw

And the paper here: https://arxiv.org/pdf/1711.05101.pdf
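For illustration, here is a minimal sketch (plain Python/NumPy, not DeepDetect code) contrasting the two approaches in an Adam-style update, in the spirit of the paper above; the hyper-parameter names and the `adam_step` helper are illustrative, and the schedule multiplier from the paper is omitted:

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8,
              wd=1e-2, decoupled=True):
    """One simplified Adam/AdamW step; w, grad, m, v are NumPy arrays, t >= 1."""
    if not decoupled:
        # Classic "L2 regularization": the decay term enters the gradient and is
        # therefore rescaled by the adaptive moment estimates below.
        grad = grad + wd * w
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad**2
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    if decoupled:
        # AdamW: weight decay applied directly to the weights, outside the
        # adaptive update, so every weight decays at the same relative rate.
        w = w - lr * wd * w
    return w, m, v
```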

beniz commented 5 years ago

Hi, good idea. We're aware of AdamW; the main modification is tiny, though the updated version that is 'compatible' with SGDR (annealing) is more complicated, see https://github.com/pytorch/pytorch/pull/4429#discussion_r248627341

In the meantime, it is recommended to use AMSGRAD instead of ADAM everywhere, though don't expect better results overall; it is just a fix for some settings, see https://fdlm.github.io/post/amsgrad/
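For reference, a minimal sketch of the AMSGrad modification (again plain Python/NumPy, not DeepDetect code, and simplified with respect to library implementations): the only change over Adam is that the denominator uses the running maximum of the second-moment estimate, so the effective step size never increases.

```python
import numpy as np

def amsgrad_step(w, grad, m, v, v_max, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One simplified AMSGrad step; w, grad, m, v, v_max are NumPy arrays, t >= 1."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad**2
    v_max = np.maximum(v_max, v)          # the only change with respect to Adam
    m_hat = m / (1 - beta1**t)
    w = w - lr * m_hat / (np.sqrt(v_max) + eps)
    return w, m, v, v_max
```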

As a reminder, SGDR is also implemented, see #377. Since SGDR schedules the learning rate automatically, you may not actually need ADAM, though training may take longer on average due to the annealing cycles.
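For intuition on what SGDR does to the learning rate, here is a small sketch of the schedule (cosine annealing with warm restarts); the function name and parameters are illustrative, not the DeepDetect API:

```python
import math

def sgdr_lr(epoch, lr_max=0.1, lr_min=1e-4, cycle_len=10, cycle_mult=2):
    """Learning rate at a given epoch under cosine annealing with warm restarts."""
    # Find the cycle containing `epoch`; cycles grow by `cycle_mult` after each restart.
    start, length = 0, cycle_len
    while epoch >= start + length:
        start += length
        length *= cycle_mult
    t = (epoch - start) / length          # position inside the current cycle, in [0, 1)
    # Half-cosine decay from lr_max back down to lr_min, reset at each restart.
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * t))
```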