juntang-zhuang / Adabelief-Optimizer

Repository for NeurIPS 2020 Spotlight "AdaBelief Optimizer: Adapting stepsizes by the belief in observed gradients"

degenerated_to_sgd hyperparameter -- background and recommendations? #25

Closed · evanatyourservice closed this issue 3 years ago

evanatyourservice commented 3 years ago

Hello, and great work! I was wondering about the `degenerated_to_sgd` hyperparameter. Can you explain the background behind it, and maybe point to a paper about it if there is one? Also, would you say the recommendations for when to use it are similar to those for `rectify`? If not, when do you think it should be used (beneficial all the time, or only some of the time)?

juntang-zhuang commented 3 years ago

Hi, `rectify` and `degenerated_to_sgd` are the same as in RAdam, proposed in the paper "On the Variance of the Adaptive Learning Rate and Beyond". If `rectify=False`, rectification is turned off and it does not matter whether `degenerated_to_sgd` is True or False. If `rectify=True`, I have only tested with `degenerated_to_sgd=True`, since that is what the RAdam paper recommends (so I typically set `degenerated_to_sgd=True` and play with `rectify=True` or `False`). I'm not sure there is a general principle for when to turn rectification on. My experience is that for experiments where Adam significantly outperforms SGD, such as SN-GAN, and where the model is trained for a long time, rectification helps (but I don't have a good intuition for why).
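For background on what the flag actually does: `degenerated_to_sgd` controls RAdam's behavior in the early steps, when too few gradients have been observed for the variance of the adaptive learning rate to be tractable. If it is True, the optimizer falls back to a plain SGD-with-momentum step for those iterations; if False, it takes no scaled step at all. Here is a minimal sketch of that branching, following the RAdam paper rather than this repo's exact code (the function name and return convention are illustrative assumptions):

```python
import math

def radam_step_scale(step, beta2, lr, degenerated_to_sgd=True):
    """Illustrative RAdam rectification logic, not the repo's exact code.

    Returns (step-size multiplier, whether to apply second-moment scaling)
    for update step `step` (1-indexed).
    """
    beta2_t = beta2 ** step
    rho_inf = 2.0 / (1.0 - beta2) - 1.0                      # max SMA length
    rho_t = rho_inf - 2.0 * step * beta2_t / (1.0 - beta2_t)

    if rho_t > 4.0:
        # Variance of the adaptive lr is tractable: apply the RAdam
        # rectification term and use the adaptive (Adam-style) step.
        rect = math.sqrt(
            (rho_t - 4.0) * (rho_t - 2.0) * rho_inf
            / ((rho_inf - 4.0) * (rho_inf - 2.0) * rho_t)
        )
        return lr * rect, True
    elif degenerated_to_sgd:
        # Too few samples to estimate the variance: degenerate to an
        # SGD-with-momentum step (no second-moment scaling).
        return lr, False
    else:
        # degenerated_to_sgd=False: skip the update entirely at this step.
        return 0.0, False
```

In the `adabelief_pytorch` package both knobs are constructor flags, so the setup described above would look something like this (argument names as in the released package, values illustrative):

```python
from adabelief_pytorch import AdaBelief

# rectify=True with degenerated_to_sgd=True is the RAdam-recommended pairing
optimizer = AdaBelief(model.parameters(), lr=1e-3,
                      rectify=True, degenerated_to_sgd=True)
```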

evanatyourservice commented 3 years ago

Oh okay, thank you!