juntang-zhuang / Adabelief-Optimizer

Repository for NeurIPS 2020 Spotlight "AdaBelief Optimizer: Adapting stepsizes by the belief in observed gradients"

Is extra epsilon more important than belief? #23

Closed — yasutoshi closed this issue 3 years ago

yasutoshi commented 3 years ago

Hello,

congratulations on being accepted to NeurIPS and thank you for sharing the code. I'm enjoying playing with this code.

I found that the arXiv paper has been updated from v1 to v2. In v2, an extra epsilon has been added in the bias-correction step.

I removed the extra epsilon from this code to investigate the effect of the "belief" term alone.

https://gist.github.com/yasutoshi/39f1b74af9bc0cf504fa678917383ef8#file-adabelief_noepsilon-py-L161
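Concretely, the change I'm testing boils down to the denominator computation. A minimal sketch of the two variants (simplified; the gist above has the full optimizer, and the variable names here just follow PyTorch-style conventions):

```python
import math
import torch

def denom_with_extra_eps(exp_avg_var, eps, bias_correction2):
    # arXiv v2 / released-code style: eps is added inside the sqrt and again outside
    return (exp_avg_var.add(eps).sqrt() / math.sqrt(bias_correction2)).add_(eps)

def denom_without_extra_eps(exp_avg_var, eps, bias_correction2):
    # the variant in my gist: drop the extra eps inside the sqrt, keep only the outer one
    return (exp_avg_var.sqrt() / math.sqrt(bias_correction2)).add_(eps)

# the step direction is then exp_avg / denom (scaled by the bias-corrected learning rate)
```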

As a result, Adam and AdaBelief reached about the same accuracy in the experiment on CIFAR-10 with ResNet.

Does this mean that the performance improvement of AdaBelief comes from the extra epsilon rather than from the belief term?

I would be grateful if you could tell me if I was wrong.

juntang-zhuang commented 3 years ago

Hi, it’s hard to say which is more important. In the CIFAR experiments, eps is set to a very large value, 1e-8, versus the default 1e-16, so removing it makes a big difference. But in other experiments, such as GAN and Transformer, which use the default 1e-16, it should not make a big difference. Also, when you remove the eps inside the sqrt, you should add a bigger eps (roughly the eps after the sqrt) outside the sqrt, which is roughly 1e-4.
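Roughly speaking (a small illustration of the point above, not the library code), with the variance estimate near zero the denominators compare like this:

```python
import math

eps_inside = 1e-8                                   # value used in the CIFAR experiments
eps_outside_equivalent = math.sqrt(eps_inside)      # ~1e-4, the "eps after the sqrt"

s_t = 0.0                                           # belief / variance estimate near zero
denom_v2     = math.sqrt(s_t + eps_inside)               # ~1e-4
denom_scaled = math.sqrt(s_t) + eps_outside_equivalent   # ~1e-4, comparable behaviour
denom_naive  = math.sqrt(s_t) + eps_inside                # ~1e-8, a much smaller denominator
print(denom_v2, denom_scaled, denom_naive)          # -> roughly 1e-4, 1e-4, 1e-8
```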

It’s hard to say which factor is the most important, because many things are involved in one training run, such as eps, beta, the learning-rate schedule, and the centered second momentum. Setting any one hyperparameter to a bad value will cause bad results, but that does not mean it is more important than the other parameters.

Btw, in arXiv v1 we omitted the bias-correction term due to the page limit for the NeurIPS submission; the entire algorithm is in the appendix and is unchanged.

Update: it just occurred to me that we tested AdaBelief and Adam under different eps and learning rates; see Figures 4 and 5 in the appendix. AdaBelief is robust to different eps in the CIFAR classification task.

yasutoshi commented 3 years ago

Thank you for the response!

> Also, when you remove the eps inside the sqrt, you should add a bigger eps (roughly the eps after the sqrt) outside the sqrt, which is roughly 1e-4.

> Update: it just occurred to me that we tested AdaBelief and Adam under different eps and learning rates; see Figures 4 and 5 in the appendix. AdaBelief is robust to different eps in the CIFAR classification task.

Does this mean that it is robust to different eps when the extra eps inside the sqrt is present, but not robust when it is absent (as in the results I shared)?

In other words, does the extra eps contribute to the robustness to the choice of eps?

> But in other experiments, such as GAN and Transformer, which use the default 1e-16, it should not make a big difference.

I see. I will try GAN without the extra eps and share the result. (So it would be nice if you could reopen the issue.)

> Btw, in arXiv v1 we omitted the bias-correction term due to the page limit for the NeurIPS submission; the entire algorithm is in the appendix and is unchanged.

I'm sorry, I missed that.

Btw, I proposed SDProp (IJCAI 2017), which uses an idea similar to AdaBelief.

https://www.ijcai.org/Proceedings/2017/0267.pdf

I think the difference is that SDProp divides the gradient by the belief term, while AdaBelief divides the momentum by it.
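A rough sketch of what I mean (notation simplified; bias correction, the eps placement, and the exact decay factors used in each paper are omitted):

```python
def sdprop_step(p, g, m, s, lr, beta1, beta2):
    # SDProp (simplified): centered second moment of the gradient,
    # with the raw gradient in the numerator.
    m = beta1 * m + (1 - beta1) * g
    s = beta2 * s + (1 - beta2) * (g - m) ** 2
    return p - lr * g / s ** 0.5, m, s

def adabelief_step(p, g, m, s, lr, beta1, beta2):
    # AdaBelief (simplified): same centered second moment,
    # but the momentum m in the numerator.
    m = beta1 * m + (1 - beta1) * g
    s = beta2 * s + (1 - beta2) * (g - m) ** 2
    return p - lr * m / s ** 0.5, m, s
```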

[Image: the SDProp update rule from the IJCAI 2017 paper]

I would be glad if you would cite our paper.

(Since SDProp also improved its performance by adding the extra eps inside the sqrt, I thought the role of the extra eps might be important, which is why I opened this issue.)

juntang-zhuang commented 3 years ago

> Does this mean that it is robust to different eps when the extra eps inside the sqrt is present, but not robust when it is absent (as in the results I shared)?

I'm not sure we can draw that conclusion. Even with the extra experiments in the appendix, the range of eps is not very large (1e-4 to 1e-9, while the default is 1e-8).

If we only use an eps outside the sqrt, the corresponding range would be 1e-2 to 1e-4.5, which is larger than the default. Using only eps = 1e-8 outside the sqrt would be far outside the 1e-2 ~ 1e-4.5 range in the paper. Perhaps when experimenting with eps_in_sqrt and eps_out_sqrt, we should use different log scales for a fair comparison. But eps is definitely an important parameter for training.
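For concreteness, here is the scale mapping I have in mind (an eps inside the sqrt behaves roughly like sqrt(eps) outside it when the variance estimate is near zero):

```python
import math

for eps_in in (1e-4, 1e-8, 1e-9):
    print(f"eps inside sqrt = {eps_in:g} -> equivalent eps outside ~ {math.sqrt(eps_in):.3g}")
# prints, roughly:
# eps inside sqrt = 0.0001 -> equivalent eps outside ~ 0.01
# eps inside sqrt = 1e-08 -> equivalent eps outside ~ 0.0001
# eps inside sqrt = 1e-09 -> equivalent eps outside ~ 3.16e-05   (i.e. about 10**-4.5)
```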

Sorry, we were not aware of SDProp before the submission. I'll add a reference to SDProp in the next update of the arXiv paper; thanks for pointing it out.

juntang-zhuang commented 3 years ago

@yasutoshi Added the reference in the arXiv version; it should be released soon.