juntang-zhuang / Adabelief-Optimizer

Repository for NeurIPS 2020 Spotlight "AdaBelief Optimizer: Adapting stepsizes by the belief in observed gradients"
BSD 2-Clause "Simplified" License

Update ImageNet default weight decay #30

Closed: arrufat closed this 3 years ago

arrufat commented 3 years ago

Hi, thank you for publishing your work!

I was implementing this optimizer in my dlib branch and I ran some tests on a dataset very similar to ImageNet. I used the default parameters listed here, and the loss was decreasing really slowly.

Then I noticed there is a discrepancy between the default weight decay in the README (1e-2) and the one in the paper (1e-4), so I used 1e-4 and it worked. Is this a typo?

juntang-zhuang commented 3 years ago

I think it’s because decoupled weight decay is turned on for ImageNet. I'm pretty sure the weight decay is 1e-2 for ImageNet. Which version of PyTorch are you using?
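For reference, a minimal usage sketch with the recommended ImageNet settings and decoupled decay turned on; the `adabelief_pytorch` package name and the `weight_decouple` / `weight_decay` arguments are taken from the README, so treat them as assumptions if your local version differs:

```python
# Usage sketch of the ImageNet settings discussed above; package and argument
# names are assumed from the README, not verified against every release.
import torch
from adabelief_pytorch import AdaBelief

model = torch.nn.Linear(10, 2)  # placeholder model, just for illustration

optimizer = AdaBelief(
    model.parameters(),
    lr=1e-3,
    betas=(0.9, 0.999),
    eps=1e-8,               # eps suggested for ImageNet in the README
    weight_decay=1e-2,      # the default being discussed here
    weight_decouple=True,   # decoupled (AdamW-style) decay, not added to the gradient
)
```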

arrufat commented 3 years ago

Hi, thank you for the quick reply. I've re-run the experiment, and I am still facing a non-convergence issue. I am not using PyTorch; I implemented the AdaBelief optimizer in dlib.

But maybe I missed something; I'll keep digging.

juntang-zhuang commented 3 years ago

Thanks for the reply. I quickly skimmed over your implementation; a few differences might be the cause:

1. eps is actually used twice in the PyTorch version: once inside the sqrt and once outside the sqrt of vt in the denominator.
2. eps is added to vt with an in-place operation, so besides the normal update of vt, an extra eps is accumulated every step (after the N-th step, N eps have been added).
3. The weight decay in your implementation does not seem to be decoupled. If I understand correctly, weight decay is applied to the gradient, and that gradient is then used in both the numerator and the denominator. Decoupled decay should be performed outside the gradient update, by simply multiplying the weights by 1 - weight_decay x lr.
4. For decoupled weight decay, the actual decay is weight_decay x lr, and Ada-optimizers use lr=1e-3, much smaller than the 1e-1 typical for SGD; that's why the decoupled weight decay looks so large for Ada-optimizers.

Hopefully these help with your experiments; a rough sketch of the update with these points applied is below. Please leave a note here if you have new ideas. Thanks again for trying out AdaBelief.
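To make those points concrete, here is a rough NumPy sketch of a single AdaBelief step with decoupled decay. Variable names are illustrative, and it omits rectification and other details of the actual PyTorch code:

```python
# Minimal NumPy sketch of one AdaBelief step, illustrating points (1)-(4) above.
# Not the repo's exact code; names and defaults are illustrative.
import numpy as np

def adabelief_step(param, grad, m, s, step,
                   lr=1e-3, beta1=0.9, beta2=0.999,
                   eps=1e-8, weight_decay=1e-2, weight_decouple=True):
    if weight_decouple:
        # (3)/(4) decoupled decay: shrink the weights directly by (1 - lr * weight_decay)
        param = param * (1 - lr * weight_decay)
    else:
        # non-decoupled decay: folded into the gradient, which then enters
        # both the numerator and the denominator below
        grad = grad + weight_decay * param

    # first moment, same as Adam
    m = beta1 * m + (1 - beta1) * grad
    # second moment tracks the "belief": squared deviation of grad from m;
    # (2) eps is added into the state each step, so after N steps N eps accumulate
    s = beta2 * s + (1 - beta2) * (grad - m) ** 2 + eps

    # bias corrections (step counts from 1)
    m_hat = m / (1 - beta1 ** step)
    s_hat = s / (1 - beta2 ** step)

    # (1) eps appears a second time, outside the sqrt, in the denominator
    param = param - lr * m_hat / (np.sqrt(s_hat) + eps)
    return param, m, s
```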

arrufat commented 3 years ago

Thank you for reaching out again and giving me those suggestions. I totally missed the fact that the epsilon is used twice (I did not notice it was blue-colored in the paper...). I will check the other points and re-run some experiments (it might take a while).

Thank you... I will close the PR, since it no longer makes sense.