juntang-zhuang / Adabelief-Optimizer

Repository for NeurIPS 2020 Spotlight "AdaBelief Optimizer: Adapting stepsizes by the belief in observed gradients"
BSD 2-Clause "Simplified" License

Do the weight decay before using grad #33

Closed vpj closed 3 years ago

vpj commented 3 years ago

Uncoupled weight decay was applied after the gradient had already been used to calculate the momentum and variance. Fixed it. Found this while writing a tutorial implementation.

juntang-zhuang commented 3 years ago

What’s the difference between doing weight decay before or after the gradient is used? I think they are equivalent: applying the decay before the gradient update in step 2 is equivalent to applying it after the gradient update in step 1.

juntang-zhuang commented 3 years ago

Thanks for the implementation. Quickly skimming through your code, the eps is used differently from our implementation. Not sure how much difference it would cause.

vpj commented 3 years ago

@juntang-zhuang grad_residual is computed before uncoupled weight decay changes the gradient https://github.com/juntang-zhuang/Adabelief-Optimizer/blob/bdb1c313ee7421c9ae526dd1693ac7b2522d25ce/pypi_packages/adabelief_pytorch0.1.0/adabelief_pytorch/AdaBelief.py#L164

Uncoupled weight decay has no effect (won't work) if it's done after calculating grad_residual.

vpj commented 3 years ago

The tutorial has two options for how to use eps. One uses the optimization of calculating the step size from scalars first, before multiplying and dividing by the momentum and variance tensors. https://lab-ml.com/labml_nn/optimizers/radam.html#section-22

The other first calculates the denominator with epsilon, which I think is equivalent to yours. https://lab-ml.com/labml_nn/optimizers/radam.html#section-25
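
Roughly, the two variants look like this (a minimal sketch with illustrative names like `m`, `v`, and `param`; these are assumptions, not the tutorial's actual identifiers):

```python
import math
import torch

# Illustrative setup; not the tutorial's actual code.
lr, eps, beta1, beta2, t = 1e-3, 1e-8, 0.9, 0.999, 10
param = torch.zeros(4)
m, v = torch.zeros(4), torch.ones(4)  # first and second moment estimates

# Variant 1: fold the bias-correction scalars into the step size, then add eps
# to the un-bias-corrected denominator.
step_size = lr * math.sqrt(1 - beta2 ** t) / (1 - beta1 ** t)
denom = v.sqrt().add_(eps)
param.addcdiv_(m, denom, value=-step_size)

# Variant 2: bias-correct the moment tensors first, then add eps to the denominator.
m_hat = m / (1 - beta1 ** t)
v_hat = v / (1 - beta2 ** t)
param -= lr * m_hat / (v_hat.sqrt() + eps)

# The two differ only in how eps interacts with the bias-correction factors.
```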

juntang-zhuang commented 3 years ago

@vpj Thanks for answering. Regarding the eps: eps is actually used twice in our algorithm in each step; it appears both inside and outside the sqrt of s_t. Please see the updated readme or the paper on arXiv, where a comparison of the Adam and AdaBelief algorithms is shown with the differences highlighted in blue (there are two differences). Though I'm not sure how much difference the extra eps will cause. It seems your code only uses eps once.
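
Roughly, one step in the paper's algorithm looks like this (a sketch with illustrative names and without bias correction, not the package's exact code):

```python
import torch

lr, eps, beta1, beta2 = 1e-3, 1e-8, 0.9, 0.999
param = torch.zeros(4)
grad = torch.randn(4)
exp_avg, exp_avg_var = torch.zeros(4), torch.zeros(4)  # m_{t-1}, s_{t-1}

exp_avg = beta1 * exp_avg + (1 - beta1) * grad                              # m_t
grad_residual = grad - exp_avg                                              # g_t - m_t
exp_avg_var = beta2 * exp_avg_var + (1 - beta2) * grad_residual ** 2 + eps  # first eps, inside s_t
param -= lr * exp_avg / (exp_avg_var.sqrt() + eps)                          # second eps, in the denominator
```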

Regarding the decoupled weight decay, I still don't understand why you said it has no effect after calculating grad_residual. Decoupled weight decay basically multiplies the weight by a constant factor smaller than 1; it's not related to the gradient. I don't think there will be such a big difference if you consider the optimization process as "update - rescale - update - rescale ..." vs. "rescale - update - rescale - update ...", where by "update" I mean updating with the gradient only, not rescaling the weight.
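
A tiny sketch of the two orderings (illustrative names; `step` stands in for the adaptive gradient update):

```python
import torch

lr, weight_decay = 1e-3, 1e-2
param = torch.ones(4)
step = torch.randn(4)  # stands in for m_t / (sqrt(s_t) + eps)

# "update - rescale": decay applied after this step's gradient update
param -= lr * step                 # update, using the gradient only
param *= 1 - lr * weight_decay     # rescale: decoupled weight decay

# "rescale - update": decay applied before the next step's gradient update
param *= 1 - lr * weight_decay
param -= lr * step

# Interleaved over many steps, the two orderings give the same sequence of
# operations up to the very first and very last one.
```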

vpj commented 3 years ago

@juntang-zhuang Thanks. Sorry, I hadn't noticed the two uses of epsilon; I will change my code.

About the weight decay: again, my bad. I had been referring to coupled weight decay, grad.add_(p.data, alpha=group['weight_decay']), where the gradient itself is changed. But grad_residual and exp_avg are calculated before that.
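
In other words, the coupled decay needs to modify the gradient before the moments are formed, roughly like this (a sketch with illustrative names, not the exact lines of AdaBelief.py):

```python
import torch

weight_decay, beta1, beta2 = 1e-2, 0.9, 0.999
param = torch.ones(4)
grad = torch.randn(4)
exp_avg, exp_avg_var = torch.zeros(4), torch.zeros(4)

grad = grad.add(param, alpha=weight_decay)          # coupled weight decay first
exp_avg.mul_(beta1).add_(grad, alpha=1 - beta1)     # m_t sees the decayed gradient
grad_residual = grad - exp_avg                      # g_t - m_t
exp_avg_var.mul_(beta2).addcmul_(grad_residual, grad_residual, value=1 - beta2)  # s_t
```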

juntang-zhuang commented 3 years ago

@vpj Thanks for the clarification. I see, that's an error with the coupled weight decay in the new version of the code; thanks for pointing it out. Will correct it in the next release.

vpj commented 3 years ago

Awesome. While fixing my code I noticed there might be an issue here: https://github.com/juntang-zhuang/Adabelief-Optimizer/blob/bdb1c313ee7421c9ae526dd1693ac7b2522d25ce/pypi_packages/adabelief_pytorch0.1.0/adabelief_pytorch/AdaBelief.py#L175

The epsilon is added in place and assigned back to exp_avg_var, which is not the expected behavior.
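
Roughly, the distinction is (an illustrative sketch, not the exact expression in AdaBelief.py):

```python
import torch

eps = 1e-8
exp_avg_var = torch.ones(4)  # optimizer state s_t

# In place: the eps added for the denominator is written back into the optimizer
# state, so it is carried over and added again on every subsequent step.
denom = exp_avg_var.add_(eps).sqrt().add_(eps)

# Out of place: eps only affects this step's denominator; the stored state is untouched.
denom = (exp_avg_var + eps).sqrt().add_(eps)
```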

vpj commented 3 years ago

Fixed my code https://lab-ml.com/labml_nn/optimizers/ada_belief.html#section-24 thanks!

juntang-zhuang commented 3 years ago

@vpj Thanks a lot !

vpj commented 3 years ago

I saw this in the previous version also. Since this gets accumulated, wouldn't it cause a significant numerical difference?

juntang-zhuang commented 3 years ago

It does cause a numerical difference, but the in-place version is the one tested in our experiments and the non-in-place version is not. That's why we prefer to keep the current version unless many experiments show the non-in-place version helps.

juntang-zhuang commented 3 years ago

@vpj Fixed the coupled weight decay in adabelief-pytorch==0.2.0; it can now be installed via pip. Thanks a lot.