juntang-zhuang / Adabelief-Optimizer

Repository for NeurIPS 2020 Spotlight "AdaBelief Optimizer: Adapting stepsizes by the belief in observed gradients"
BSD 2-Clause "Simplified" License

Epsilon is important to Adaptive Optimizer #24

Closed · yuanwei2019 closed this 3 years ago

yuanwei2019 commented 3 years ago

Hi~ https://github.com/juntang-zhuang/Adabelief-Optimizer/issues/18#issue-729329117 Since I asked you a question last time, I have run a series of experiments. I think both ways of determining the descent step size are plausible, whether it is based on the variance of the gradient or on the square of the gradient. I found that if epsilon's position is changed, results similar to AdaBelief can be achieved. I did some experiments and analysis and put them in https://github.com/yuanwei2019/EAdam-optimizer. A sketch of the different epsilon placements is below.
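To make the difference concrete, here is a minimal NumPy sketch of where eps enters the three denominators as I understand them. Bias correction, and the extra eps that the official AdaBelief code also adds outside the square root, are omitted for brevity; the 1-D toy setup and variable names are only illustrative, not the actual implementations.

```python
# Toy 1-D sketch of where eps enters the denominator in Adam, EAdam, and AdaBelief.
# Assumes EAdam's change is to accumulate eps inside the second moment every step.
import numpy as np

beta1, beta2, eps, lr = 0.9, 0.999, 1e-8, 1e-3
m = v_adam = v_eadam = s_belief = 0.0
rng = np.random.default_rng(0)

for t in range(1, 1001):
    g = rng.normal(loc=0.0, scale=1e-3)          # toy gradient sample
    m = beta1 * m + (1 - beta1) * g

    # Adam: eps added once, outside the square root
    v_adam = beta2 * v_adam + (1 - beta2) * g**2
    denom_adam = np.sqrt(v_adam) + eps

    # EAdam: eps added into v_t at every step, so it accumulates before the sqrt
    v_eadam = beta2 * v_eadam + (1 - beta2) * g**2 + eps
    denom_eadam = np.sqrt(v_eadam)

    # AdaBelief: eps added into the centered second moment s_t at every step
    s_belief = beta2 * s_belief + (1 - beta2) * (g - m)**2 + eps
    denom_belief = np.sqrt(s_belief)

# Compare the effective denominators after many steps
print(denom_adam, denom_eadam, denom_belief)
```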

juntang-zhuang commented 3 years ago

Thanks a lot for the nice experiments. This could point to a new direction I have not extensively explored.

It’s possible that the gradient has a mean very close to 0 (perhaps BatchNorm does some centralization of the gradient); in that case the second moment is dominated by the variance, and both ways of treating eps are similar. Could you try EAdam on SN-GAN? Perhaps the gradient in that case does not have zero mean (I’m not sure, just a guess). A quick numerical check of this point and the next is sketched after the three points below.

It’s also possible that eps is large compared to g_t^2, so the denominator is dominated by eps.

The third possible reason is that s_t and v_t are truly bounded below after adding eps, which matches the assumption in the theoretical proof.
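Here is a rough numerical sketch of the first two points, with toy numbers I made up rather than values from any experiment:

```python
# Quick numerical check of the first two points above (toy numbers only).
import numpy as np

rng = np.random.default_rng(0)

# Point 1: if the gradient mean is ~0, then E[g^2] = Var(g) + mean^2 ~ Var(g),
# so the uncentered second moment (Adam/EAdam) and the centered one (AdaBelief)
# are nearly the same quantity.
g = rng.normal(loc=0.0, scale=0.1, size=100_000)
print(np.mean(g**2), np.var(g))        # ~0.01 vs ~0.01: nearly identical

# With a clearly nonzero gradient mean, the two quantities separate.
g = rng.normal(loc=0.5, scale=0.1, size=100_000)
print(np.mean(g**2), np.var(g))        # ~0.26 vs ~0.01: very different

# Point 2: if eps is accumulated into v_t every step, its steady-state
# contribution is roughly eps / (1 - beta2), which can dominate small gradients.
beta2, eps = 0.999, 1e-8
print(eps / (1 - beta2))               # 1e-5, larger than g^2 ~ 1e-6 when g ~ 1e-3
```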

Thanks again for the nice experiments; this could be an important supplement to the paper. Due to limited GPU resources, I was not able to run on large datasets such as COCO, so it’s very nice that you reported new results. Good to see that AdaBelief and EAdam outperform the others in more experiments.