juntang-zhuang / Adabelief-Optimizer

Repository for NeurIPS 2020 Spotlight "AdaBelief Optimizer: Adapting stepsizes by the belief in observed gradients"
BSD 2-Clause "Simplified" License

recommended experiments #21

Closed: dvolgyes closed this issue 3 years ago

dvolgyes commented 3 years ago

Hi,

There is an obvious question that I think would be nice to address in the final presentation, paper, etc. In "A quick look at the algorithm", the "belief" part of AdaBelief comes from the g_t^2 -> (g_t - m_t)^2 modification. However, m_t can contain quite a large part of g_t, depending on the momentum weight (beta_1). Wouldn't it be more effective to use m_{t-1} instead? In most cases, with large momentum, the difference is probably marginal, but there are three obvious outcomes, and it would improve the paper to identify which one applies. The effect of using m_{t-1} is:

1. marginal,
2. it makes the optimizer more effective, or
3. it makes the optimizer less effective.
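To make the two options concrete, here is a minimal sketch (my own notation, not the repository's implementation) of the two variance ("belief") updates for a single scalar gradient, ignoring eps, bias correction, and weight decay:

```python
def belief_updates(g, m_prev, s_prev, beta1=0.9, beta2=0.999):
    """One step of the two competing 'belief' updates for a scalar gradient g."""
    m = beta1 * m_prev + (1 - beta1) * g
    # current AdaBelief: deviation of g_t from the *updated* momentum m_t
    s_current = beta2 * s_prev + (1 - beta2) * (g - m) ** 2
    # proposed variant: deviation of g_t from the *previous* momentum m_{t-1}
    s_variant = beta2 * s_prev + (1 - beta2) * (g - m_prev) ** 2
    return m, s_current, s_variant
```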

I think this experiment would be trivial to run if you already have the pipeline from the paper.

As another improvement, it would be nice to compare a few different beta_2 values. The momentum for the s/v term (0.999) is quite a high default. Since AdaBelief scales the step size in a smarter way than Adam, maybe a smaller beta_2 would let it react and adapt faster than Adam. E.g. plotting some demos with 0.999, 0.99, and 0.95 would be nice. My theory is that AdaBelief would be even more effective with a smaller beta_2 (i.e. the optimal beta_2 is not the same for Adam and AdaBelief).
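A rough sketch of such a sweep is below. It assumes the pip package `adabelief-pytorch` from this repository and its PyTorch-style constructor with a `betas=(beta_1, beta_2)` argument; the toy model and synthetic data are placeholders for the real training pipeline.

```python
import torch
from adabelief_pytorch import AdaBelief  # assumes `pip install adabelief-pytorch`

def train_with_beta2(beta2, steps=1000):
    """Train a toy regression model with a given beta_2 and return the final loss."""
    torch.manual_seed(0)
    model = torch.nn.Linear(10, 1)
    optimizer = AdaBelief(model.parameters(), lr=1e-3, betas=(0.9, beta2))
    for _ in range(steps):
        x = torch.randn(64, 10)
        y = x.sum(dim=1, keepdim=True)  # simple synthetic target
        loss = torch.nn.functional.mse_loss(model(x), y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return loss.item()

for beta2 in (0.999, 0.99, 0.95):
    print(f"beta2={beta2}: final loss {train_with_beta2(beta2):.4f}")
```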

juntang-zhuang commented 3 years ago

Hi, thanks for the comments, these are very good points.

For the first question, I think extra experiments with different learning rates might answer it. Note that m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t, so m_t - g_t = \beta_1 ( m_{t-1} - g_t ) and hence (g_t - m_t)^2 = \beta_1^2 (g_t - m_{t-1})^2. This implies that switching to g_t - m_{t-1} only rescales s_t by 1 / \beta_1^2, which is equivalent to rescaling the learning rate by \beta_1 (ignoring \epsilon and bias correction). Some experiments with different learning rates are in the appendix, though only on the CIFAR10 dataset; it seems different learning rates do not generate a significantly different result. But I have not tested on more examples.
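A quick toy check of this algebra (a sketch only, with \epsilon and bias correction ignored) confirms that the g_t - m_{t-1} variant produces exactly \beta_1 times the original update at every step:

```python
import numpy as np

rng = np.random.default_rng(0)
beta1, beta2, lr = 0.9, 0.999, 1e-3

m = 0.0
s_orig = 0.0     # accumulates (g_t - m_t)^2
s_variant = 0.0  # accumulates (g_t - m_{t-1})^2

for t in range(1, 1001):
    g = rng.normal()
    m_prev = m
    m = beta1 * m_prev + (1 - beta1) * g
    s_orig = beta2 * s_orig + (1 - beta2) * (g - m) ** 2
    s_variant = beta2 * s_variant + (1 - beta2) * (g - m_prev) ** 2

    step_orig = lr * m / np.sqrt(s_orig)
    step_variant = lr * m / np.sqrt(s_variant)
    # the variant step equals beta1 * original step (up to floating point error)
    assert np.isclose(step_variant, beta1 * step_orig)

print("variant update = beta1 * original update (eps and bias correction ignored)")
```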

For the second question, it's quite likely that the default \beta values should be set differently from Adam's. I have not tested it yet, so I don't have a concrete result now. Will try that later or incorporate it into the next release.