juntang-zhuang / Adabelief-Optimizer

Repository for NeurIPS 2020 Spotlight "AdaBelief Optimizer: Adapting stepsizes by the belief in observed gradients"

Why does g_t subtract m_t, instead of m_{t-1}? #48

Closed zxteloiv closed 3 years ago

zxteloiv commented 3 years ago

Dear authors, thanks for providing such a good implementation; I have benefited a lot from the repo in my experiments. I have a question about the update of s_t in the algorithm, as in the title.

In my task, (g_t - m_t)^2 gives a contradictory result against (g_t - m_{t-1})^2 regarding the choice of beta2. Specifically, the original update (g_t - m_t)^2 suggests a larger beta2 is better (0.999 rather than 0.98), while the revised version (g_t - m_{t-1})^2 shows 0.98 is the better beta2.

Other parameters are kept at their defaults. The code version I use is adabelief-pytorch 0.2.0. To name some of them: lr=1e-3, eps=1e-16, weight_decay=0.1, weight_decoupled=True, amsgrad=False, fixed_decay=False, rectify=True.
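For reference, here is roughly how I construct the optimizer (a sketch only; the exact keyword spellings, e.g. weight_decouple vs. weight_decoupled, may differ slightly across adabelief-pytorch versions):

```python
import torch
from adabelief_pytorch import AdaBelief

model = torch.nn.Linear(4, 2)  # placeholder model, just for illustration

# Sketch of the configuration listed above (adabelief-pytorch 0.2.0);
# keyword names may vary slightly by package version.
optimizer = AdaBelief(
    model.parameters(),
    lr=1e-3,
    betas=(0.9, 0.999),    # beta2 is the value being varied (0.999 vs 0.98)
    eps=1e-16,
    weight_decay=0.1,
    weight_decouple=True,
    fixed_decay=False,
    amsgrad=False,
    rectify=True,          # set to False when comparing with Adam/RAdam
)
```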

To compare with Adam and RAdam, I also tested with rectify set to False. The contradiction between the original and revised updates of s_t still occurs (however, this time the better beta2 is reversed).

I know this parameter tuning lacks sufficient evidence to make a convincing conclusion, so I just wonder why (g_t - m_t)^2 is used. Since (g_t - m_{t-1})^2 compares the gradient of the current step with the previous moving average, I guess it is more intuitive.
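For concreteness, the two updates being compared look roughly like this (a toy sketch in plain PyTorch, not the code from this repo; eps and bias correction are omitted):

```python
import torch

beta1, beta2 = 0.9, 0.999
g = torch.randn(10)        # current gradient g_t
m_prev = torch.randn(10)   # previous first moment m_{t-1}
s_prev = torch.zeros(10)   # previous second moment s_{t-1}

m = beta1 * m_prev + (1 - beta1) * g  # m_t

# Original AdaBelief update: center the gradient on the current EMA m_t
s_original = beta2 * s_prev + (1 - beta2) * (g - m) ** 2

# Revised variant discussed in this issue: center on the previous EMA m_{t-1}
s_revised = beta2 * s_prev + (1 - beta2) * (g - m_prev) ** 2
```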

Thanks for reading my question. Wish you a good day :)

juntang-zhuang commented 3 years ago

Hi, thanks for your question. There's no specific reason for choosing g_t - m_t over g_t - m_{t-1}; I just picked one arbitrarily. In fact, consider m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t; then g_t - m_t = g_t - \beta_1 m_{t-1} - (1 - \beta_1) g_t = \beta_1 (g_t - m_{t-1}). As long as \beta_1 is close to 1, there's not much difference. I think the difference will be larger when \beta_1 is small, although I still have no idea which one is better. Somehow I suspect the optimal choice of betas is different from the defaults in Adam; however, it requires too much computation to find a choice that's good for most tasks, so I did not perform an extensive search.
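A quick way to check that identity numerically (a standalone sketch, not the repo's code):

```python
import torch

beta1 = 0.9
g = torch.randn(1000, dtype=torch.float64)       # g_t
m_prev = torch.randn(1000, dtype=torch.float64)  # m_{t-1}
m = beta1 * m_prev + (1 - beta1) * g             # m_t

# g_t - m_t equals beta1 * (g_t - m_{t-1}), so the squared terms feeding s_t
# differ only by the constant factor beta1 ** 2.
print(torch.allclose(g - m, beta1 * (g - m_prev)))  # True
```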