juntang-zhuang / Adabelief-Optimizer

Repository for NeurIPS 2020 Spotlight "AdaBelief Optimizer: Adapting stepsizes by the belief in observed gradients"

Why does g_t subtract m_t, instead of m_{t-1}? #48

Closed zxteloiv closed 3 years ago

zxteloiv commented 3 years ago

Dear authors, thanks for providing such a good implementation; I have benefited a lot from the repo in my experiments. I have a question about the update of s_t in the algorithm, as in the title.

In my task, (g_t - m_t)^2 gives a contradictory result against (g_t - m_{t-1})^2 regarding the choice of beta2. Specifically, the original update (g_t - m_t)^2 suggests a larger beta2 is better (0.999 rather than 0.98), while the revised version (g_t - m_{t-1})^2 shows 0.98 is the better beta2.

Other parameters are kept at their defaults. The code version I use is adabelief-pytorch 0.2.0. To name some of them: lr=1e-3, eps=1e-16, weight_decay=0.1, weight_decoupled=True, amsgrad=False, fixed_decay=False, rectify=True.
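For reference, here is roughly how I construct the optimizer (a sketch only; the exact keyword spellings, e.g. weight_decouple vs. weight_decoupled, may differ slightly across adabelief-pytorch versions):

```python
import torch
from adabelief_pytorch import AdaBelief

model = torch.nn.Linear(4, 2)  # placeholder model, just for illustration

# Sketch of the configuration listed above (adabelief-pytorch 0.2.0);
# keyword names may vary slightly by package version.
optimizer = AdaBelief(
    model.parameters(),
    lr=1e-3,
    betas=(0.9, 0.999),    # beta2 is the value being varied (0.999 vs 0.98)
    eps=1e-16,
    weight_decay=0.1,
    weight_decouple=True,
    fixed_decay=False,
    amsgrad=False,
    rectify=True,          # set to False when comparing with Adam/RAdam
)
```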

To compare with Adam and RAdam, I also tested with rectify set to False. The contradiction between the original and revised updates of s_t still occurs (however, this time the better beta2 is reversed).

I know this parameter tuning lacks sufficient evidence to make a convincing conclusion, so I just wonder why (g_t - m_t)^2 is used. Since (g_t - m_{t-1})^2 compares the gradient of the current step with the previous moving average, I guess it is more intuitive.
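For concreteness, the two updates being compared look roughly like this (a toy sketch in plain PyTorch, not the code from this repo; eps and bias correction are omitted):

```python
import torch

beta1, beta2 = 0.9, 0.999
g = torch.randn(10)        # current gradient g_t
m_prev = torch.randn(10)   # previous first moment m_{t-1}
s_prev = torch.zeros(10)   # previous second moment s_{t-1}

m = beta1 * m_prev + (1 - beta1) * g  # m_t

# Original AdaBelief update: center the gradient on the current EMA m_t
s_original = beta2 * s_prev + (1 - beta2) * (g - m) ** 2

# Revised variant discussed in this issue: center on the previous EMA m_{t-1}
s_revised = beta2 * s_prev + (1 - beta2) * (g - m_prev) ** 2
```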

Thanks for reading my question. Wish you a good day :)

juntang-zhuang commented 3 years ago

Hi, thanks for your question. There's no specific reason for choosing g_t - m_t over g_t - m_{t-1}; I just picked one arbitrarily. In fact, consider m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t; then g_t - m_t = g_t - \beta_1 m_{t-1} - (1 - \beta_1) g_t = \beta_1 (g_t - m_{t-1}). As long as \beta_1 is close to 1, there's not much difference. I think the difference will be larger when \beta_1 is small, although I still have no idea which one is better. Somehow I suspect the optimal choice of betas is different from the defaults in Adam; however, it requires too much computation to find a choice that's good for most tasks, so I did not perform an extensive search.
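A quick way to check that identity numerically (a standalone sketch, not the repo's code):

```python
import torch

beta1 = 0.9
g = torch.randn(1000, dtype=torch.float64)       # g_t
m_prev = torch.randn(1000, dtype=torch.float64)  # m_{t-1}
m = beta1 * m_prev + (1 - beta1) * g             # m_t

# g_t - m_t equals beta1 * (g_t - m_{t-1}), so the squared terms feeding s_t
# differ only by the constant factor beta1 ** 2.
print(torch.allclose(g - m, beta1 * (g - m_prev)))  # True
```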