Leiay opened this issue 2 years ago
Hi,
I was using the SGDW implementation in this repo, and I wonder whether something is wrong with this line:
https://github.com/jettify/pytorch-optimizer/blob/910b414565427f0a66e20040475e7e4385e066a5/torch_optimizer/sgdw.py#L121
Let the weight decay be $\lambda$ and the learning rate be $\mu_t$. If I understand it correctly, this line of code applies the weight decay update as $$\theta_t \leftarrow \tilde{\theta}_t - \lambda \mu_t$$ where (following the notation in the paper)
$$\tilde{\theta}_t \leftarrow \theta_{t-1} - m_t$$
But it should be
$$\begin{aligned} \theta_{t-1} &\leftarrow \theta_{t-1} \cdot (1 - \lambda \mu_t) \\ \theta_t &\leftarrow \theta_{t-1} - m_t \end{aligned}$$
as in the paper.
This results in poor training performance compared to SGD with the same set of optimization hyperparameters.
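For concreteness, here is a minimal, self-contained sketch of the update order I have in mind. This is my own illustration with made-up values (`lr`, `weight_decay`, `momentum`, `buf` only mirror the names in `sgdw.py`), not a tested patch:

```python
import torch

# Hypothetical hyperparameters for illustration only.
lr, weight_decay, momentum = 0.1, 1e-2, 0.9

p = torch.randn(3)            # theta_{t-1}
grad = torch.randn(3)         # g_t
buf = torch.zeros_like(p)     # momentum buffer

# m_t <- beta * m_{t-1} + g_t  (momentum accumulation)
buf.mul_(momentum).add_(grad)

# theta_{t-1} <- theta_{t-1} * (1 - lambda * mu_t)
# Decay is multiplicative and applied to the pre-update weights.
p.mul_(1 - lr * weight_decay)

# theta_t <- theta_{t-1} - mu_t * m_t
p.add_(buf, alpha=-lr)
```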
Thanks!
Regards, Liu