Spnetic-5 closed this 1 year ago
Thanks @Spnetic-5. I believe it's correct now. In your original implementation, the L2 regularization was not accounted for in the accumulation of the squared gradients because it was applied later, in the parameter update. The learning rate decay was also applied twice: in each step the decayed learning rate should be computed from the original learning rate, not from the previous step's value. Subtle differences that weren't caught in the tests.
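For reference, a minimal sketch in Python of an Adagrad step following the PyTorch formulation referenced below; names like `state_sum`, `lr_decay`, and `weight_decay` are illustrative and not the identifiers used in this repo:

```python
import numpy as np

def adagrad_step(param, grad, state_sum, step, lr=1e-2,
                 lr_decay=0.0, weight_decay=0.0, eps=1e-10):
    """One Adagrad update (illustrative, PyTorch-style semantics)."""
    # L2 regularization is folded into the gradient *before* the
    # squared-gradient accumulation, so it contributes to state_sum.
    if weight_decay != 0.0:
        grad = grad + weight_decay * param

    # Decay is taken relative to the original learning rate each step,
    # not compounded from the previous step's decayed value.
    clr = lr / (1.0 + (step - 1) * lr_decay)

    state_sum += grad ** 2
    param -= clr * grad / (np.sqrt(state_sum) + eps)
    return param, state_sum
```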
I'll go ahead and merge; please release v0.15.0 when you get a chance.
Reference: PyTorch Docs