szymonmaszke / torchlayers

Shape and dimension inference (Keras-like) for PyTorch layers and neural networks
https://szymonmaszke.github.io/torchlayers/
MIT License

Is the WeightDecay implementation correct? #17

Closed veritas9872 closed 1 year ago

veritas9872 commented 1 year ago

The current implementation of WeightDecay uses param.grad = self.regularize(param) to compute the decayed gradient, which has two problems. First, the result is incorrect because the original gradient is replaced instead of having the decay term added to it; the correct update would be param.grad += self.regularize(param). Second, the PyTorch hook API does not allow in-place modification of the gradient; instead, a hook may optionally return a new gradient. I think returning the new gradient value would be the correct approach.
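For illustration, here is a minimal sketch of the hook-returning pattern described above. The function and variable names are hypothetical (not torchlayers' actual API); the point is only that the hook adds the decay term to the incoming gradient and returns the result rather than mutating param.grad:

```python
import torch

def make_weight_decay_hook(param, weight_decay=0.1):
    """Hypothetical helper: build a tensor hook implementing L2 weight decay."""
    def hook(grad):
        # Add the decay term to the incoming gradient and return a new
        # tensor; the hook never modifies grad or param.grad in place.
        return grad + weight_decay * param.detach()
    return hook

param = torch.randn(3, requires_grad=True)
param.register_hook(make_weight_decay_hook(param, weight_decay=0.1))

loss = param.sum()       # d(loss)/d(param) == ones(3)
loss.backward()
# param.grad now equals ones(3) + 0.1 * param
```

Returning a value from a hook registered with Tensor.register_hook replaces the gradient that autograd propagates, which is the behavior the documentation sanctions.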

veritas9872 commented 1 year ago

I have figured it out. Because the in-place update runs before backward() accumulates gradients, autograd simply adds the true gradients on top of the decay term afterwards, so the final value comes out correct. I am still not sure whether modifying gradients in place is a good idea, though, given that the documentation explicitly discourages it.
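The accumulation behavior described above can be demonstrated in isolation. This is a minimal sketch (with an assumed weight_decay value, not torchlayers' code) showing that seeding param.grad with the decay term before backward() still yields grad + decay, because autograd accumulates into .grad with +=:

```python
import torch

param = torch.randn(3, requires_grad=True)
weight_decay = 0.1

# Seed .grad with the decay term before running backward().
param.grad = weight_decay * param.detach()

loss = param.sum()  # d(loss)/d(param) == ones(3)
loss.backward()     # autograd *accumulates* into the pre-seeded .grad

# param.grad now equals ones(3) + 0.1 * param, i.e. gradient plus decay.
```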