szymonmaszke / torchlayers

Shape and dimension inference (Keras-like) for PyTorch layers and neural networks
https://szymonmaszke.github.io/torchlayers/
MIT License

Is the WeightDecay implementation correct? #17

Closed veritas9872 closed 1 year ago

veritas9872 commented 1 year ago

The current implementation of WeightDecay uses param.grad = self.regularize(param) to compute the decayed gradient, which has two problems. First, the result is incorrect because the original gradient is replaced instead of having the decay term added to it; the correct update would be param.grad += self.regularize(param). Second, the PyTorch hook API does not allow in-place modification of the gradient; instead, a hook may optionally return a new gradient. I think returning the new gradient value would be the correct approach.
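For illustration, here is a minimal sketch of the hook-returning pattern described above. The function and variable names are hypothetical (not torchlayers' actual API); the point is only that the hook adds the decay term to the incoming gradient and returns the result rather than mutating param.grad:

```python
import torch

def make_weight_decay_hook(param, weight_decay=0.1):
    """Hypothetical helper: build a tensor hook implementing L2 weight decay."""
    def hook(grad):
        # Add the decay term to the incoming gradient and return a new
        # tensor; the hook never modifies grad or param.grad in place.
        return grad + weight_decay * param.detach()
    return hook

param = torch.randn(3, requires_grad=True)
param.register_hook(make_weight_decay_hook(param, weight_decay=0.1))

loss = param.sum()       # d(loss)/d(param) == ones(3)
loss.backward()
# param.grad now equals ones(3) + 0.1 * param
```

Returning a value from a hook registered with Tensor.register_hook replaces the gradient that autograd propagates, which is the behavior the documentation sanctions.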

veritas9872 commented 1 year ago

I have figured it out. Because the in-place update runs before backward() accumulates gradients, autograd simply adds the true gradients on top of the decay term afterwards, so the final value comes out correct. I am still not sure whether modifying gradients in place is a good idea, though, given that the documentation explicitly discourages it.
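The accumulation behavior described above can be demonstrated in isolation. This is a minimal sketch (with an assumed weight_decay value, not torchlayers' code) showing that seeding param.grad with the decay term before backward() still yields grad + decay, because autograd accumulates into .grad with +=:

```python
import torch

param = torch.randn(3, requires_grad=True)
weight_decay = 0.1

# Seed .grad with the decay term before running backward().
param.grad = weight_decay * param.detach()

loss = param.sum()  # d(loss)/d(param) == ones(3)
loss.backward()     # autograd *accumulates* into the pre-seeded .grad

# param.grad now equals ones(3) + 0.1 * param, i.e. gradient plus decay.
```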