I have figured it out. Because the in-place update happens first, the gradients computed afterwards are simply accumulated on top of it, which resolves the issue. I am still not sure whether modifying gradients in-place is a good idea, considering that the documentation explicitly discourages it.
The current implementation of `WeightDecay` uses `param.grad = self.regularize(param)` to calculate the decayed gradient, which has two problems. First, the decayed gradient is incorrect because the original gradient is replaced instead of having the decay term added to it. The correct statement should be `param.grad += self.regularize(param)`.
Second, the PyTorch hook API does not allow in-place modification; instead, it allows the hook to optionally return a new gradient. I think that returning the new gradient value would be the correct approach.
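A minimal sketch of what that could look like, again assuming the decay term is `weight_decay * param` as a stand-in for `self.regularize(param)`: the hook returns a new tensor instead of mutating `grad`, and PyTorch uses the returned value when accumulating into `param.grad`.

```python
import torch

def make_weight_decay_hook(param, weight_decay=1e-2):
    # Returns a hook that adds the assumed decay term to the incoming gradient
    # and returns the result, as Tensor.register_hook permits, instead of
    # modifying `grad` in place.
    def hook(grad):
        return grad + weight_decay * param.detach()
    return hook

param = torch.nn.Parameter(torch.randn(3))
param.register_hook(make_weight_decay_hook(param))

loss = (param ** 2).sum()
loss.backward()
# param.grad now equals 2 * param + weight_decay * param, with no in-place mutation.
print(torch.allclose(param.grad, 2 * param.detach() + 1e-2 * param.detach()))  # True
```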