`:zeroGradParameters()` is overridden in DataParallelTable, so you cannot replace it with `self.gradParams:zero()`. `optim` is intended to operate only on the master GPU's copy of the parameters, which is why it works on `self.gradParams`, whereas `zeroGradParameters` has to zero the gradients of all the replicas.
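Roughly, the distinction looks like this in a training step. This is a minimal sketch, not the actual fb.resnet.torch code; `model` (already wrapped in `nn.DataParallelTable`), `criterion`, and `optimState` are assumed to exist:

```lua
require 'nn'
require 'cunn'
require 'optim'

-- Sketch only: assumes `model` is already wrapped in nn.DataParallelTable
-- and that `criterion` and `optimState` are defined elsewhere.
local params, gradParams = model:getParameters()  -- flattened master-GPU copies

local function trainStep(input, target)
   -- Correct: DataParallelTable overrides zeroGradParameters() so the
   -- gradient buffers of every replica are zeroed, not just the master's.
   model:zeroGradParameters()

   -- Incorrect: this clears only the master copy; the replicas keep their
   -- stale gradients, which get summed back into gradParams on backward().
   -- gradParams:zero()

   local output = model:forward(input)                 -- broadcasts params to the replicas
   local loss = criterion:forward(output, target)
   local gradOutput = criterion:backward(output, target)
   model:backward(input, gradOutput)                   -- sums replica gradients into gradParams

   optim.sgd(function() return loss, gradParams end, params, optimState)
   return loss
end
```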
So, forward distributes the master's copy of `params` to every replica, and backward sums up gradients from all replicas into the master's copy of `gradParams`? I assume that's the reason adding an L2 norm to `gradParams` and updating `params` doesn't cause an issue.
That's right: forward broadcasts `params` to all the replicas, and backward sums the gradients into `gradParams`.
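In other words, once `backward()` has finished, `gradParams` already holds the fully summed gradient, so editing it in place at that point behaves just as it does for a single-GPU model. A minimal sketch, continuing the variables from the snippet above (`l2Weight` and `learningRate` are illustrative names, not from the repo):

```lua
local l2Weight, learningRate = 1e-4, 0.1          -- illustrative values only

local output = model:forward(input)               -- params broadcast to every replica
local loss = criterion:forward(output, target)
model:backward(input, criterion:backward(output, target))

-- gradParams now holds the summed gradient from all replicas, so in-place
-- edits here act on the full gradient, just like optim's updates do:
gradParams:add(l2Weight, params)                  -- add an L2 penalty term
params:add(-learningRate, gradParams)             -- plain SGD step on the master copy
-- The next forward() re-broadcasts the updated params to the replicas.
```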
Great, thanks!
It would be nice if this were added to the documentation as a "do not do this"!
Setting the `gradParams` of a model wrapped in `DataParallelTable` seems to cause issues. For example, consider train.lua of https://github.com/facebook/fb.resnet.torch. Replacing

`self.model:zeroGradParameters()`

with

`self.gradParams:zero()`

causes the loss to go to NaN. This is not a problem for models not inside a DataParallelTable. The weird part is that the optim package directly modifies `self.gradParams`, for example https://github.com/torch/optim/blob/master/sgd.lua#L48 and https://github.com/torch/optim/blob/master/sgd.lua#L65, but this doesn't cause any issues.
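Those lines boil down to something like the following in-place updates on the gradient tensor that `optim.sgd` receives, which is the same storage as `self.gradParams` (paraphrased here, not copied verbatim; `weightDecay`, `learningRate`, `x`, and `dfdx` follow optim's naming):

```lua
-- Paraphrase of the kind of in-place update optim.sgd performs:
if weightDecay ~= 0 then
   dfdx:add(weightDecay, x)       -- add the weight-decay (L2) term in place
end
x:add(-learningRate, dfdx)        -- apply the parameter update in place as well
```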
I even tried `self.gradParams:mul(0.0)` instead of `self.gradParams:zero()`, and that doesn't work. I also tried calling `self.gradParams:zero()` before the forward and after the backward, and neither worked. The command used was the recipe for multi-GPU training on CIFAR-10:
th main.lua -dataset cifar10 -nGPU 2 -batchSize 128 -depth 20
Is this a syncing issue? If yes, how is the optim package not causing any issue?
FYI, here's the output of the file: