torch / cunn

DataParallelTable Bug or intended behavior? #457

Closed: arunmallya closed this issue 7 years ago

arunmallya commented 7 years ago

Setting the gradParams of a model wrapped in a DataParallelTable directly seems to cause issues. For example, consider train.lua of https://github.com/facebook/fb.resnet.torch

Replacing self.model:zeroGradParameters() with self.gradParams:zero() causes the loss to go to nan. This is not a problem for models that are not wrapped in a DataParallelTable.
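For reference, the relevant part of the training step looks roughly like this (a simplified sketch of Trainer:train in fb.resnet.torch; variable names follow that file, but the exact code may differ):

    -- Simplified sketch of the training step (self.model may be wrapped in a
    -- DataParallelTable; self.params, self.gradParams = self.model:getParameters()).
    local function feval()
       return self.criterion.output, self.gradParams
    end

    local output = self.model:forward(self.input)
    local loss = self.criterion:forward(output, self.target)

    self.model:zeroGradParameters()   -- original line: trains fine
    -- self.gradParams:zero()         -- replacing the line above with this makes the loss go to nan

    self.criterion:backward(output, self.target)
    self.model:backward(self.input, self.criterion.gradInput)

    optim.sgd(feval, self.params, self.optimState)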

The weird part is that the optim package directly modifies self.gradParams, for example at https://github.com/torch/optim/blob/master/sgd.lua#L48 and https://github.com/torch/optim/blob/master/sgd.lua#L65, yet this doesn't cause any issues.
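The linked lines are, roughly, the in-place weight-decay and parameter-update steps; paraphrased below (variable names follow sgd.lua, see the source for the full logic):

    -- Paraphrase of the in-place operations in optim/sgd.lua referenced above:
    if wd ~= 0 then
       dfdx:add(wd, x)     -- weight decay added directly into the gradient buffer
    end
    -- ... momentum handling also mutates the gradient buffer in place ...
    x:add(-clr, dfdx)      -- parameter update on the master copy of the params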

I even tried self.gradParams:mul(0.0) instead of self.gradParams:zero(), and that doesn't work either. I also tried calling self.gradParams:zero() before the forward and after the backward, and neither worked.

The command used was the recipe for multi-GPU training on CIFAR-10: th main.lua -dataset cifar10 -nGPU 2 -batchSize 128 -depth 20

Is this a syncing issue? If yes, how is the optim package not causing any issue?

FYI, here's the output of the run:

=> Creating model from file: models/resnet.lua
 | ResNet-20 CIFAR-10
=> Training epoch # 1
 | Epoch: [1][1/391]    Time 2.874  Data 0.029  Err 2.7079  top1  87.500  top5  48.438
 | Epoch: [1][2/391]    Time 0.022  Data 0.001  Err nan  top1  90.625  top5  50.000
 | Epoch: [1][3/391]    Time 0.020  Data 0.000  Err nan  top1  85.156  top5  48.438
ngimel commented 7 years ago

:zeroGradParameters() is overridden in DataParallelTable, so you cannot replace it with self.gradParams:zero(). optim is intended to operate only on the master GPU's copy of the parameters, which is why it works on self.gradParams; zeroGradParameters, on the other hand, has to zero the gradients of all the replicas.
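A minimal sketch of the intended division of labour (assuming two GPUs with ids 1 and 2; the surrounding setup is illustrative only):

    require 'cunn'
    require 'optim'

    local net = nn.Sequential():add(nn.Linear(10, 10)):cuda()
    local dpt = nn.DataParallelTable(1):add(net, {1, 2})

    -- Flattened views of the master GPU's replica only:
    local params, gradParams = dpt:getParameters()

    dpt:zeroGradParameters()   -- correct: zeros the gradients on every replica
    -- gradParams:zero()       -- zeros only the master copy; other replicas keep stale gradients

    -- optim only needs to touch the master copy, so operating on params/gradParams is fine:
    -- optim.sgd(feval, params, optimState)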

arunmallya commented 7 years ago

So, forward distributes the master's copy of params to every replica, and backward sums up the gradients from all replicas into the master's copy of gradParams? I assume that's also why adding the L2 term to gradParams and updating params doesn't cause an issue.

ngimel commented 7 years ago

That's right: forward broadcasts params to all the replicas, and backward sums the gradients from the replicas into gradParams.
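Put differently, the per-iteration flow is roughly (a conceptual summary, not the actual cunn source):

    -- Conceptual per-iteration data flow for nn.DataParallelTable:
    --   forward:  broadcast the master replica's params to the other GPUs,
    --             scatter the batch along dim 1, run forward on each replica
    --   backward: run backward on each replica with its slice of gradOutput,
    --             then sum every replica's gradients into the master's gradParams
    --   update:   optim modifies only the master params/gradParams;
    --             the next forward re-broadcasts the updated params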

arunmallya commented 7 years ago

Great, thanks! It would be nice if this were added to the documentation as a do-not-do-this!