torch / nn


zero grad params on initialization #484

Open willwhitney opened 9 years ago

willwhitney commented 9 years ago
th> lin = nn.Linear(2,2)
th> p1, gp1 = lin:getParameters()
th> p1
 0.4611
-0.6737
-0.6769
 0.3312
-0.3065
-0.0952
[torch.DoubleTensor of size 6]

th> gp1
-2.6816e+154
-2.6816e+154
 2.9644e-323
 2.7813e-309
-2.6816e+154
-2.6816e+154
[torch.DoubleTensor of size 6]

Is there a reason grad params don't start zeroed when a module is initialized? This seems super dangerous, and since initialization only happens once, it's not like it's a big performance hit to zero them.
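
A minimal sketch of the explicit workaround, clearing the buffers by hand right after construction (only the standard zeroGradParameters / Tensor:zero calls are used; the variable names are just illustrative):

require 'nn'

-- clear the gradient buffers yourself right after building the module
local lin = nn.Linear(2, 2)
local p1, gp1 = lin:getParameters()
lin:zeroGradParameters()   -- zeroes gradWeight and gradBias in place
-- equivalently: gp1:zero(), since gp1 is a flattened view of those buffers
print(gp1)                 -- now all zeros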
soumith commented 9 years ago

the training loop is usually run as:

model:zeroGradParameters()
criterion:forward(model:forward(...), target)
model:backward(...)
optimization

After every mini-batch, you need to zero the gradient buffers for correctness anyway. Initialization with zeros would likely hide bugs induced by forgetting to zero the gradBuffers every iteration...
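
Spelled out as a runnable iteration, that ordering looks something like the sketch below; the linear model, MSE criterion, random data, and learning rate are purely illustrative:

require 'nn'

-- illustrative model and data; the point is the order of operations per mini-batch
local model = nn.Linear(2, 2)
local criterion = nn.MSECriterion()
local input, target = torch.randn(2), torch.randn(2)
local learningRate = 0.01

for i = 1, 100 do
  model:zeroGradParameters()                         -- clear the buffers first
  local output = model:forward(input)
  local loss = criterion:forward(output, target)
  local gradOutput = criterion:backward(output, target)
  model:backward(input, gradOutput)                  -- accumulates into gradWeight/gradBias
  model:updateParameters(learningRate)               -- the "optimization" step
end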

willwhitney commented 9 years ago

Yup, I get that this is the standard form. But intuitively, you'd expect this one to work just as well:

criterion:forward(model:forward(...), target)
model:backward(...)
optimization
model:zeroGradParameters()

This probably isn't that big a deal either way (I came across it randomly, not as a bug), but since all the other fields get initialized for you, it seems like this one should be too.
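
A sketch of that alternative ordering, with the same kind of illustrative setup as above; its very first iteration only matches the standard form if the buffers already start at zero:

require 'nn'

-- same kind of illustrative setup; only the placement of the zeroing changes
local model = nn.Linear(2, 2)
local criterion = nn.MSECriterion()
local input, target = torch.randn(2), torch.randn(2)

for i = 1, 100 do
  local output = model:forward(input)
  criterion:forward(output, target)
  model:backward(input, criterion:backward(output, target))
  model:updateParameters(0.01)
  model:zeroGradParameters()   -- zeroing at the end: iteration 1 uses whatever
                               -- the gradient buffers happened to contain at init
end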

soumith commented 8 years ago

This has come up several times in the past. Maybe we should initialize gradWeight / gradBias with NaNs.
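
As a sketch of what that would mean for a single module (filling the existing buffers with NaN by hand; 0/0 is just one way to produce a NaN in Lua):

require 'nn'

-- sketch only: NaN-filled gradient buffers make a forgotten
-- zeroGradParameters() surface as NaN losses instead of silently wrong gradients
local nan = 0 / 0
local lin = nn.Linear(2, 2)
lin.gradWeight:fill(nan)
lin.gradBias:fill(nan)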

Kaixhin commented 8 years ago

I had assumed that zeros were the default and just so happened to write an optimisation loop the latter way around, so +1 for initialising with NaNs (by the reasoning you gave above).