Open foelin opened 7 years ago
Instead of calling gradParams:zero(), try calling model:zeroGradParameters() instead.
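A minimal sketch of the suggested change, assuming a standard nn-style training loop (the `nn.Linear` model here is purely illustrative):

```lua
require 'nn'

-- illustrative toy model standing in for netD
local model = nn.Linear(10, 2)
local params, gradParams = model:getParameters()

-- Instead of zeroing the flattened view directly:
--   gradParams:zero()
-- zero the gradients through the module, which containers such as
-- nn.DataParallelTable can intercept and apply to each GPU replica:
model:zeroGradParameters()
```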
Thanks a lot for your help! It works on one machine with nccl installed. The output:
converting module to nn.DataParallelTable
netD params error before optim 0
gradParams_single:sum(): -658.81219482422
gradParams_multi:sum(): -660.19378662109
netD params error after optim: 0.00019995868206024
netD gradParams error: 0.0081400275230408
netD output error: 0.0074694454669952
End of Testing!
However, on another machine without nccl, it fails. The output:
converting module to nn.DataParallelTable
warning: could not load nccl, falling back to default communication
converting module to nn.DataParallelTable
warning: could not load nccl, falling back to default communication
netD params error before optim 0
gradParams_single:sum(): -453.74481201172
gradParams_multi:sum(): -280.88250732422
netD params error after optim: 0.00019999733194709
netD gradParams error: 0.62991511821747
netD output error: 5.8884222854298e+23
End of Testing!
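The warning above means DataParallelTable fell back to its default GPU-to-GPU copies. A hedged sketch for checking whether the nccl bindings actually load before wrapping the model (the GPU ids are illustrative):

```lua
require 'cunn'

-- pcall returns false instead of erroring if the nccl bindings are absent
local hasNccl = pcall(require, 'nccl')
print('nccl available:', hasNccl)

-- nn.DataParallelTable(dimension, flattenParams, useNCCL):
-- the third argument requests nccl; without it the container uses
-- its default communication path, which is where the divergence appears.
local dpt = nn.DataParallelTable(1, true, hasNccl)
dpt:add(model, {1, 2})  -- illustrative GPU ids
```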
Same issue: https://github.com/torch/cunn/issues/457
Why does this not cause an issue with optim, which directly operates on the gradParams?
Hi All,
I get NaN in gradParameters when training with multiple GPUs. I have tried on both CUDA 7.5 (two K80) and CUDA 8.0 (two 1080P), and got similar errors. Any suggestion would be greatly appreciated! Thanks
code
output:
converting module to nn.DataParallelTable
netD params error before optim 0
gradParams_single:sum(): -658.27288818359
gradParams_multi:sum(): nan
netD params error after optim: 0.00020003318786621
netD gradParams error: 28949807628288
netD output error: 0.0060847103595734
End of Testing!