soumith / imagenet-multiGPU.torch

an imagenet example in torch.
BSD 2-Clause "Simplified" License

training on multiple GPUs gives nan #74

Closed: arashno closed this issue 8 years ago

arashno commented 8 years ago

Hi, I set up the code on my computer. With one GPU it runs fine:

th main.lua

==> doing epoch on training data:
==> online epoch # 1
Epoch: [1][1/5636] Time 2.380 Err 3.9442 Top1-%: 2.34 LR 1e-02 DataLoadingTime 1.595
Epoch: [1][2/5636] Time 0.470 Err 3.6877 Top1-%: 21.09 LR 1e-02 DataLoadingTime 0.366
Epoch: [1][3/5636] Time 0.454 Err 3.2735 Top1-%: 28.12 LR 1e-02 DataLoadingTime 0.365
Epoch: [1][4/5636] Time 0.451 Err 3.2096 Top1-%: 25.78 LR 1e-02 DataLoadingTime 0.365
Epoch: [1][5/5636] Time 0.442 Err 2.8022 Top1-%: 33.59 LR 1e-02 DataLoadingTime 0.368
Epoch: [1][6/5636] Time 0.448 Err 3.0409 Top1-%: 24.22 LR 1e-02 DataLoadingTime 0.368
Epoch: [1][7/5636] Time 0.446 Err 2.7138 Top1-%: 32.81 LR 1e-02 DataLoadingTime 0.365
Epoch: [1][8/5636] Time 0.449 Err 2.7420 Top1-%: 25.78 LR 1e-02 DataLoadingTime 0.366
Epoch: [1][9/5636] Time 0.449 Err 2.7148 Top1-%: 22.66 LR 1e-02 DataLoadingTime 0.366
Epoch: [1][10/5636] Time 0.436 Err 2.6244 Top1-%: 19.53 LR 1e-02 DataLoadingTime 0.367
Epoch: [1][11/5636] Time 0.444 Err 2.7249 Top1-%: 17.19 LR 1e-02 DataLoadingTime 0.368
Epoch: [1][12/5636] Time 0.443 Err 2.5064 Top1-%: 27.34 LR 1e-02 DataLoadingTime 0.366

But when I try to use more than one GPU, the error is nan from the first iteration, and each iteration is also slower (about 2.4 s instead of about 0.45 s):

th main.lua -nGPU 2

Epoch: [1][1/5636] Time 2.001 Err nan Top1-%: 2.34 LR 1e-02 DataLoadingTime 2.173
Epoch: [1][2/5636] Time 2.416 Err nan Top1-%: 2.34 LR 1e-02 DataLoadingTime 0.366
Epoch: [1][3/5636] Time 2.416 Err nan Top1-%: 0.78 LR 1e-02 DataLoadingTime 0.368
Epoch: [1][4/5636] Time 2.416 Err nan Top1-%: 3.12 LR 1e-02 DataLoadingTime 0.367
Epoch: [1][5/5636] Time 2.416 Err nan Top1-%: 2.34 LR 1e-02 DataLoadingTime 0.369
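
For anyone comparing runs like the two above, here is a minimal sketch of a log check. It is written in Python and is not part of this repository; the function name `summarize` and the sample constants are made up for illustration. It assumes the log lines follow exactly the `Epoch: [e][i/n] Time ... Err ...` format shown in this issue, and it counts iterations whose error is nan and averages the per-iteration time.

```python
import math
import re

# Matches the start of lines in the format shown in this issue, e.g.
# "Epoch: [1][2/5636] Time 0.470 Err 3.6877 Top1-%: 21.09 LR 1e-02 ..."
LINE_RE = re.compile(
    r"Epoch: \[(\d+)\]\[(\d+)/(\d+)\]\s+Time\s+(\S+)\s+Err\s+(\S+)"
)


def summarize(log_text):
    """Return (iterations, nan_iterations, mean_time) for a training log.

    Hypothetical helper: lines that do not match LINE_RE are skipped.
    """
    times = []
    nan_count = 0
    for line in log_text.splitlines():
        match = LINE_RE.search(line)
        if match is None:
            continue
        times.append(float(match.group(4)))
        # float("nan") parses the literal "nan" printed by the trainer.
        if math.isnan(float(match.group(5))):
            nan_count += 1
    mean_time = sum(times) / len(times) if times else float("nan")
    return len(times), nan_count, mean_time


# Two lines copied from each run above, used as sample input.
SINGLE_GPU_SAMPLE = """\
Epoch: [1][2/5636] Time 0.470 Err 3.6877 Top1-%: 21.09 LR 1e-02 DataLoadingTime 0.366
Epoch: [1][3/5636] Time 0.454 Err 3.2735 Top1-%: 28.12 LR 1e-02 DataLoadingTime 0.365
"""

MULTI_GPU_SAMPLE = """\
Epoch: [1][2/5636] Time 2.416 Err nan Top1-%: 2.34 LR 1e-02 DataLoadingTime 0.366
Epoch: [1][3/5636] Time 2.416 Err nan Top1-%: 0.78 LR 1e-02 DataLoadingTime 0.368
"""

if __name__ == "__main__":
    for name, sample in [("1 GPU", SINGLE_GPU_SAMPLE), ("2 GPUs", MULTI_GPU_SAMPLE)]:
        count, nans, mean_time = summarize(sample)
        print(f"{name}: {count} iterations, {nans} with nan error, "
              f"mean time {mean_time:.3f} s")
```

A check like this only confirms the symptom (nan error and slower iterations); it says nothing about the cause.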

arashno commented 8 years ago

It was a problem with my Torch installation; reinstalling Torch fixed it.