soumith / imagenet-multiGPU.torch

an imagenet example in torch.
BSD 2-Clause "Simplified" License

Some help debugging? #23

Closed by Atcold 9 years ago

Atcold commented 9 years ago

After training for some time, I get this error:

Epoch: [6][4563/10000]  Time 0.425 Err 6.8380 Top1-%: 0.39 LR 1e-02 DataLoadingTime 0.006
Epoch: [6][4564/10000]  Time 0.440 Err 6.8348 Top1-%: 0.78 LR 1e-02 DataLoadingTime 0.007
Epoch: [6][4565/10000]  Time 0.438 Err 6.7681 Top1-%: 1.56 LR 1e-02 DataLoadingTime 0.007
Epoch: [6][4566/10000]  Time 0.426 Err 6.7398 Top1-%: 1.56 LR 1e-02 DataLoadingTime 0.005
Epoch: [6][4567/10000]  Time 0.423 Err 6.7808 Top1-%: 1.17 LR 1e-02 DataLoadingTime 0.004
/usr/local/bin/luajit: /usr/local/share/lua/5.1/threads/threads.lua:255: 
[thread 22 callback] bad argument #2 to '?' (out of range at /tmp/luarocks_torch-scm-1-9679/torch7/generic/Tensor.c:880)
stack traceback:
        [C]: in function 'error'
        /usr/local/share/lua/5.1/threads/threads.lua:255: in function 'synchronize'
        /usr/local/share/lua/5.1/threads/threads.lua:196: in function 'addjob'
        /home/atcold/Work/GitHub/multiGPU-train/train.lua:99: in function 'train'
        main.lua:38: in main chunk
        [C]: in function 'dofile'
        /usr/local/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
        [C]: at 0x00406260

Then, after restarting from the epoch 5 checkpoint, I get this:

Loading model from file: results-5k10k/20151104-162101-alexnetowtbn,batchSize=256,nDonkeys=24,nGPU=4,netType=alexnetowtbn,normalize=f/model_5.t7
==> Converting model to CUDA
Loading optimState from file: results-5k10k/20151104-162101-alexnetowtbn,batchSize=256,nDonkeys=24,nGPU=4,netType=alexnetowtbn,normalize=f/optimState_5.t7
==> doing epoch on training data:
==> online epoch # 6
/usr/local/bin/luajit: /usr/local/share/lua/5.1/threads/threads.lua:255: 
[thread 16 endcallback] /usr/local/share/lua/5.1/nn/Module.lua:70: Assertion `THCudaTensor_checkGPU(state, 1, self_)' failed.  at /tmp/luarocks_cutorch-scm-1-4237/cutorch/lib/THC/THCTensorMath.cu:30
[thread 8 endcallback] /usr/local/share/lua/5.1/nn/Module.lua:70: Assertion `THCudaTensor_checkGPU(state, 1, self_)' failed.  at /tmp/luarocks_cutorch-scm-1-4237/cutorch/lib/THC/THCTensorMath.cu:30
[thread 1 endcallback] /usr/local/share/lua/5.1/nn/Module.lua:70: Assertion `THCudaTensor_checkGPU(state, 1, self_)' failed.  at /tmp/luarocks_cutorch-scm-1-4237/cutorch/lib/THC/THCTensorMath.cu:30
[thread 6 endcallback] /usr/local/share/lua/5.1/nn/Module.lua:70: Assertion `THCudaTensor_checkGPU(state, 1, self_)' failed.  at /tmp/luarocks_cutorch-scm-1-4237/cutorch/lib/THC/THCTensorMath.cu:30
[thread 13 endcallback] /usr/local/share/lua/5.1/nn/Module.lua:70: Assertion `THCudaTensor_checkGPU(state, 1, self_)' failed.  at /tmp/luarocks_cutorch-scm-1-4237/cutorch/lib/THC/THCTensorMath.cu:30
[thread 2 endcallback] /usr/local/share/lua/5.1/nn/Module.lua:70: Assertion `THCudaTensor_checkGPU(state, 1, self_)' failed.  at /tmp/luarocks_cutorch-scm-1-4237/cutorch/lib/THC/THCTensorMath.cu:30

and so on.

Atcold commented 9 years ago

OK, this is probably a bug on my side, I guess... Sorry.

zhefan commented 8 years ago

Hi Atcold, I got the same error message. May I ask what your bug was?

Atcold commented 8 years ago

Post your error message here with nDonkeys = 0.
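
The point of asking for nDonkeys = 0 is presumably to take the data-loading threads out of the picture, so the error is reported from the code that actually failed rather than from a `[thread N callback]` wrapper. The sketch below only illustrates that idea. It is written in Python, not the repository's Lua, and `InlinePool`, `load_batch`, and `dataset` are hypothetical names that do not come from the project.

```python
import traceback


class InlinePool:
    """Hypothetical stand-in for a worker pool configured with zero workers."""

    def addjob(self, job, callback):
        # Run the job and its callback immediately in the calling thread,
        # so an exception propagates with its original traceback.
        callback(job())

    def synchronize(self):
        # Nothing to wait for: every job has already run inline.
        pass


def load_batch(index, dataset):
    # Deliberately unguarded indexing: raises IndexError when out of range.
    return dataset[index]


dataset = [10, 20, 30]  # assumed toy data, standing in for a real dataset
pool = InlinePool()
results = []

# A valid job: its result is handed to the callback right away.
pool.addjob(lambda: load_batch(1, dataset), results.append)

# An out-of-range job: the exception surfaces directly in the caller.
failing_frame = None
try:
    pool.addjob(lambda: load_batch(7, dataset), results.append)
except IndexError as exc:
    # The innermost traceback frame names the function that really failed.
    failing_frame = traceback.extract_tb(exc.__traceback__)[-1].name

pool.synchronize()
print(results, failing_frame)
```

With the jobs running inline, the innermost frame of the traceback is the loader function itself, which is usually enough to see which index or tensor operation went out of range.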