soumith / imagenet-multiGPU.torch

an imagenet example in torch.
BSD 2-Clause "Simplified" License

Cannot train googlenet #39

Open ppwwyyxx opened 8 years ago

ppwwyyxx commented 8 years ago

I tried to train googlenet with the command:

 th main.lua -data ./imagenet -netType googlenet -nGPU 1 -nDonkeys 4

It throws a lot of exceptions at the very beginning of training:

==> doing epoch on training data:
==> online epoch # 1
/home/wyx/torch/install/bin/luajit: /home/wyx/torch/install/share/lua/5.1/threads/threads.lua:264: 
[thread 3 endcallback] ...wyx/torch/install/share/lua/5.1/cudnn/SpatialSoftMax.lua:71: assertion failed!
stack traceback:
        [C]: in function 'assert'
        ...wyx/torch/install/share/lua/5.1/cudnn/SpatialSoftMax.lua:71: in function 'updateGradInput'
        /home/wyx/torch/install/share/lua/5.1/nn/Module.lua:30: in function 'backward'
        /home/wyx/torch/install/share/lua/5.1/nn/Sequential.lua:84: in function 'backward'
        /home/wyx/torch/install/share/lua/5.1/nn/Concat.lua:70: in function 'backward'
        /home/wyx/torch/install/share/lua/5.1/nn/Sequential.lua:84: in function 'backward'
        /home/wyx/imagenet-multiGPU.torch/train.lua:171: in function 'opfunc'
        /home/wyx/torch/install/share/lua/5.1/optim/sgd.lua:44: in function 'sgd'
        /home/wyx/imagenet-multiGPU.torch/train.lua:174: in function </home/wyx/imagenet-multiGPU.torch/train.lua:155>
        [C]: in function 'xpcall'
        /home/wyx/torch/install/share/lua/5.1/threads/threads.lua:173: in function 'dojob'
        /home/wyx/torch/install/share/lua/5.1/threads/threads.lua:220: in function 'addjob'
        /home/wyx/imagenet-multiGPU.torch/train.lua:97: in function 'train'
        main.lua:44: in main chunk
        [C]: in function 'dofile'
        .../wyx/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
        [C]: at 0x00405d30

The assertion at SpatialSoftMax.lua:71 is:

assert(gradOutput:isContiguous());

However, if I change googlenet to vgg in the command, training starts.
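A possible reading of the traceback, sketched under assumptions: the failing call goes through nn/Concat.lua before it reaches cudnn/SpatialSoftMax.lua, which suggests that each branch of the Concat receives a narrow() view of gradOutput rather than its own tensor. A view taken along dimension 2 of a batched tensor is generally not contiguous, which would trip the assertion. The shapes below (batch of 4, two branches of 3 outputs each) are made up for illustration and are not taken from the googlenet model; the snippet uses only plain torch tensors, not cudnn.

```lua
require 'torch'

-- Hypothetical shapes: batch of 4, two branches with 3 outputs each,
-- concatenated along dimension 2 (so gradOutput is 4 x 6).
local gradOutput = torch.randn(4, 6)

-- Assumption: the container slices gradOutput per branch with narrow().
-- The result shares storage with gradOutput and keeps its row stride of 6.
local branchGrad = gradOutput:narrow(2, 1, 3)

-- The same check as the assertion at SpatialSoftMax.lua:71.
local isContig = branchGrad:isContiguous()
print('view contiguous?', isContig)

-- contiguous() returns a compact copy when the tensor is not contiguous.
local fixed = branchGrad:contiguous()
local fixedContig = fixed:isContiguous()
print('copy contiguous?', fixedContig)
```

If this is indeed the cause, one workaround to try would be making the gradient contiguous before it reaches the softmax layer, for example with gradOutput:contiguous() or an nn.Contiguous() module placed at the end of each branch. Whether that is the right fix for this repository is not confirmed here.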

mrdeanplumbley commented 8 years ago

Yup, I get the same thing