soumith / imagenet-multiGPU.torch

an imagenet example in torch.
BSD 2-Clause "Simplified" License

Changing GPU device id to >1 gives an error. #32

Open vislab2013 opened 8 years ago

vislab2013 commented 8 years ago

I get this bug when changing the GPU device id from 1 to 2 or 3 (I have 3 GPUs on the same machine): https://github.com/soumith/imagenet-multiGPU.torch/blob/master/opts.lua#L28 Leaving it at 1 works fine.

(with GPU=2, nGPU=2)
==> doing epoch on training data:
==> online epoch # 1
Debugging session completed (traced 3 instructions).
/home/mf/Toolkits/torch/install/bin/luajit: /home/mf/Toolkits/torch/install/share/lua/5.1/nn/Module.lua:70: Assertion `THCudaTensor_checkGPU(state, 1, self_)' failed. at /home/mf/Toolkits/torch/extra/cutorch/lib/THC/THCTensorMath.cu:30
stack traceback:
    [C]: in function 'zero'
    /home/mf/Toolkits/torch/install/share/lua/5.1/nn/Module.lua:70: in function 'zeroGradParameters'
    ...mf/Toolkits/torch/install/share/lua/5.1/nn/Container.lua:30: in function 'func'
    ...mf/Toolkits/torch/install/share/lua/5.1/nn/Container.lua:25: in function 'applyToModules'
    ...mf/Toolkits/torch/install/share/lua/5.1/nn/Container.lua:30: in function 'zeroGradParameters'
    ...mf/Toolkits/torch/install/share/lua/5.1/nn/Container.lua:30: in function 'func'
    ...mf/Toolkits/torch/install/share/lua/5.1/nn/Container.lua:25: in function 'applyToModules'
    ...mf/Toolkits/torch/install/share/lua/5.1/nn/Container.lua:30: in function 'zeroGradParameters'
    ...s/torch/install/share/lua/5.1/cunn/DataParallelTable.lua:458: in function 'zeroGradParameters'
    ...mf/Toolkits/torch/install/share/lua/5.1/nn/Container.lua:30: in function 'func'
    ...mf/Toolkits/torch/install/share/lua/5.1/nn/Container.lua:25: in function 'applyToModules'
    ...mf/Toolkits/torch/install/share/lua/5.1/nn/Container.lua:30: in function 'zeroGradParameters'
    /home/mf/Toolkits/Codigo/imagenet-multiGPU.torch/train.lua:167: in function 'opfunc'
    /home/mf/Toolkits/torch/install/share/lua/5.1/optim/sgd.lua:44: in function 'sgd'
    /home/mf/Toolkits/Codigo/imagenet-multiGPU.torch/train.lua:174: in function 'f2'
    /home/mf/Toolkits/Codigo/imagenet-multiGPU.torch/data.lua:36: in function 'addjob'
    /home/mf/Toolkits/Codigo/imagenet-multiGPU.torch/train.lua:97: in function 'train'
    /home/mf/Toolkits/Codigo/imagenet-multiGPU.torch/main.lua:45: in main chunk
    [C]: in function 'dofile'
    ...kits/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
    [C]: at 0x00406670
Program completed in 28.99 seconds (pid: 6719).

(with GPU=2, nGPU=1)
==> doing epoch on training data:
==> online epoch # 1
/home/mf/Toolkits/torch/install/bin/luajit: /home/mf/Toolkits/torch/install/share/lua/5.1/nn/THNN.lua:177: Assertion `THCudaTensor_checkGPU( state, 4, input, target, output, total_weight )' failed. at /home/mf/Toolkits/torch/extra/cunn/lib/THCUNN/ClassNLLCriterion.cu:123
stack traceback:
    [C]: in function 'v'
    /home/mf/Toolkits/torch/install/share/lua/5.1/nn/THNN.lua:177: in function 'ClassNLLCriterion_updateOutput'
    ...its/torch/install/share/lua/5.1/nn/ClassNLLCriterion.lua:41: in function 'forward'
    /home/mf/Toolkits/Codigo/imagenet-multiGPU.torch/train.lua:169: in function 'opfunc'
    /home/mf/Toolkits/torch/install/share/lua/5.1/optim/sgd.lua:44: in function 'sgd'
    /home/mf/Toolkits/Codigo/imagenet-multiGPU.torch/train.lua:174: in function 'f2'
    /home/mf/Toolkits/Codigo/imagenet-multiGPU.torch/data.lua:36: in function 'addjob'
    /home/mf/Toolkits/Codigo/imagenet-multiGPU.torch/train.lua:97: in function 'train'
    /home/mf/Toolkits/Codigo/imagenet-multiGPU.torch/main.lua:45: in main chunk
    [C]: in function 'dofile'
    ...kits/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
    [C]: at 0x00406670
Program completed in 4.10 seconds (pid: 11801).

NOTE: Torch is up to date, and I cloned the master branch of the repo.

soumith commented 8 years ago

Try controlling which GPU is used via the CUDA_VISIBLE_DEVICES environment variable instead.

vislab2013 commented 8 years ago

Just reporting a weird bug, that's all. You may close this issue if you want.

WendyShang commented 8 years ago

I ran into the same problem, and CUDA_VISIBLE_DEVICES solves it. Thanks. For example: CUDA_VISIBLE_DEVICES=2 th main.lua
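One thing to watch for with this workaround: CUDA_VISIBLE_DEVICES takes 0-based device indices, while cutorch numbers devices from 1. With a single device exposed, cutorch sees it as device 1, so the GPU option can stay at 1. Below is a minimal sketch of that index mapping. The helper name `cutorch_to_cuda_index` is hypothetical and not part of the repo, and the command is only printed, not run, so the sketch works without Torch installed.

```shell
#!/bin/sh
# Hypothetical helper (not part of the repo): map a 1-based cutorch device id
# to the 0-based index that CUDA_VISIBLE_DEVICES expects.
cutorch_to_cuda_index() {
    echo $(( $1 - 1 ))
}

gpu=3   # the cutorch id you would otherwise have set as the GPU option

# The GPU option is left at its value of 1, since only one device is visible.
cmd="CUDA_VISIBLE_DEVICES=$(cutorch_to_cuda_index "$gpu") th main.lua"

# Print the command instead of executing it.
echo "$cmd"
```

So the command above, CUDA_VISIBLE_DEVICES=2, selects what cutorch would otherwise call device 3.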

lukacf commented 8 years ago

If nGPU is 1, then cutorch.setDevice(opt.GPU) is never called. This can be solved by moving line 13

cutorch.setDevice(opt.GPU)

in util.lua out of the if clause.