Open vislab2013 opened 8 years ago
Try controlling which GPU is used via the CUDA_VISIBLE_DEVICES environment variable instead.
Just reporting a weird bug, that's all. You may close this issue if you want.
I ran into the same problem, and CUDA_VISIBLE_DEVICES solves it. Thanks. E.g.: CUDA_VISIBLE_DEVICES=2 th main.lua
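To spell out the workaround: CUDA_VISIBLE_DEVICES hides every GPU except the listed ones from the process and renumbers the visible ones starting from 0 on the CUDA side (so the first visible GPU is device 1 in cutorch's 1-based numbering). A rough sketch of a wrapper follows; the helper name run_on_gpus is made up for illustration, and the -GPU / -nGPU flags in the commented example are assumed from the opt.GPU and nGPU options mentioned in this thread.

```shell
# Hypothetical helper: run a command with only the listed physical GPUs
# visible to CUDA. Not part of the repo.
# Usage: run_on_gpus <comma-separated GPU ids> <command> [args...]
run_on_gpus() {
    ids="$1"
    shift
    # env sets the variable for this one command only, not for the shell.
    env CUDA_VISIBLE_DEVICES="$ids" "$@"
}

# Example (flags assumed): physical GPU 2 becomes the only visible device,
# so the script can keep its default device index of 1.
# run_on_gpus 2 th main.lua -GPU 1 -nGPU 1
```

With this approach the GPU option inside the script stays at 1, since the selected card is the only one the process can see.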
If nGPU is 1, then cutorch.setDevice(opt.GPU) is never called. This can be fixed by moving line 13
cutorch.setDevice(opt.GPU)
in util.lua out of the if clause.
I get this bug when changing the GPU device id from 1 to 2 or 3 (I have 3 GPUs on the same machine). https://github.com/soumith/imagenet-multiGPU.torch/blob/master/opts.lua#L28 Leaving it at 1 works fine.
(with GPU=2, nGPU=2)
==> doing epoch on training data:
==> online epoch # 1
Debugging session completed (traced 3 instructions).
/home/mf/Toolkits/torch/install/bin/luajit: /home/mf/Toolkits/torch/install/share/lua/5.1/nn/Module.lua:70: Assertion `THCudaTensor_checkGPU(state, 1, self_)' failed. at /home/mf/Toolkits/torch/extra/cutorch/lib/THC/THCTensorMath.cu:30
stack traceback:
    [C]: in function 'zero'
    /home/mf/Toolkits/torch/install/share/lua/5.1/nn/Module.lua:70: in function 'zeroGradParameters'
    ...mf/Toolkits/torch/install/share/lua/5.1/nn/Container.lua:30: in function 'func'
    ...mf/Toolkits/torch/install/share/lua/5.1/nn/Container.lua:25: in function 'applyToModules'
    ...mf/Toolkits/torch/install/share/lua/5.1/nn/Container.lua:30: in function 'zeroGradParameters'
    ...mf/Toolkits/torch/install/share/lua/5.1/nn/Container.lua:30: in function 'func'
    ...mf/Toolkits/torch/install/share/lua/5.1/nn/Container.lua:25: in function 'applyToModules'
    ...mf/Toolkits/torch/install/share/lua/5.1/nn/Container.lua:30: in function 'zeroGradParameters'
    ...s/torch/install/share/lua/5.1/cunn/DataParallelTable.lua:458: in function 'zeroGradParameters'
    ...mf/Toolkits/torch/install/share/lua/5.1/nn/Container.lua:30: in function 'func'
    ...mf/Toolkits/torch/install/share/lua/5.1/nn/Container.lua:25: in function 'applyToModules'
    ...mf/Toolkits/torch/install/share/lua/5.1/nn/Container.lua:30: in function 'zeroGradParameters'
    /home/mf/Toolkits/Codigo/imagenet-multiGPU.torch/train.lua:167: in function 'opfunc'
    /home/mf/Toolkits/torch/install/share/lua/5.1/optim/sgd.lua:44: in function 'sgd'
    /home/mf/Toolkits/Codigo/imagenet-multiGPU.torch/train.lua:174: in function 'f2'
    /home/mf/Toolkits/Codigo/imagenet-multiGPU.torch/data.lua:36: in function 'addjob'
    /home/mf/Toolkits/Codigo/imagenet-multiGPU.torch/train.lua:97: in function 'train'
    /home/mf/Toolkits/Codigo/imagenet-multiGPU.torch/main.lua:45: in main chunk
    [C]: in function 'dofile'
    ...kits/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
    [C]: at 0x00406670
Program completed in 28.99 seconds (pid: 6719).
(with GPU=2, nGPU=1)
==> doing epoch on training data:
==> online epoch # 1
/home/mf/Toolkits/torch/install/bin/luajit: /home/mf/Toolkits/torch/install/share/lua/5.1/nn/THNN.lua:177: Assertion `THCudaTensor_checkGPU( state, 4, input, target, output, total_weight )' failed. at /home/mf/Toolkits/torch/extra/cunn/lib/THCUNN/ClassNLLCriterion.cu:123
stack traceback:
    [C]: in function 'v'
    /home/mf/Toolkits/torch/install/share/lua/5.1/nn/THNN.lua:177: in function 'ClassNLLCriterion_updateOutput'
    ...its/torch/install/share/lua/5.1/nn/ClassNLLCriterion.lua:41: in function 'forward'
    /home/mf/Toolkits/Codigo/imagenet-multiGPU.torch/train.lua:169: in function 'opfunc'
    /home/mf/Toolkits/torch/install/share/lua/5.1/optim/sgd.lua:44: in function 'sgd'
    /home/mf/Toolkits/Codigo/imagenet-multiGPU.torch/train.lua:174: in function 'f2'
    /home/mf/Toolkits/Codigo/imagenet-multiGPU.torch/data.lua:36: in function 'addjob'
    /home/mf/Toolkits/Codigo/imagenet-multiGPU.torch/train.lua:97: in function 'train'
    /home/mf/Toolkits/Codigo/imagenet-multiGPU.torch/main.lua:45: in main chunk
    [C]: in function 'dofile'
    ...kits/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
    [C]: at 0x00406670
Program completed in 4.10 seconds (pid: 11801).
NOTE: Torch is up to date, and I cloned the master branch of the repo.