harvardnlp / seq2seq-attn

Sequence-to-sequence model with LSTM encoder/decoders and attention
http://nlp.seas.harvard.edu/code
MIT License

Why can't I use GPU 0? #64

Closed SeekPoint closed 7 years ago

SeekPoint commented 8 years ago

rzai@rzai00:~/prj/seq2seq-attn-1$ th train.lua -data_file data/demo-train.hdf5 -val_data_file data/demo-val.hdf5 -savefile demo-model -gpuid 0 -num_layers 4 -rnn_size 500
using CUDA on GPU 0...
THCudaCheck FAIL file=/tmp/luarocks_cutorch-scm-1-6130/cutorch/init.c line=719 error=10 : invalid device ordinal
/home/rzai/torch/install/bin/luajit: train.lua:957: cuda runtime error (10) : invalid device ordinal at /tmp/luarocks_cutorch-scm-1-6130/cutorch/init.c:719
stack traceback:
  [C]: in function 'setDevice'
  train.lua:957: in function 'main'
  train.lua:1074: in main chunk
  [C]: in function 'dofile'
  ...rzai/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
  [C]: at 0x00406670
rzai@rzai00:~/prj/seq2seq-attn-1$

rzai@rzai00:~/prj/seq2seq-attn-1$ nvidia-smi
Wed Oct 26 16:18:08 2016
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 367.48                 Driver Version: 367.48                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 1080    Off  | 0000:01:00.0      On |                  N/A |
| 33%   55C    P2    44W / 180W |    747MiB /  8113MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 1080    Off  | 0000:02:00.0     Off |                  N/A |
| 47%   66C    P2   155W / 180W |   7301MiB /  8113MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0       519    C   /home/rzai/torch/install/bin/luajit            253MiB |
|    0       639    C   /home/rzai/torch/install/bin/luajit            247MiB |
|    0      1332    G   /usr/lib/xorg/Xorg                             183MiB |
|    0      2449    G   compiz                                          59MiB |
|    1       519    C   /home/rzai/torch/install/bin/luajit           6317MiB |
|    1       639    C   /home/rzai/torch/install/bin/luajit            981MiB |
+-----------------------------------------------------------------------------+
rzai@rzai00:~/prj/seq2seq-attn-1$

jsenellart commented 8 years ago

It is just a difference in GPU index counting: nvidia-smi numbers GPUs starting at 0, while the -gpuid and -gpuid2 parameters number them starting at 1. So when you use -gpuid 1, you are training on the GPU that nvidia-smi calls index 0.
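The "invalid device ordinal" error above is exactly this off-by-one: per the stack trace, train.lua passes the -gpuid value to cutorch.setDevice, and cutorch follows Lua's 1-based convention, so device 0 does not exist. A minimal sketch of that behavior (assuming only a working cutorch install):

-- cutorch device indices are 1-based; device 1 here is GPU 0 in nvidia-smi
require 'cutorch'
print(cutorch.getDeviceCount()) -- number of GPUs visible to this process
cutorch.setDevice(1)            -- select the first visible GPU
print(cutorch.getDevice())      -- prints 1
-- cutorch.setDevice(0) raises "invalid device ordinal", as in the trace above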

Note also that you should set the CUDA_VISIBLE_DEVICES environment variable so that your process does not allocate memory on the other, unused GPU (see the sanity check after the examples below). For that variable, the ID is the 0-based one. So, typically, to train on GPU 0:

CUDA_VISIBLE_DEVICES=0 th train.lua -gpuid 1 ...

to train on GPU 1:

CUDA_VISIBLE_DEVICES=1 th train.lua -gpuid 1 ...

to train on both GPUs:

CUDA_VISIBLE_DEVICES=0,1 th train.lua -gpuid 1 -gpuid2 2 ...
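
To sanity-check the masking, you can ask cutorch how many devices it sees under the mask (an illustrative one-liner, not from the original thread; cutorch.getDeviceCount is the standard cutorch call):

CUDA_VISIBLE_DEVICES=1 th -e "require 'cutorch'; print(cutorch.getDeviceCount())"

This prints 1: only the second physical GPU is visible, and the process addresses it as device 1.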

SeekPoint commented 8 years ago

@jsenellart-systran that helps, thanks.

But when I set -gpuid 1, nvidia-smi shows the luajit process running on GPU 1.

jsenellart commented 8 years ago

It should not. Check the point about CUDA_VISIBLE_DEVICES above: you are probably seeing the small default memory footprint that cutorch creates on every visible device, not the training itself.
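
If you want to verify this from inside Torch, cutorch.getMemoryUsage reports free/total memory per visible device (a hedged sketch using the standard cutorch API):

th -e "require 'cutorch'; for i = 1, cutorch.getDeviceCount() do print(i, cutorch.getMemoryUsage(i)) end"

With CUDA_VISIBLE_DEVICES restricted as above, the loop only enumerates the permitted GPU(s), so the process cannot leave a footprint on the others.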