harvardnlp / seq2seq-attn

Sequence-to-sequence model with LSTM encoder/decoders and attention
http://nlp.seas.harvard.edu/code
MIT License

Why can't I use GPU 0? #64

Closed SeekPoint closed 7 years ago

SeekPoint commented 8 years ago

rzai@rzai00:~/prj/seq2seq-attn-1$ th train.lua -data_file data/demo-train.hdf5 -val_data_file data/demo-val.hdf5 -savefile demo-model -gpuid 0 -num_layers 4 -rnn_size 500
using CUDA on GPU 0...
THCudaCheck FAIL file=/tmp/luarocks_cutorch-scm-1-6130/cutorch/init.c line=719 error=10 : invalid device ordinal
/home/rzai/torch/install/bin/luajit: train.lua:957: cuda runtime error (10) : invalid device ordinal at /tmp/luarocks_cutorch-scm-1-6130/cutorch/init.c:719
stack traceback:
  [C]: in function 'setDevice'
  train.lua:957: in function 'main'
  train.lua:1074: in main chunk
  [C]: in function 'dofile'
  ...rzai/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
  [C]: at 0x00406670
rzai@rzai00:~/prj/seq2seq-attn-1$

rzai@rzai00:~/prj/seq2seq-attn-1$ nvidia-smi
Wed Oct 26 16:18:08 2016
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 367.48                 Driver Version: 367.48                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 1080    Off  | 0000:01:00.0      On |                  N/A |
| 33%   55C    P2    44W / 180W |    747MiB /  8113MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 1080    Off  | 0000:02:00.0     Off |                  N/A |
| 47%   66C    P2   155W / 180W |   7301MiB /  8113MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0       519    C   /home/rzai/torch/install/bin/luajit            253MiB |
|    0       639    C   /home/rzai/torch/install/bin/luajit            247MiB |
|    0      1332    G   /usr/lib/xorg/Xorg                             183MiB |
|    0      2449    G   compiz                                          59MiB |
|    1       519    C   /home/rzai/torch/install/bin/luajit           6317MiB |
|    1       639    C   /home/rzai/torch/install/bin/luajit            981MiB |
+-----------------------------------------------------------------------------+
rzai@rzai00:~/prj/seq2seq-attn-1$

jsenellart commented 8 years ago

It is just a difference in GPU index counting: nvidia-smi numbers GPUs starting at 0, while the -gpuid and -gpuid2 parameters number them starting at 1. So when you use -gpuid 1, you are training on the GPU that nvidia-smi calls index 0.
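The "invalid device ordinal" error above is exactly this off-by-one: per the stack trace, train.lua passes the -gpuid value to cutorch.setDevice, and cutorch follows Lua's 1-based convention, so device 0 does not exist. A minimal sketch of that behavior (assuming only a working cutorch install):

-- cutorch device indices are 1-based; device 1 here is GPU 0 in nvidia-smi
require 'cutorch'
print(cutorch.getDeviceCount()) -- number of GPUs visible to this process
cutorch.setDevice(1)            -- select the first visible GPU
print(cutorch.getDevice())      -- prints 1
-- cutorch.setDevice(0) raises "invalid device ordinal", as in the trace above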

Note also that you should set the CUDA_VISIBLE_DEVICES environment variable so that your process does not allocate memory on the other, unused GPU (see the sanity check after the examples below). For that variable, the ID is the 0-based one. So, typically, to train on GPU 0:

CUDA_VISIBLE_DEVICES=0 th train.lua -gpuid 1 ...

to train on GPU 1:

CUDA_VISIBLE_DEVICES=1 th train.lua -gpuid 1 ...

to train on both GPUs:

CUDA_VISIBLE_DEVICES=0,1 th train.lua -gpuid 1 -gpuid2 2 ...
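
To sanity-check the masking, you can ask cutorch how many devices it sees under the mask (an illustrative one-liner, not from the original thread; cutorch.getDeviceCount is the standard cutorch call):

CUDA_VISIBLE_DEVICES=1 th -e "require 'cutorch'; print(cutorch.getDeviceCount())"

This prints 1: only the second physical GPU is visible, and the process addresses it as device 1.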

SeekPoint commented 8 years ago

@jsenellart-systran that helps, thanks.

But when I set -gpuid 1, nvidia-smi shows the luajit process running on GPU 1.

jsenellart commented 8 years ago

It should not. Check the point about CUDA_VISIBLE_DEVICES above: you are probably seeing the small default memory footprint that cutorch creates on every visible device, not the training itself.
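
If you want to verify this from inside Torch, cutorch.getMemoryUsage reports free/total memory per visible device (a hedged sketch using the standard cutorch API):

th -e "require 'cutorch'; for i = 1, cutorch.getDeviceCount() do print(i, cutorch.getMemoryUsage(i)) end"

With CUDA_VISIBLE_DEVICES restricted as above, the loop only enumerates the permitted GPU(s), so the process cannot leave a footprint on the others.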