Hmm, yeah, multi-GPU seems to be broken at the moment. I'll try to look into this.
Thank you! And if it is convenient, could you point me to how I can do multi-GPU training on more than 2 GPUs, say 4? Thank you!
Within our framework the encoder is on GPU 1 and the decoder on GPU 2, so it's not possible to utilize more than 2 GPUs (the split is for memory rather than speed; see the sketch below). I believe it should be possible to utilize more GPUs for things like data parallelization; there are examples here: https://github.com/soumith/imagenet-multiGPU.torch
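Roughly, the two-GPU split looks like this (a toy, standalone sketch assuming two visible GPUs; the `nn.Linear` modules are just stand-ins for the real encoder/decoder stacks, not our actual code):

```lua
-- Model parallelism: encoder parameters on GPU 1, decoder parameters on GPU 2.
require 'cunn'   -- pulls in nn and cutorch

cutorch.setDevice(1)
local encoder = nn.Linear(500, 500):cuda()   -- toy stand-in, allocated on GPU 1

cutorch.setDevice(2)
local decoder = nn.Linear(500, 1000):cuda()  -- toy stand-in, allocated on GPU 2

-- run each half with its own device active
cutorch.setDevice(1)
local h = encoder:forward(torch.CudaTensor(500):uniform())

cutorch.setDevice(2)
local y = decoder:forward(torch.CudaTensor(500):copy(h))  -- pull h over to GPU 2 first
```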
Hi Yoon Kim,
Morning! I hope to check with you on multi-GPU training; I really appreciate your thoughts. As we discussed before, we can do multi-GPU training with the encoder on one GPU and the decoder on another. That idea works for simple seq2seq. With attention, however, this is tricky: attention needs the encoder states for the source sentence at every step of the target sequence, during both the forward and backward passes, so the copying mechanism between the encoder GPU and the decoder GPU becomes tricky.
Have you tested the code in a multi-GPU setting on a single machine, with a bidirectional LSTM and attention?
Cheers, Zhong
OK, I got around to fixing this (it doesn't work when brnn = 1, though).
The problem you mention is actually not too bad: we only need to copy the entire hidden state matrix (source length x rnn size) to the second GPU once per batch, and then everything else can be done on the second GPU. Copying across GPUs is fast.
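In cutorch terms, the hand-off is a single cross-device copy per batch; something like this (a standalone sketch with made-up sizes, assuming two visible GPUs):

```lua
require 'cunn'   -- pulls in nn and cutorch

-- pretend GPU 1 just produced the encoder states for one batch
cutorch.setDevice(1)
local context = torch.CudaTensor(50, 500):uniform()  -- (source length x rnn size)

-- one copy per batch: allocate on GPU 2 and pull the whole matrix over
-- (cutorch supports :copy() between tensors that live on different GPUs)
cutorch.setDevice(2)
local context2 = torch.CudaTensor():resizeAs(context):copy(context)

-- every attention step can now read context2 locally on GPU 2
print(context2:getDevice())  -- prints 2
```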
Hope this helps, Yoon
Thank you! What needs to be done to get brnn = 1 working for multi-GPU training?
I came across this issue today. It would be great to have brnn working with 2 GPUs. Would you kindly put a warning in the README about using 2 GPUs with brnn? Hopefully that will save others some time poking around the settings. :)
Yeah, sorry about that. I spent a good deal of time trying to debug brnn + multi-GPU. The issue seems to be that, for some reason, I can't put the backward encoder on the first GPU.
Hi s2s team,
There is a multi-GPU problem. I tried setting DISABLE_CHECK_GPU, but it does not work either. Please let me know what would help. Thanks!
```
using CUDA on GPU 1...
using CUDA on second GPU 2...
loading data...
done!
Source vocab size: 28721, Target vocab size: 42787
Source max sent len: 50, Target max sent len: 52
Number of parameters: 66948287
/home//util/torch/install/bin/luajit: /home/util.lua:46: Assertion `THCudaTensor_checkGPU(state, 4, r_, t, m1, m2)' failed.  at /tmp/luarocks_cutorch-scm-1-7585/cutorch/lib/THC/THCTensorMathBlas.cu:79
stack traceback:
	[C]: in function 'addmm'
	/home/util.lua:46: in function 'func'
	.../util/torch/install/share/lua/5.1/nngraph/gmodule.lua:333: in function 'neteval'
	.../util/torch/install/share/lua/5.1/nngraph/gmodule.lua:368: in function 'forward'
	train.lua:367: in function 'train_batch'
	train.lua:622: in function 'train'
	train.lua:871: in function 'main'
	train.lua:874: in main chunk
	[C]: in function 'dofile'
	...util/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
	[C]: at 0x00406670
```
Cheers, Zhong
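For reference, the `THCudaTensor_checkGPU` assertion above fires whenever a single BLAS call (here `addmm`) is handed tensors that live on different GPUs. A minimal sketch that reproduces it, assuming two visible devices:

```lua
require 'cunn'

cutorch.setDevice(1)
local a = torch.CudaTensor(4, 4):uniform()   -- lives on GPU 1

cutorch.setDevice(2)
local b = torch.CudaTensor(4, 4):uniform()   -- lives on GPU 2
local c = torch.CudaTensor(4, 4):zero()      -- lives on GPU 2

-- mixing devices in one BLAS call trips the same assertion:
-- Assertion `THCudaTensor_checkGPU(state, 4, r_, t, m1, m2)' failed.
c:addmm(a, b)
```

In the training code this usually means some tensor in the graph (an input, a parameter, or a buffer) ended up on a different device from the module consuming it.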