I added the following to the train_sup.sh script and reloaded from the pretrained XLM language model:
```shell
export NGPU=8
python -W ignore -m torch.distributed.launch --nproc_per_node=$NGPU train.py \
    --expname sup${DATANAME}_${SRC}_${TGT}
```

(Note the braces around the variable names: without them, `sup$DATANAME_$SRC\$TGT` expands the undefined variable `$DATANAME_` instead of `$DATANAME`.)
When NGPU = 1, it runs perfectly. However, anything larger than that results in:
```
Traceback (most recent call last):
  File "train.py", line 337, in <module>
    main(params)
  File "train.py", line 293, in main
    trainer.mt_step(lang1, lang2, params.lambda_mt)
  File "/home/colman/DAMT/src/trainer.py", line 852, in mt_step
    enc1 = self.encoder('fwd', x=x1, lengths=len1, langs=langs1, causal=False)
  File "/home/colman/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/colman/anaconda3/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 606, in forward
    if self.reducer._rebuild_buckets():
RuntimeError: Expected to have finished reduction in the prior iteration before
starting a new one. This error indicates that your module has parameters that
were not used in producing loss. You can enable unused parameter detection by
(1) passing the keyword argument `find_unused_parameters=True` to
`torch.nn.parallel.DistributedDataParallel`; (2) making sure all `forward`
function outputs participate in calculating loss. If you already have done the
above two steps, then the distributed data parallel module wasn't able to
locate the output tensors in the return value of your module's `forward`
function. Please include the loss function and the structure of the return
value of `forward` of your module when reporting this issue (e.g. list, dict,
iterable).
```
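For reference, the error's first suggestion can be sketched with a minimal, self-contained example. Everything here is hypothetical (a toy `TwoBranch` module and a single-process `gloo` group on CPU, not the DAMT encoder or its launch setup); it only illustrates passing `find_unused_parameters=True` when a module has parameters that never contribute to the loss:

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn

# Hypothetical module: `self.unused` has parameters that never participate in
# producing the loss -- exactly the situation the DDP reducer complains about.
class TwoBranch(nn.Module):
    def __init__(self):
        super().__init__()
        self.used = nn.Linear(4, 4)
        self.unused = nn.Linear(4, 4)  # never called in forward()

    def forward(self, x):
        return self.used(x)

# Single-process "gloo" group, just so DDP can be constructed on CPU.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

# find_unused_parameters=True tells the reducer to detect parameters that did
# not take part in the loss, instead of raising the RuntimeError above.
model = nn.parallel.DistributedDataParallel(
    TwoBranch(), find_unused_parameters=True
)

loss = model(torch.randn(2, 4)).sum()
loss.backward()
dist.destroy_process_group()
```

In the actual codebase the equivalent change would go where the encoder/decoder are wrapped in `DistributedDataParallel` (around the call shown in the traceback's `trainer.py`), though that flag trades away some reducer performance, so tracing which parameters are genuinely unused in `mt_step` is the cleaner long-term fix.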
It would be great if anyone could shed some light on this.