RuntimeError: Input, output and indices must be on the current device
Note: When running the training with the above command and only one visible GPU, the training starts and runs correctly.
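For reference, one way to expose only a single GPU to the process (an assumption about how the single-GPU run was done, not something stated in this report) is to set CUDA_VISIBLE_DEVICES before PyTorch initializes CUDA, e.g. from Python:

```python
import os

# Hypothetical example: restrict the process to GPU 0 so that nn.DataParallel
# is never engaged. This must happen before torch initializes CUDA, ideally
# before the torch import itself.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

import torch
print(torch.cuda.device_count())  # expected to print 1
```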
Full Error Message
Traceback (most recent call last):
File "/home/jahnke/miniconda3/envs/dense/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/jahnke/miniconda3/envs/dense/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/dstore/home/jahnke/Dense/src/dense/driver/train.py", line 104, in <module>
main()
File "/dstore/home/jahnke/Dense/src/dense/driver/train.py", line 96, in main
trainer.train(
File "/home/jahnke/miniconda3/envs/dense/lib/python3.8/site-packages/transformers/trainer.py", line 888, in train
tr_loss += self.training_step(model, inputs)
File "/dstore/home/jahnke/Dense/src/dense/trainer.py", line 65, in training_step
return super(DenseTrainer, self).training_step(*args) / self._dist_loss_scale_factor
File "/home/jahnke/miniconda3/envs/dense/lib/python3.8/site-packages/transformers/trainer.py", line 1248, in training_step
loss = self.compute_loss(model, inputs)
File "/dstore/home/jahnke/Dense/src/dense/trainer.py", line 62, in compute_loss
return model(query=query, passage=passage).loss
File "/home/jahnke/miniconda3/envs/dense/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/jahnke/miniconda3/envs/dense/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 167, in forward
outputs = self.parallel_apply(replicas, inputs, kwargs)
File "/home/jahnke/miniconda3/envs/dense/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 177, in parallel_apply
return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
File "/home/jahnke/miniconda3/envs/dense/lib/python3.8/site-packages/torch/nn/parallel/parallel_apply.py", line 86, in parallel_apply
output.reraise()
File "/home/jahnke/miniconda3/envs/dense/lib/python3.8/site-packages/torch/_utils.py", line 429, in reraise
raise self.exc_type(msg)
RuntimeError: Caught RuntimeError in replica 1 on device 1.
Original Traceback (most recent call last):
File "/home/jahnke/miniconda3/envs/dense/lib/python3.8/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
output = module(*input, **kwargs)
File "/home/jahnke/miniconda3/envs/dense/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/dstore/home/jahnke/Dense/src/dense/modeling.py", line 112, in forward
q_hidden, q_reps = self.encode_query(query)
File "/dstore/home/jahnke/Dense/src/dense/modeling.py", line 183, in encode_query
qry_out = self.lm_q(**qry, return_dict=True)
File "/home/jahnke/miniconda3/envs/dense/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/jahnke/miniconda3/envs/dense/lib/python3.8/site-packages/transformers/models/bert/modeling_bert.py", line 951, in forward
embedding_output = self.embeddings(
File "/home/jahnke/miniconda3/envs/dense/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/jahnke/miniconda3/envs/dense/lib/python3.8/site-packages/transformers/models/bert/modeling_bert.py", line 200, in forward
inputs_embeds = self.word_embeddings(input_ids)
File "/home/jahnke/miniconda3/envs/dense/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/jahnke/miniconda3/envs/dense/lib/python3.8/site-packages/torch/nn/modules/sparse.py", line 145, in forward
return F.embedding(
File "/home/jahnke/miniconda3/envs/dense/lib/python3.8/site-packages/torch/nn/functional.py", line 1913, in embedding
return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: Input, output and indices must be on the current device
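The last frame suggests the embedding weight and the input_ids it receives live on different GPUs inside replica 1. A minimal sketch of that situation (hypothetical, not the Dense code) reproduces the same message on the PyTorch build shown in the traceback (~1.8):

```python
import torch
import torch.nn as nn

# Hypothetical minimal repro (not the Dense code): an embedding lookup raises
# this RuntimeError when the index tensor lives on a different GPU than the
# embedding weight -- the situation replica 1 of nn.DataParallel ends up in
# when its inputs are left on cuda:0 instead of being scattered to cuda:1.
if torch.cuda.device_count() >= 2:
    emb = nn.Embedding(30522, 768).to("cuda:1")                    # weight on cuda:1, like replica 1
    input_ids = torch.randint(0, 30522, (2, 8), device="cuda:0")   # indices left behind on cuda:0
    emb(input_ids)  # RuntimeError: Input, output and indices must be on the current device
```

With a single visible GPU there is only one device, so such a mismatch cannot occur, which matches the note above that single-GPU training runs correctly.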
Bug
When following the MS MARCO passage ranking example, a RuntimeError is raised when using multiple GPUs for training.
Starting the training via
produces the RuntimeError and full traceback shown above.
Environment
CUDA Version: 10.1
Operating System: Debian GNU/Linux 10 (buster)
Kernel: Linux 4.19.0-18-amd64
GPUs: 4x GTX 1080Ti 11GB
CPU: Intel E5-2620v4