luyug / Dense

A toolkit for building dense retrievers with deep language models.
Apache License 2.0

[RuntimeError: Input, output and indices must be on the current device] when training with multiple GPUs #8

Closed marceljahnke closed 2 years ago

marceljahnke commented 2 years ago

Bug

When following the MS MARCO passage ranking example, a RuntimeError is raised when training with multiple GPUs.

Starting the training via

python -m dense.driver.train \
  --output_dir ./retriever_model \
  --model_name_or_path bert-base-uncased \
  --save_steps 20000 \
  --train_dir ./marco/bert/train \
  --fp16 --per_device_train_batch_size 2 \
  --learning_rate 5e-6 \
  --num_train_epochs 2 \
  --dataloader_num_workers 2

produces:

RuntimeError: Input, output and indices must be on the current device

Note: When running the training with the above command and only one visible GPU, the training starts and runs correctly.
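This matches how the `transformers` Trainer dispatches: with more than one visible GPU it wraps the model in `torch.nn.DataParallel`, while with a single visible GPU the plain module is used, so the DataParallel replica path (where the error occurs) is never taken. A minimal sketch of that dispatch, assuming it mirrors the Trainer's `n_gpu > 1` check:

```python
import torch

# With >1 visible GPU the Trainer wraps the model in DataParallel;
# with 0 or 1 GPUs the plain module is used and no replication happens.
model = torch.nn.Linear(4, 2)
if torch.cuda.device_count() > 1:
    model = torch.nn.DataParallel(model)

# On a single-GPU (or CPU) machine this stays a plain Linear module.
print(type(model).__name__)
```

Setting `CUDA_VISIBLE_DEVICES=0` before launching is therefore one way to reproduce the working single-GPU behaviour on a multi-GPU host.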

Full Error Message

Traceback (most recent call last):
  File "/home/jahnke/miniconda3/envs/dense/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/jahnke/miniconda3/envs/dense/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/dstore/home/jahnke/Dense/src/dense/driver/train.py", line 104, in <module>
    main()
  File "/dstore/home/jahnke/Dense/src/dense/driver/train.py", line 96, in main
    trainer.train(
  File "/home/jahnke/miniconda3/envs/dense/lib/python3.8/site-packages/transformers/trainer.py", line 888, in train
    tr_loss += self.training_step(model, inputs)
  File "/dstore/home/jahnke/Dense/src/dense/trainer.py", line 65, in training_step
    return super(DenseTrainer, self).training_step(*args) / self._dist_loss_scale_factor
  File "/home/jahnke/miniconda3/envs/dense/lib/python3.8/site-packages/transformers/trainer.py", line 1248, in training_step
    loss = self.compute_loss(model, inputs)
  File "/dstore/home/jahnke/Dense/src/dense/trainer.py", line 62, in compute_loss
    return model(query=query, passage=passage).loss
  File "/home/jahnke/miniconda3/envs/dense/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/jahnke/miniconda3/envs/dense/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 167, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/home/jahnke/miniconda3/envs/dense/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 177, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/home/jahnke/miniconda3/envs/dense/lib/python3.8/site-packages/torch/nn/parallel/parallel_apply.py", line 86, in parallel_apply
    output.reraise()
  File "/home/jahnke/miniconda3/envs/dense/lib/python3.8/site-packages/torch/_utils.py", line 429, in reraise
    raise self.exc_type(msg)
RuntimeError: Caught RuntimeError in replica 1 on device 1.
Original Traceback (most recent call last):
  File "/home/jahnke/miniconda3/envs/dense/lib/python3.8/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
    output = module(*input, **kwargs)
  File "/home/jahnke/miniconda3/envs/dense/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/dstore/home/jahnke/Dense/src/dense/modeling.py", line 112, in forward
    q_hidden, q_reps = self.encode_query(query)
  File "/dstore/home/jahnke/Dense/src/dense/modeling.py", line 183, in encode_query
    qry_out = self.lm_q(**qry, return_dict=True)
  File "/home/jahnke/miniconda3/envs/dense/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/jahnke/miniconda3/envs/dense/lib/python3.8/site-packages/transformers/models/bert/modeling_bert.py", line 951, in forward
    embedding_output = self.embeddings(
  File "/home/jahnke/miniconda3/envs/dense/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/jahnke/miniconda3/envs/dense/lib/python3.8/site-packages/transformers/models/bert/modeling_bert.py", line 200, in forward
    inputs_embeds = self.word_embeddings(input_ids)
  File "/home/jahnke/miniconda3/envs/dense/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/jahnke/miniconda3/envs/dense/lib/python3.8/site-packages/torch/nn/modules/sparse.py", line 145, in forward
    return F.embedding(
  File "/home/jahnke/miniconda3/envs/dense/lib/python3.8/site-packages/torch/nn/functional.py", line 1913, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: Input, output and indices must be on the current device
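For context, this error from `torch.embedding` generally means the index tensor lives on a different device than the embedding weight, e.g. inputs left on another GPU while the replica's weights sit on device 1. Outside of DataParallel's own scattering, the usual pattern is to move the batch onto the model's device before the forward pass; a minimal CPU-runnable sketch (the helper name `to_model_device` is hypothetical, not part of Dense):

```python
import torch

def to_model_device(batch, model):
    """Move a dict of tensors (e.g. tokenizer output) onto the device
    of the model's parameters, so indices and weights match."""
    device = next(model.parameters()).device
    return {k: v.to(device) for k, v in batch.items()}

model = torch.nn.Embedding(10, 4)
batch = {"input_ids": torch.tensor([[1, 2, 3]])}
batch = to_model_device(batch, model)

# With indices and weights on the same device, the lookup succeeds.
out = model(batch["input_ids"])
print(out.shape)  # torch.Size([1, 3, 4])
```

In the multi-GPU case here the mismatch happens inside the DataParallel replicas themselves, which is why the same command works with a single visible GPU.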

Environment

python == 3.8.12
pytorch == 1.8.0
faiss-cpu == 1.6.5
transformers == 4.2.0
datasets == 1.1.3

CUDA Version: 10.1
Operating System: Debian GNU/Linux 10 (buster)
Kernel: Linux 4.19.0-18-amd64
GPUs: 4x GTX 1080Ti 11GB
CPU: Intel E5-2620v4

luyug commented 2 years ago

We have deprecated Dense in favor of Tevatron. Can you try the new repo and open an issue there if the error persists?