texttron / tevatron

Tevatron - A flexible toolkit for neural retrieval research and development.
http://tevatron.ai
Apache License 2.0

[RuntimeError: Input, output and indices must be on the current device] when training with multiple GPUs #17

Closed. marceljahnke closed this issue 1 year ago.

marceljahnke commented 2 years ago

Bug

When following the MS MARCO passage ranking example, a RuntimeError is raised when training with multiple GPUs.

Starting the training via

python -m tevatron.driver.train \
  --output_dir ./retriever_model \
  --model_name_or_path bert-base-uncased \
  --save_steps 20000 \
  --train_dir ./marco/bert/train \
  --fp16 \
  --per_device_train_batch_size 2 \
  --learning_rate 5e-6 \
  --num_train_epochs 2 \
  --dataloader_num_workers 2

produces:

RuntimeError: Input, output and indices must be on the current device

Note: When running the training with the above command and only one visible GPU, the training starts and runs correctly.
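(For reference, restricting the run to a single visible GPU can be done via CUDA_VISIBLE_DEVICES, along the lines of:

CUDA_VISIBLE_DEVICES=0 python -m tevatron.driver.train \
  --output_dir ./retriever_model \
  --model_name_or_path bert-base-uncased \
  ...   (remaining arguments as in the command above)
)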

Full Error Message

Traceback (most recent call last):
  File "/home/jahnke/miniconda3/envs/tevatron/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/jahnke/miniconda3/envs/tevatron/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/dstore/home/jahnke/master-dense-retrieval/tevatron/src/tevatron/driver/train.py", line 118, in <module>
    main()
  File "/dstore/home/jahnke/master-dense-retrieval/tevatron/src/tevatron/driver/train.py", line 110, in main
    trainer.train(
  File "/home/jahnke/miniconda3/envs/tevatron/lib/python3.8/site-packages/transformers/trainer.py", line 1286, in train
    tr_loss += self.training_step(model, inputs)
  File "/dstore/home/jahnke/master-dense-retrieval/tevatron/src/tevatron/trainer.py", line 65, in training_step
    return super(DenseTrainer, self).training_step(*args) / self._dist_loss_scale_factor
  File "/home/jahnke/miniconda3/envs/tevatron/lib/python3.8/site-packages/transformers/trainer.py", line 1777, in training_step
    loss = self.compute_loss(model, inputs)
  File "/dstore/home/jahnke/master-dense-retrieval/tevatron/src/tevatron/trainer.py", line 62, in compute_loss
    return model(query=query, passage=passage).loss
  File "/home/jahnke/miniconda3/envs/tevatron/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/jahnke/miniconda3/envs/tevatron/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 167, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/home/jahnke/miniconda3/envs/tevatron/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 177, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/home/jahnke/miniconda3/envs/tevatron/lib/python3.8/site-packages/torch/nn/parallel/parallel_apply.py", line 86, in parallel_apply
    output.reraise()
  File "/home/jahnke/miniconda3/envs/tevatron/lib/python3.8/site-packages/torch/_utils.py", line 429, in reraise
    raise self.exc_type(msg)
RuntimeError: Caught RuntimeError in replica 1 on device 1.
Original Traceback (most recent call last):
  File "/home/jahnke/miniconda3/envs/tevatron/lib/python3.8/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
    output = module(*input, **kwargs)
  File "/home/jahnke/miniconda3/envs/tevatron/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/dstore/home/jahnke/master-dense-retrieval/tevatron/src/tevatron/modeling.py", line 107, in forward
    q_hidden, q_reps = self.encode_query(query)
  File "/dstore/home/jahnke/master-dense-retrieval/tevatron/src/tevatron/modeling.py", line 173, in encode_query
    qry_out = self.lm_q(**qry, return_dict=True)
  File "/home/jahnke/miniconda3/envs/tevatron/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/jahnke/miniconda3/envs/tevatron/lib/python3.8/site-packages/transformers/models/bert/modeling_bert.py", line 984, in forward
    embedding_output = self.embeddings(
  File "/home/jahnke/miniconda3/envs/tevatron/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/jahnke/miniconda3/envs/tevatron/lib/python3.8/site-packages/transformers/models/bert/modeling_bert.py", line 215, in forward
    inputs_embeds = self.word_embeddings(input_ids)
  File "/home/jahnke/miniconda3/envs/tevatron/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/jahnke/miniconda3/envs/tevatron/lib/python3.8/site-packages/torch/nn/modules/sparse.py", line 156, in forward
    return F.embedding(
  File "/home/jahnke/miniconda3/envs/tevatron/lib/python3.8/site-packages/torch/nn/functional.py", line 1916, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: Input, output and indices must be on the current device

Environment

python == 3.8.12
pytorch == 1.8.2
faiss-cpu == 1.7.1
transformers == 4.9.2
datasets == 1.11.0

CUDA Version: 10.1
Operating System: Debian GNU/Linux 10 (buster)
Kernel: Linux 4.19.0-18-amd64
GPUs: 4x GTX 1080Ti 11GB
CPU: Intel E5-2620v4

MXueguang commented 2 years ago

Hi @marceljahnke, Tevatron uses DDP (DistributedDataParallel) for multi-GPU training; launching the script directly makes the Trainer wrap the model in torch.nn.DataParallel (visible in your traceback), which is what appears to trigger the device mismatch.

I.e. run with python -m torch.distributed.launch --nproc_per_node=4 -m tevatron.driver.train, see e.g. https://github.com/texttron/tevatron/tree/main/examples/dpr#2-train
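Adapting the command from your report to the 4-GPU machine, the full invocation would presumably look like this:

python -m torch.distributed.launch --nproc_per_node=4 -m tevatron.driver.train \
  --output_dir ./retriever_model \
  --model_name_or_path bert-base-uncased \
  --save_steps 20000 \
  --train_dir ./marco/bert/train \
  --fp16 \
  --per_device_train_batch_size 2 \
  --learning_rate 5e-6 \
  --num_train_epochs 2 \
  --dataloader_num_workers 2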

marceljahnke commented 2 years ago

Hi @MXueguang, thank you for the answer. It worked.

Unfortunately, another error occurred during the search step:

python -m tevatron.faiss_retriever.reducer --score_dir ranking/intermediate --query encoding/qry.pt --save_ranking_to ranking/rank.txt
0%|                                                                    | 0/10 [00:00<?, ?it/s]Initializing Heap. Assuming 6980 queries.
Traceback (most recent call last):
  File "/home/jahnke/miniconda3/envs/tevatron/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/jahnke/miniconda3/envs/tevatron/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/dstore/home/jahnke/master-dense-retrieval/tevatron/src/tevatron/faiss_retriever/reducer.py", line 48, in <module>
    main()
  File "/dstore/home/jahnke/master-dense-retrieval/tevatron/src/tevatron/faiss_retriever/reducer.py", line 41, in main
    corpus_scores, corpus_indices = combine_faiss_results(map(torch.load, tqdm(partitions)))
  File "/dstore/home/jahnke/master-dense-retrieval/tevatron/src/tevatron/faiss_retriever/reducer.py", line 16, in combine_faiss_results
    rh.add_result(-scores, indices)
  File "/home/jahnke/miniconda3/envs/tevatron/lib/python3.8/site-packages/faiss/__init__.py", line 1622, in add_result
    swig_ptr(I), self.k)
  File "/home/jahnke/miniconda3/envs/tevatron/lib/python3.8/site-packages/faiss/swigfaiss.py", line 5700, in swig_ptr
    return _swigfaiss.swig_ptr(a)
ValueError: did not recognize array type
  0%|                                                                    | 0/10 [00:00<?, ?it/s]
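For reference, faiss's swig_ptr only recognizes contiguous numpy arrays of certain dtypes, so this error usually means the score/index arrays handed to ResultHeap.add_result are not plain numpy arrays (or not a supported dtype). A minimal, untested sketch of a defensive conversion; the helper name is hypothetical and I am assuming each loaded partition yields a (scores, indices) pair:

import numpy as np

def to_faiss_arrays(scores, indices):
    # Hypothetical helper: coerce whatever torch.load returned into the
    # contiguous numpy layout faiss expects (float32 scores, int64 indices)
    # before calling rh.add_result(-scores, indices).
    scores = np.ascontiguousarray(np.asarray(scores), dtype=np.float32)
    indices = np.ascontiguousarray(np.asarray(indices), dtype=np.int64)
    return scores, indices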
luyug commented 2 years ago

@MXueguang The reducer problem... Have we decided how to deal with https://github.com/texttron/tevatron/pull/13?