texttron / tevatron

Tevatron - A flexible toolkit for neural retrieval research and development.
http://tevatron.ai
Apache License 2.0

DDP training on multiple GPUs: "Expected all tensors to be on the same device, but found at least two devices" #117

Closed yxk9810 closed 3 months ago

yxk9810 commented 5 months ago

```
File "/opt/conda/lib/python3.10/site-packages/torch/nn/functional.py", line 2199, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0! (when checking argument for argument index in method wrapper__index_select)
```

pytorch==1.12.0 transformers==4.20.0

Running on 2× T4 GPUs on Kaggle: https://www.kaggle.com/code/jackiewu/notebook738aa0f5d2
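This error usually means the input tensors live on a different GPU than the embedding weights when `torch.embedding` is called. A minimal sketch of the failure mode and the usual fix (all names here are illustrative, not Tevatron's code; a CPU fallback is included so the snippet runs anywhere):

```python
import torch

# Each DDP rank should move its batch to its own local device before
# the forward pass; mixing cuda:0 inputs with cuda:1 weights raises
# the RuntimeError shown in the traceback above.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Stand-in for the model's embedding layer, placed on this rank's device.
embedding = torch.nn.Embedding(num_embeddings=100, embedding_dim=8).to(device)

# Hypothetical batch of token ids; a DataLoader yields these on CPU.
input_ids = torch.randint(0, 100, (2, 4))

# The fix: .to(device) puts the inputs on the same device as the weights.
output = embedding(input_ids.to(device))
print(tuple(output.shape))
```

On a multi-GPU run, each process would use its own `cuda:<local_rank>` device rather than a shared one; the Trainer normally handles this, so a mismatch often points to a tensor created manually outside the batch-collation path.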

yxk9810 commented 5 months ago

Running command:

```shell
! CUDA_VISIBLE_DEVICES=0,1 python -m tevatron.driver.train \
    --output_dir model_msmarco \
    --model_name_or_path bert-base-uncased \
    --save_steps 1000 \
    --train_dir /kaggle/working/train_tevatron_100.json \
    --fp16 \
    --per_device_train_batch_size 2 \
    --train_n_passages 8 \
    --learning_rate 5e-6 \
    --q_max_len 64 \
    --p_max_len 460 \
    --num_train_epochs 3 \
    --logging_steps 500 \
    --overwrite_output_dir
```

MXueguang commented 3 months ago

Will adding `--negatives_x_device` help?
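For context, a flag like `--negatives_x_device` typically makes each GPU gather passage embeddings from all other GPUs so the contrastive loss sees the full cross-device pool of in-batch negatives. A conceptual sketch of that gather step (not Tevatron's actual implementation; a single-process `gloo` group is used only so the snippet runs without multiple GPUs):

```python
import os
import torch
import torch.distributed as dist

# Single-process process group purely for illustration; real DDP
# training launches one process per GPU with world_size > 1.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

# This rank's passage embeddings (hypothetical shapes).
local_passages = torch.randn(4, 8)

# Collect every rank's embeddings into one pool of negatives.
gathered = [torch.zeros_like(local_passages) for _ in range(dist.get_world_size())]
dist.all_gather(gathered, local_passages)

# all_gather returns tensors without autograd history, so the local
# slice is re-inserted to keep gradients flowing for this rank.
gathered[dist.get_rank()] = local_passages
all_passages = torch.cat(gathered, dim=0)  # (4 * world_size, 8)

dist.destroy_process_group()
```

The loss is then computed against `all_passages` instead of only the local batch, which is why the flag changes the effective number of negatives per query.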