stanford-futuredata / ColBERT

ColBERT: state-of-the-art neural search (SIGIR'20, TACL'21, NeurIPS'21, NAACL'22, CIKM'22, ACL'23, EMNLP'23)
MIT License

Process waiting on barrier statements #158

Open · LakshKD opened this issue 1 year ago

LakshKD commented 1 year ago

Hi,

While running multi-GPU training with the command below:

```
CUDA_VISIBLE_DEVICES="0,1" python -m torch.distributed.launch --nproc_per_node=2 -m colbert.train \
  --amp --doc_maxlen 180 --mask-punctuation --bsize 32 --accum 1 \
  --triples /home/lakshyakumar/ColBERT/MSMARCO-Passage-Ranking/Baselines/data/triples.train.small1M.tsv \
  --root /home/lakshyakumar/ColBERT_lakshya/ --experiment MSMARCO-psg \
  --similarity l2 --run msmarco.psg.l2
```

the code waits indefinitely on the `distributed.barrier(rank)` statement in the `runs.py` file. Please suggest a way to run it in a multi-GPU setting. I am running the colbertv1 branch code. My PyTorch and OS details are as follows:

1. PyTorch: 1.12.0 (`py3.7_cuda11.3_cudnn8.3.2_0`, `pytorch` channel)
2. OS: Ubuntu

In a single-GPU setting, the code works fine for me.
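A minimal sketch for isolating the hang (the file name `check_dist.py` is hypothetical, not part of ColBERT): the script below exercises the same NCCL barrier outside of ColBERT, launched the same way as the training command. If it also hangs, the problem is in the NCCL/driver setup rather than in `runs.py`.

```python
# check_dist.py -- hypothetical standalone diagnostic, not part of ColBERT.
# Launch like the training run:
#   CUDA_VISIBLE_DEVICES="0,1" python -m torch.distributed.launch --nproc_per_node=2 check_dist.py
import argparse
import os

import torch
import torch.distributed as dist

parser = argparse.ArgumentParser()
# torch.distributed.launch passes --local_rank to each worker process.
parser.add_argument("--local_rank", type=int,
                    default=int(os.environ.get("LOCAL_RANK", 0)))
args = parser.parse_args()

torch.cuda.set_device(args.local_rank)
dist.init_process_group(backend="nccl")  # reads MASTER_ADDR/PORT, RANK, WORLD_SIZE from the launcher

dist.barrier()  # if this hangs too, the issue is NCCL/driver setup, not runs.py
print(f"rank {dist.get_rank()} of {dist.get_world_size()} passed the barrier")
```

If even this bare barrier hangs, one common thing to try is setting `NCCL_P2P_DISABLE=1`, since broken GPU peer-to-peer transport is a frequent cause of such hangs.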

sarisel commented 1 year ago

Can you verify that you are actually getting 2 GPUs, e.g. by running `nvidia-smi` on the GPU node?
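The same check from within Python, as a minimal sketch (it only confirms that the process can see both devices, not that NCCL communication between them works):

```python
import torch

# Expect "True 2" when both GPUs are visible to the process.
print(torch.cuda.is_available(), torch.cuda.device_count())
```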

okhat commented 1 year ago

Yup, the above seems like a good check.

LakshKD commented 1 year ago

@sarisel Yes, I am trying to run it on 2 GPUs, as you can see below:

[image: screenshot showing both GPUs]