stanford-futuredata / ColBERT

ColBERT: state-of-the-art neural search (SIGIR'20, TACL'21, NeurIPS'21, NAACL'22, CIKM'22, ACL'23, EMNLP'23)
MIT License

How to get rid of the "Duplicate GPU detected : rank 0 and rank 1 both on CUDA device ca000" error while training the ColBERTv1.9 model? #331

Open Aritra02091998 opened 3 months ago

Aritra02091998 commented 3 months ago

I am trying to fine-tune ColBERT v1.9 on my own dataset for retrieval, but training fails with the following error:

torch.distributed.DistBackendError: NCCL error in: /opt/conda/conda-bld/pytorch_1702400431970/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1333, invalid usage (run with NCCL_DEBUG=WARN for details), NCCL version 2.18.6 ncclInvalidUsage: This usually reflects invalid usage of NCCL library. Last error: Duplicate GPU detected : rank 0 and rank 1 both on CUDA device ca000

I guess it is an issue with the torch.distributed settings. How can I resolve this?

My specifications are:

- Single NVIDIA A40 GPU
- Conda package manager
- Python 3.8
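NCCL reports "Duplicate GPU detected" when two ranks of the same process group are bound to the same CUDA device. With a single A40, that would be consistent with training being launched with two ranks, although the post does not show the launch settings, so this is only one plausible reading. A minimal sketch of a guard against that situation follows; `detect_gpu_count` and `clamp_nranks` are hypothetical helper names for illustration, not part of ColBERT or PyTorch, and only `torch.cuda.device_count()` is an existing PyTorch call.

```python
def detect_gpu_count() -> int:
    """Return the number of CUDA devices PyTorch can see (0 if torch is not installed)."""
    try:
        import torch
    except ImportError:
        return 0
    return torch.cuda.device_count()


def clamp_nranks(requested: int, gpu_count: int) -> int:
    """Hypothetical helper: never request more ranks than there are visible GPUs."""
    if gpu_count < 1:
        raise ValueError("no CUDA device visible; NCCL training needs at least one GPU")
    # One rank per GPU at most, and always at least one rank.
    return max(1, min(requested, gpu_count))


if __name__ == "__main__":
    gpus = detect_gpu_count()
    print(f"visible GPUs: {gpus}")
    if gpus:
        # Asking for 2 ranks is just an example request.
        print(f"nranks to use: {clamp_nranks(2, gpus)}")
```

Assuming training is started through ColBERT's `Run().context(RunConfig(nranks=...))` as in the repository README, the clamped value could be passed as `nranks`. Running with `NCCL_DEBUG=WARN`, as the error message itself suggests, may help confirm whether two ranks are in fact landing on the same device.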

4entertainment commented 1 month ago

Hello,

I don't have a solution for the problem you are experiencing, but I wish you good luck. I would like to ask a question: could you share the code you used for fine-tuning "ColBERT v1.9 on my specific dataset for retrieval"?

Thank you for your interest.