stanford-futuredata / ColBERT

ColBERT: state-of-the-art neural search (SIGIR'20, TACL'21, NeurIPS'21, NAACL'22, CIKM'22, ACL'23, EMNLP'23)
MIT License

Problem while running distributed training #157

Closed LakshKD closed 1 year ago

LakshKD commented 1 year ago

Hi,

I am facing issues while running `colbert.train` in a single-node, multi-GPU setting. After setting `CUDA_VISIBLE_DEVICES=0,1`, I run the following command:

```
python -m torch.distributed.run --nproc_per_node=2 -m colbert.train --amp --accum 1 --triples /home/lakshyakumar/ColBERT/MSMARCO-Passage-Ranking/Baselines/data/triples.train.small1M.tsv
```

On execution, it fails with the following error:

```
[Jan 27, 07:56:01] Traceback (most recent call last):
  File "/home/lakshyakumar/ColBERT_lakshya/colbert/utils/runs.py", line 70, in context
    yield
  File "/home/lakshyakumar/ColBERT_lakshya/colbert/train.py", line 30, in main
    train(args)
  File "/home/lakshyakumar/ColBERT_lakshya/colbert/training/training.py", line 78, in train
    find_unused_parameters=True)
  File "/home/lakshyakumar/anaconda3/envs/colbert-v0.4/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 646, in __init__
    _verify_param_shape_across_processes(self.process_group, parameters)
  File "/home/lakshyakumar/anaconda3/envs/colbert-v0.4/lib/python3.7/site-packages/torch/distributed/utils.py", line 89, in _verify_param_shape_across_processes
    return dist._verify_params_across_processes(process_group, tensors, logger)
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1656352464346/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1191, invalid usage, NCCL version 2.10.3
ncclInvalidUsage: This usually reflects invalid usage of NCCL library (such as too many async ops, too many collectives at once, mixing streams in a group, etc).

Traceback (most recent call last):
  File "/home/lakshyakumar/anaconda3/envs/colbert-v0.4/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/lakshyakumar/anaconda3/envs/colbert-v0.4/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/lakshyakumar/ColBERT_lakshya/colbert/train.py", line 34, in <module>
    main()
  File "/home/lakshyakumar/ColBERT_lakshya/colbert/train.py", line 30, in main
    train(args)
  File "/home/lakshyakumar/ColBERT_lakshya/colbert/training/training.py", line 78, in train
    find_unused_parameters=True)
  File "/home/lakshyakumar/anaconda3/envs/colbert-v0.4/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 646, in __init__
    _verify_param_shape_across_processes(self.process_group, parameters)
  File "/home/lakshyakumar/anaconda3/envs/colbert-v0.4/lib/python3.7/site-packages/torch/distributed/utils.py", line 89, in _verify_param_shape_across_processes
    return dist._verify_params_across_processes(process_group, tensors, logger)
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1656352464346/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1191, invalid usage, NCCL version 2.10.3
ncclInvalidUsage: This usually reflects invalid usage of NCCL library (such as too many async ops, too many collectives at once, mixing streams in a group, etc).
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 71894) of binary: /home/lakshyakumar/anaconda3/envs/colbert-v0.4/bin/python
Traceback (most recent call last):
  File "/home/lakshyakumar/anaconda3/envs/colbert-v0.4/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/lakshyakumar/anaconda3/envs/colbert-v0.4/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/lakshyakumar/anaconda3/envs/colbert-v0.4/lib/python3.7/site-packages/torch/distributed/run.py", line 765, in <module>
    main()
  File "/home/lakshyakumar/anaconda3/envs/colbert-v0.4/lib/python3.7/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/home/lakshyakumar/anaconda3/envs/colbert-v0.4/lib/python3.7/site-packages/torch/distributed/run.py", line 761, in main
    run(args)
  File "/home/lakshyakumar/anaconda3/envs/colbert-v0.4/lib/python3.7/site-packages/torch/distributed/run.py", line 755, in run
    )(*cmd_args)
  File "/home/lakshyakumar/anaconda3/envs/colbert-v0.4/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/lakshyakumar/anaconda3/envs/colbert-v0.4/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 247, in launch_agent
    failures=result.failures,
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
colbert.train FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2023-01-27_07:56:07
  host      : br1u41-s1-09
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 71895)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-01-27_07:56:07
  host      : br1u41-s1-09
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 71894)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
```

I am running the code from the colbertv1 branch, since I wanted to try it in my setting. Could you please help me get it running in a multi-GPU setup? Thanks.
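
To narrow this down, a minimal DDP sketch that mirrors the failing call in `training.py` (wrapping a model in `DistributedDataParallel` with `find_unused_parameters=True`, as in the traceback above) can show whether the NCCL failure is specific to ColBERT or to the environment. The file name `ddp_repro.py` and the toy `Linear` model are assumptions for illustration only:

```python
# ddp_repro.py -- illustrative sketch; launch with:
#   CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.run --nproc_per_node=2 ddp_repro.py
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


def main():
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)   # without this, every rank may try to use GPU 0
    dist.init_process_group(backend="nccl")

    # Toy model wrapped the same way the traceback shows
    # (DistributedDataParallel with find_unused_parameters=True).
    model = torch.nn.Linear(16, 4).cuda()
    ddp_model = DDP(model,
                    device_ids=[local_rank],
                    output_device=local_rank,
                    find_unused_parameters=True)

    out = ddp_model(torch.randn(2, 16, device="cuda"))
    out.sum().backward()
    print(f"rank {dist.get_rank()}: DDP forward/backward OK")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

If this also fails inside `_verify_param_shape_across_processes`, the cause is likely environmental (for example, a PyTorch/NCCL/driver mismatch, or both ranks binding to the same GPU) rather than anything ColBERT-specific.
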
liudan111 commented 1 year ago

I have the same problem.