Hi,
I am facing issues while running colbert.train in a single-node, multi-GPU setting. After setting "CUDA_VISIBLE_DEVICES=0,1", I am running the command below:
"python -m torch.distributed.run --nproc_per_node=2 -m colbert.train --amp --accum 1 --triples /home/lakshyakumar/ColBERT/MSMARCO-Passage-Ranking/Baselines/data/triples.train.small1M.tsv"
But on execution, it fails with the following error:
```
[Jan 27, 07:56:01] Traceback (most recent call last):
File "/home/lakshyakumar/ColBERT_lakshya/colbert/utils/runs.py", line 70, in context
yield
File "/home/lakshyakumar/ColBERT_lakshya/colbert/train.py", line 30, in main
train(args)
File "/home/lakshyakumar/ColBERT_lakshya/colbert/training/training.py", line 78, in train
find_unused_parameters=True)
File "/home/lakshyakumar/anaconda3/envs/colbert-v0.4/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 646, in __init__
_verify_param_shape_across_processes(self.process_group, parameters)
File "/home/lakshyakumar/anaconda3/envs/colbert-v0.4/lib/python3.7/site-packages/torch/distributed/utils.py", line 89, in _verify_param_shape_across_processes
return dist._verify_params_across_processes(process_group, tensors, logger)
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1656352464346/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1191, invalid usage, NCCL version 2.10.3
ncclInvalidUsage: This usually reflects invalid usage of NCCL library (such as too many async ops, too many collectives at once, mixing streams in a group, etc).
[Jan 27, 07:56:01] Traceback (most recent call last):
File "/home/lakshyakumar/ColBERT_lakshya/colbert/utils/runs.py", line 70, in context
yield
File "/home/lakshyakumar/ColBERT_lakshya/colbert/train.py", line 30, in main
train(args)
File "/home/lakshyakumar/ColBERT_lakshya/colbert/training/training.py", line 78, in train
find_unused_parameters=True)
File "/home/lakshyakumar/anaconda3/envs/colbert-v0.4/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 646, in __init__
_verify_param_shape_across_processes(self.process_group, parameters)
File "/home/lakshyakumar/anaconda3/envs/colbert-v0.4/lib/python3.7/site-packages/torch/distributed/utils.py", line 89, in _verify_param_shape_across_processes
return dist._verify_params_across_processes(process_group, tensors, logger)
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1656352464346/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1191, invalid usage, NCCL version 2.10.3
ncclInvalidUsage: This usually reflects invalid usage of NCCL library (such as too many async ops, too many collectives at once, mixing streams in a group, etc).
Traceback (most recent call last):
File "/home/lakshyakumar/anaconda3/envs/colbert-v0.4/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/home/lakshyakumar/anaconda3/envs/colbert-v0.4/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/lakshyakumar/ColBERT_lakshya/colbert/train.py", line 34, in <module>
main()
File "/home/lakshyakumar/ColBERT_lakshya/colbert/train.py", line 30, in main
train(args)
File "/home/lakshyakumar/ColBERT_lakshya/colbert/training/training.py", line 78, in train
find_unused_parameters=True)
File "/home/lakshyakumar/anaconda3/envs/colbert-v0.4/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 646, in __init__
_verify_param_shape_across_processes(self.process_group, parameters)
File "/home/lakshyakumar/anaconda3/envs/colbert-v0.4/lib/python3.7/site-packages/torch/distributed/utils.py", line 89, in _verify_param_shape_across_processes
return dist._verify_params_across_processes(process_group, tensors, logger)
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1656352464346/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1191, invalid usage, NCCL version 2.10.3
ncclInvalidUsage: This usually reflects invalid usage of NCCL library (such as too many async ops, too many collectives at once, mixing streams in a group, etc).
Traceback (most recent call last):
File "/home/lakshyakumar/anaconda3/envs/colbert-v0.4/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/home/lakshyakumar/anaconda3/envs/colbert-v0.4/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/lakshyakumar/ColBERT_lakshya/colbert/train.py", line 34, in <module>
main()
File "/home/lakshyakumar/ColBERT_lakshya/colbert/train.py", line 30, in main
train(args)
File "/home/lakshyakumar/ColBERT_lakshya/colbert/training/training.py", line 78, in train
find_unused_parameters=True)
File "/home/lakshyakumar/anaconda3/envs/colbert-v0.4/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 646, in __init__
_verify_param_shape_across_processes(self.process_group, parameters)
File "/home/lakshyakumar/anaconda3/envs/colbert-v0.4/lib/python3.7/site-packages/torch/distributed/utils.py", line 89, in _verify_param_shape_across_processes
return dist._verify_params_across_processes(process_group, tensors, logger)
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1656352464346/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1191, invalid usage, NCCL version 2.10.3
ncclInvalidUsage: This usually reflects invalid usage of NCCL library (such as too many async ops, too many collectives at once, mixing streams in a group, etc).
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 71894) of binary: /home/lakshyakumar/anaconda3/envs/colbert-v0.4/bin/python
Traceback (most recent call last):
File "/home/lakshyakumar/anaconda3/envs/colbert-v0.4/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/home/lakshyakumar/anaconda3/envs/colbert-v0.4/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/lakshyakumar/anaconda3/envs/colbert-v0.4/lib/python3.7/site-packages/torch/distributed/run.py", line 765, in <module>
main()
File "/home/lakshyakumar/anaconda3/envs/colbert-v0.4/lib/python3.7/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
return f(*args, **kwargs)
File "/home/lakshyakumar/anaconda3/envs/colbert-v0.4/lib/python3.7/site-packages/torch/distributed/run.py", line 761, in main
run(args)
File "/home/lakshyakumar/anaconda3/envs/colbert-v0.4/lib/python3.7/site-packages/torch/distributed/run.py", line 755, in run
)(*cmd_args)
File "/home/lakshyakumar/anaconda3/envs/colbert-v0.4/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/lakshyakumar/anaconda3/envs/colbert-v0.4/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 247, in launch_agent
failures=result.failures,
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
colbert.train FAILED
------------------------------------------------------------
Failures:
[1]:
time : 2023-01-27_07:56:07
host : br1u41-s1-09
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 71895)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2023-01-27_07:56:07
host : br1u41-s1-09
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 71894)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
```
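For reference, the failing call at training.py line 78 is the DistributedDataParallel wrapping of the model. Below is a minimal, self-contained sketch of that kind of initialization (my own approximation with a placeholder model and a standard torch.distributed.run launch, not the exact ColBERT code) that goes through the same parameter-verification step where the NCCL error is raised:

```python
# Minimal sketch, NOT the actual ColBERT training code: a placeholder model
# wrapped in DDP the same way the traceback shows (find_unused_parameters=True).
# Launch with: python -m torch.distributed.run --nproc_per_node=2 ddp_sketch.py
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


def main():
    local_rank = int(os.environ["LOCAL_RANK"])  # set by torch.distributed.run
    torch.cuda.set_device(local_rank)           # pin this process to one GPU
    dist.init_process_group(backend="nccl")     # NCCL backend, as in the error

    # Placeholder model standing in for the ColBERT/BERT model.
    model = torch.nn.Linear(128, 128).cuda(local_rank)

    # The DDP constructor is where _verify_param_shape_across_processes runs
    # and where the reported ncclInvalidUsage error is raised in my run.
    model = DDP(model,
                device_ids=[local_rank],
                output_device=local_rank,
                find_unused_parameters=True)

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```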
I am running the code from the colbertv1 branch, since I wanted to try it out in my setup. Please help me get it running in a multi-GPU setting. Thanks.