Hey there,
I think I am running into the same problem. Any updates?
Best regards, Thomas Chaton.
@tchaton Unfortunately I haven't been able to resolve this issue :(
Thanks for the question. Have you tried setting the NCCL_BLOCKING_WAIT
(or, if you are using PyTorch nightly, NCCL_ASYNC_ERROR_HANDLING)
env var on your trainers?
https://pytorch.org/docs/stable/distributed.html
Hey @kiukchung, thanks for the pointer! Setting the environment variable NCCL_BLOCKING_WAIT=1 (export NCCL_BLOCKING_WAIT=1 before launching the workers)
makes the previously-hanging worker throw the following error, which is subsequently caught by the elastic agent. A sketch of where the variable has to be set follows the traceback.
Traceback (most recent call last):
  File "worker.py", line 251, in <module>
    parse_args()
  File "worker.py", line 247, in parse_args
    init_processes(0, args)
  File "worker.py", line 220, in init_processes
    train(args)
  File "worker.py", line 130, in train
    update_gradients(model)
  File "worker.py", line 55, in update_gradients
    dist.all_reduce(param.grad.data, op=dist.ReduceOp.SUM)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py", line 948, in all_reduce
    work.wait()
RuntimeError: NCCL error: unhandled system error, NCCL version 2.7.8
[ERROR] 2020-11-16 21:08:51,160 local_elastic_agent: [default] Worker group failed
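For anyone hitting the same hang: NCCL_BLOCKING_WAIT is only picked up when the NCCL process group is created, so it has to be exported in the launch environment or set in os.environ before dist.init_process_group() runs. Below is a minimal sketch of that ordering; the init arguments and the 60-second timeout are illustrative assumptions, not the actual worker.py code.

import os
from datetime import timedelta

import torch.distributed as dist

# Must be set before the NCCL process group is created, otherwise it has no effect.
os.environ["NCCL_BLOCKING_WAIT"] = "1"

# Illustrative init call: the timeout bounds how long a collective may block
# before NCCL_BLOCKING_WAIT turns the hang into a RuntimeError.
dist.init_process_group(
    backend="nccl",
    init_method="env://",
    timeout=timedelta(seconds=60),
)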
Thanks again for the quick help! Closing this issue.
Context
I have been using torchelastic for a while to launch fault-tolerant jobs on CPUs using the gloo backend. I was switching to GPUs so that I can use broadcast and reduce. I first made the necessary modifications to move everything onto GPUs. Then, I changed the backend for group initialization from gloo to nccl, hoping things would work as before. However, with nccl, when some workers get killed, the remaining workers stay in the previous rendezvous and hang, whereas the elastic agent should detect the worker failure and halt all workers.
Current Behavior
When using the nccl backend, when a worker is killed, the remaining workers hang instead of throwing a RuntimeError during all_reduce() like they do with the gloo backend. The workers that are killed output this (which is expected):
However, for the remaining workers, the elastic agent doesn't declare the process group as failed. Here is the log obtained with export NCCL_DEBUG=INFO:
Expected Behavior
Just like with gloo, after some workers are killed, the remaining workers should detect the missing member during all_reduce() and throw a RuntimeError, so that the local_elastic_agent can mark the worker group as failed, halt the training, and wait for a new worker to join the next rendezvous. The workers that are killed should output this:
The surviving workers should output this:
More details
I use dist.init_process_group(backend='gloo', init_method='env://') to initialize the process group.
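For reference, here is a reconstruction of the step where the hang happens, based on the traceback above. The all_reduce call is taken verbatim from the traceback; the surrounding loop and the averaging by world size are assumptions.

import torch.distributed as dist

def update_gradients(model):
    # Manually sum gradients across ranks and average them.
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is None:
            continue
        # With gloo this call raises a RuntimeError as soon as a peer disappears;
        # with nccl it hangs here unless NCCL_BLOCKING_WAIT (or
        # NCCL_ASYNC_ERROR_HANDLING on nightly) is set.
        dist.all_reduce(param.grad.data, op=dist.ReduceOp.SUM)
        param.grad.data /= world_size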