Open CharvinMei opened 5 months ago
This looks like a torchrun and NCCL issue. Are you able to run it with a single GPU?
Yes, I can run it with a single GPU. But when it’s set up for two GPUs, an error occurs.
Weird. Could you try disabling CUDA Graph to see if it works? You can simply pass use_cuda_graph=False here.
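For reference, a minimal sketch of that change, assuming "here" refers to the DistriConfig constructor call in scripts/sdxl_example.py. The height, width, and warmup_steps values are illustrative assumptions, and build_config is a hypothetical helper name, not part of the library.

```python
# Keyword arguments for DistriConfig. Only use_cuda_graph comes from the
# suggestion above; the other values are illustrative assumptions.
CONFIG_KWARGS = {
    "height": 1024,
    "width": 1024,
    "warmup_steps": 4,
    "use_cuda_graph": False,  # disable CUDA Graph capture
}


def build_config():
    """Build the config (hypothetical helper; call it only inside a torchrun launch)."""
    # Imported lazily so this file can be read or imported without
    # distrifuser installed or a distributed launch in progress.
    from distrifuser.utils import DistriConfig

    return DistriConfig(**CONFIG_KWARGS)
```

If the crash disappears with this setting, the problem is likely in how graph capture interacts with the pending NCCL work mentioned in the warning.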
[rank1]:[W CUDAGraph.cpp:145] Warning: Waiting for pending NCCL work to finish before starting graph capture. (function operator())
[rank0]:[W CUDAGraph.cpp:145] Warning: Waiting for pending NCCL work to finish before starting graph capture. (function operator())
terminate called after throwing an instance of 'std::runtime_error'
what(): terminate called after throwing an instance of 'NCCL Error 5: invalid usage (run with NCCL_DEBUG=WARN for details)std::runtime_error '
what(): NCCL Error 5: invalid usage (run with NCCL_DEBUG=WARN for details)
[2024-06-12 03:45:51,447] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: -6) local_rank: 0 (pid: 10453) of binary: /root/anaconda3/envs/distrifuser/bin/python
Traceback (most recent call last):
  File "/root/anaconda3/envs/distrifuser/bin/torchrun", line 8, in
    sys.exit(main())
  File "/root/anaconda3/envs/distrifuser/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/root/anaconda3/envs/distrifuser/lib/python3.10/site-packages/torch/distributed/run.py", line 812, in main
    run(args)
  File "/root/anaconda3/envs/distrifuser/lib/python3.10/site-packages/torch/distributed/run.py", line 803, in run
    elastic_launch(
  File "/root/anaconda3/envs/distrifuser/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 135, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/root/anaconda3/envs/distrifuser/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
scripts/sdxl_example.py FAILED
Failures:
[1]:
  time      : 2024-06-12_03:45:51
  host      : 692d3f5c0349
  rank      : 1 (local_rank: 1)
  exitcode  : -6 (pid: 10454)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 10454
Root Cause (first observed failure):
[0]:
  time      : 2024-06-12_03:45:51
  host      : 692d3f5c0349
  rank      : 0 (local_rank: 0)
  exitcode  : -6 (pid: 10453)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 10453
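The error message itself suggests rerunning with NCCL_DEBUG=WARN to get more detail on the "invalid usage" failure. A minimal sketch of setting that up from Python, assuming the job is launched with torchrun using two processes on one node, as the two ranks in the log suggest. build_debug_launch is a hypothetical helper name, not part of torchrun or DistriFuser.

```python
import os


def build_debug_launch(script, nproc=2, base_env=None):
    """Return (cmd, env) for a torchrun launch with NCCL warnings enabled."""
    # Copy the environment so the caller's os.environ is left untouched.
    env = dict(os.environ if base_env is None else base_env)
    env["NCCL_DEBUG"] = "WARN"  # the setting named in the error message
    cmd = ["torchrun", f"--nproc_per_node={nproc}", script]
    return cmd, env


cmd, env = build_debug_launch("scripts/sdxl_example.py")
print(" ".join(cmd))
# To actually launch the job:
#   import subprocess
#   subprocess.run(cmd, env=env, check=False)
```

The same effect can be had by exporting NCCL_DEBUG=WARN in the shell before calling torchrun. The extra NCCL output printed before the abort should help narrow down which call is being rejected.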