mit-han-lab / distrifuser

[CVPR 2024 Highlight] DistriFusion: Distributed Parallel Inference for High-Resolution Diffusion Models
https://hanlab.mit.edu/projects/distrifusion
MIT License

I have a question about running the code: there is an error when running the command `torchrun --nproc_per_node=2 scripts/sdxl_example.py`. My torch version is 2.2.1, CUDA version is 11.8, and Python version is 3.10. #13

CharvinMei opened this issue 5 months ago

CharvinMei commented 5 months ago

```
[rank1]:[W CUDAGraph.cpp:145] Warning: Waiting for pending NCCL work to finish before starting graph capture. (function operator())
[rank0]:[W CUDAGraph.cpp:145] Warning: Waiting for pending NCCL work to finish before starting graph capture. (function operator())
terminate called after throwing an instance of 'std::runtime_error'
  what():  NCCL Error 5: invalid usage (run with NCCL_DEBUG=WARN for details)
terminate called after throwing an instance of 'std::runtime_error'
  what():  NCCL Error 5: invalid usage (run with NCCL_DEBUG=WARN for details)
[2024-06-12 03:45:51,447] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: -6) local_rank: 0 (pid: 10453) of binary: /root/anaconda3/envs/distrifuser/bin/python
Traceback (most recent call last):
  File "/root/anaconda3/envs/distrifuser/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/root/anaconda3/envs/distrifuser/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/root/anaconda3/envs/distrifuser/lib/python3.10/site-packages/torch/distributed/run.py", line 812, in main
    run(args)
  File "/root/anaconda3/envs/distrifuser/lib/python3.10/site-packages/torch/distributed/run.py", line 803, in run
    elastic_launch(
  File "/root/anaconda3/envs/distrifuser/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 135, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/root/anaconda3/envs/distrifuser/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

scripts/sdxl_example.py FAILED

Failures:
[1]:
  time       : 2024-06-12_03:45:51
  host       : 692d3f5c0349
  rank       : 1 (local_rank: 1)
  exitcode   : -6 (pid: 10454)
  error_file : <N/A>
  traceback  : Signal 6 (SIGABRT) received by PID 10454

Root Cause (first observed failure):
[0]:
  time       : 2024-06-12_03:45:51
  host       : 692d3f5c0349
  rank       : 0 (local_rank: 0)
  exitcode   : -6 (pid: 10453)
  error_file : <N/A>
  traceback  : Signal 6 (SIGABRT) received by PID 10453
```

lmxyy commented 5 months ago

This looks like a torchrun/NCCL issue. Are you able to run it with a single GPU?
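The error message itself points to the next diagnostic step. A sketch of the two checks, assuming the same environment as the report (the `NCCL_DEBUG=WARN` variable is standard NCCL, not specific to this repo):

```shell
# Re-run the failing two-GPU command with NCCL warnings enabled,
# as the "invalid usage" message suggests, to get more detail:
NCCL_DEBUG=WARN torchrun --nproc_per_node=2 scripts/sdxl_example.py

# Sanity check on a single GPU, which avoids multi-GPU NCCL collectives:
torchrun --nproc_per_node=1 scripts/sdxl_example.py
```

If the single-GPU run succeeds, the problem is likely in the NCCL setup rather than the model code.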

CharvinMei commented 5 months ago

Yes, I can run it with a single GPU, but when it is set up for two GPUs the error occurs.

lmxyy commented 5 months ago

Weird. Could you try disabling the CUDA graph to see if it works? You can simply pass `use_cuda_graph=False` here.
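For reference, a minimal sketch of where that flag goes, following the pipeline setup in `scripts/sdxl_example.py` (the prompt and seed below are illustrative; this assumes `DistriConfig` accepts `use_cuda_graph`, as the linked line suggests, and requires GPUs to run):

```python
import torch

from distrifuser.pipelines import DistriSDXLPipeline
from distrifuser.utils import DistriConfig

# Disable CUDA graph capture, which is where the NCCL error above is raised.
distri_config = DistriConfig(height=1024, width=1024, use_cuda_graph=False)

pipeline = DistriSDXLPipeline.from_pretrained(
    distri_config=distri_config,
    pretrained_model_name_or_path="stabilityai/stable-diffusion-xl-base-1.0",
    variant="fp16",
    use_safetensors=True,
)

image = pipeline(
    prompt="Astronaut in a jungle, cold color palette, detailed, 8k",
    generator=torch.Generator(device="cuda").manual_seed(233),
).images[0]
```

Launched the same way as before: `torchrun --nproc_per_node=2 scripts/sdxl_example.py`.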

CharvinMei commented 4 months ago

After changing the setting, the error became the following:

```
[rank0]:[W CUDAGraph.cpp:145] Warning: Waiting for pending NCCL work to finish before starting graph capture. (function operator())
[rank1]:[W CUDAGraph.cpp:145] Warning: Waiting for pending NCCL work to finish before starting graph capture. (function operator())
[2024-07-08 06:10:31,482] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: -6) local_rank: 0 (pid: 99442) of binary: /home/meichangwang/miniconda3/envs/distrifuser/bin/python
Traceback (most recent call last):
  File "/home/meichangwang/miniconda3/envs/distrifuser/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/meichangwang/miniconda3/envs/distrifuser/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/home/meichangwang/miniconda3/envs/distrifuser/lib/python3.10/site-packages/torch/distributed/run.py", line 812, in main
    run(args)
  File "/home/meichangwang/miniconda3/envs/distrifuser/lib/python3.10/site-packages/torch/distributed/run.py", line 803, in run
    elastic_launch(
  File "/home/meichangwang/miniconda3/envs/distrifuser/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 135, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/meichangwang/miniconda3/envs/distrifuser/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

scripts/sdxl_example.py FAILED

Failures:
[1]:
  time       : 2024-07-08_06:10:31
  host       : ubuntu
  rank       : 1 (local_rank: 1)
  exitcode   : -6 (pid: 99443)
  error_file : <N/A>
  traceback  : Signal 6 (SIGABRT) received by PID 99443

Root Cause (first observed failure):
[0]:
  time       : 2024-07-08_06:10:31
  host       : ubuntu
  rank       : 0 (local_rank: 0)
  exitcode   : -6 (pid: 99442)
  error_file : <N/A>
  traceback  : Signal 6 (SIGABRT) received by PID 99442
```