pytorch / pytorch

Tensors and Dynamic neural networks in Python with strong GPU acceleration
https://pytorch.org

The ProcessGroupNCCL is not being destructed #129477

Open zhouzaida opened 1 week ago

zhouzaida commented 1 week ago

🐛 Describe the bug

I used PyTorch's multiprocessing to launch a multi-GPU task, as in the snippet below:

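import torch.multiprocessing as mp

# start_processes passes the process index as the first argument
def worker(rank):
    # init global nccl processgroup

    try:
        for _ in range(10000):
            pass  # model forward
    except BaseException:  # same as a bare except: also catches KeyboardInterrupt
        pass  # do something here

if __name__ == '__main__':
    mp.start_processes(worker, nprocs=2)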

If I terminate the program with Ctrl-C, the child processes are sometimes not killed, so the GPU memory is not released (the ProcessGroupNCCL is not destructed). Note that the process with rank 0 reaches "do something here", while the other process stays stuck at the model forward pass (it often gets stuck on GPU-related operations).
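As a launcher-side mitigation for the lingering children described above, the parent can give each child a grace period and then escalate to SIGTERM and SIGKILL. The following is a minimal sketch under stated assumptions, not a fix for the underlying behavior: it uses the standard library's multiprocessing as a stand-in for torch.multiprocessing, and stuck_worker, launch, and reap are hypothetical names introduced only for illustration. Killing a child this way does not run the ProcessGroupNCCL destructor; it only makes sure the process goes away.

```python
import multiprocessing as mp
import signal
import time


def stuck_worker(rank):
    # Hypothetical stand-in for a rank that does not react to Ctrl-C,
    # for example because it is blocked inside a GPU call.
    signal.signal(signal.SIGINT, signal.SIG_IGN)
    while True:
        time.sleep(0.1)


def launch(nprocs=2):
    # Assumption: a POSIX system. "fork" keeps the sketch short;
    # CUDA jobs normally need the "spawn" start method instead.
    ctx = mp.get_context("fork")
    procs = [ctx.Process(target=stuck_worker, args=(rank,)) for rank in range(nprocs)]
    for p in procs:
        p.start()
    return procs


def reap(procs, grace=1.0):
    """Wait up to `grace` seconds per child, then escalate:
    SIGTERM first, SIGKILL if the child is still alive."""
    for p in procs:
        p.join(timeout=grace)
        if p.is_alive():
            p.terminate()  # sends SIGTERM
            p.join(timeout=grace)
        if p.is_alive():
            p.kill()  # sends SIGKILL, which cannot be ignored
            p.join()
    return [p.exitcode for p in procs]


if __name__ == "__main__":
    procs = launch()
    # A real launcher would call reap() from an `except KeyboardInterrupt:`
    # or `finally:` block around its join loop; the sketch calls it directly
    # so that it terminates on its own.
    print(reap(procs, grace=0.5))
```

In a real launcher the grace period would be chosen long enough for ranks that do handle Ctrl-C to finish their own cleanup before the parent escalates.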

The parent process prints the following traceback:

Traceback (most recent call last):
  File "~/miniconda3/envs/py310-cu118-pt230/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 237, in start_processes
    while not context.join():
  File "~/miniconda3/envs/py310-cu118-pt230/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 117, in join
    ready = multiprocessing.connection.wait(
  File "~/miniconda3/envs/py310-cu118-pt230/lib/python3.10/multiprocessing/connection.py", line 931, in wait
    ready = selector.select(timeout)
  File "~/miniconda3/envs/py310-cu118-pt230/lib/python3.10/selectors.py", line 416, in select
    fd_event_list = self._selector.poll(timeout)
KeyboardInterrupt
^CException ignored in atexit callback: <function _exit_function at

Sometimes, however, the subprocesses are killed and print the following:

[rank1]:[I ProcessGroupNCCL.cpp:1109] [PG 1 Rank 1] ProcessGroupNCCL destructor entered.
[rank1]:[I ProcessGroupNCCL.cpp:1094] [PG 1 Rank 1] Launching ProcessGroupNCCL abort asynchrounously.
[rank1]:[I ProcessGroupNCCL.cpp:999] [PG 1 Rank 1] future is successfully executed for: ProcessGroup abort
[rank1]:[I ProcessGroupNCCL.cpp:1100] [PG 1 Rank 1] ProcessGroupNCCL aborts successfully.
[rank1]:[I ProcessGroupNCCL.cpp:1132] [PG 1 Rank 1] ProcessGroupNCCL watchdog thread joined.
[rank1]:[I ProcessGroupNCCL.cpp:1136] [PG 1 Rank 1] ProcessGroupNCCL heart beat monitor thread joined.
[rank1]:[I ProcessGroupNCCL.cpp:1109] [PG 2 Rank 1] ProcessGroupNCCL destructor entered.
[rank1]:[I ProcessGroupNCCL.cpp:1094] [PG 2 Rank 1] Launching ProcessGroupNCCL abort asynchrounously.
[rank1]:[I ProcessGroupNCCL.cpp:999] [PG 2 Rank 1] future is successfully executed for: ProcessGroup abort
[rank1]:[I ProcessGroupNCCL.cpp:1100] [PG 2 Rank 1] ProcessGroupNCCL aborts successfully.
[rank1]:[I ProcessGroupNCCL.cpp:1132] [PG 2 Rank 1] ProcessGroupNCCL watchdog thread joined.
[rank1]:[I ProcessGroupNCCL.cpp:1136] [PG 2 Rank 1] ProcessGroupNCCL heart beat monitor thread joined.
[rank0]:[I ProcessGroupNCCL.cpp:1109] [PG 1 Rank 0] ProcessGroupNCCL destructor entered.
[rank0]:[I ProcessGroupNCCL.cpp:1094] [PG 1 Rank 0] Launching ProcessGroupNCCL abort asynchrounously.
[rank0]:[I ProcessGroupNCCL.cpp:999] [PG 1 Rank 0] future is successfully executed for: ProcessGroup abort
[rank0]:[I ProcessGroupNCCL.cpp:1100] [PG 1 Rank 0] ProcessGroupNCCL aborts successfully.
[rank0]:[I ProcessGroupNCCL.cpp:1132] [PG 1 Rank 0] ProcessGroupNCCL watchdog thread joined.

Additionally, if I add a sleep of a few seconds before the model forward pass and press Ctrl-C during the sleep, the processes exit normally.

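import torch.multiprocessing as mp
import time

# start_processes passes the process index as the first argument
def worker(rank):
    # init global nccl processgroup

    try:
        for _ in range(10000):
            time.sleep(5)
            pass  # model forward
    except BaseException:  # same as a bare except: also catches KeyboardInterrupt
        pass  # do something here

if __name__ == '__main__':
    mp.start_processes(worker, nprocs=2)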
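One plausible reading of the sleep observation, not confirmed anywhere in this thread, is that Python raises KeyboardInterrupt promptly when the main thread is inside an interruptible call such as time.sleep, whereas a long-running native call postpones the Python-level handler until that call returns. The sketch below only demonstrates the first half of that statement with the standard library; interrupted_after is a hypothetical helper, and the code assumes a POSIX system and that it is called from the main thread.

```python
import signal
import threading
import time


def interrupted_after(sleep_s, signal_after_s):
    """Sleep in the main thread, send it SIGINT from a timer thread, and
    return how long the sleep lasted before KeyboardInterrupt arrived."""
    # Make sure SIGINT raises KeyboardInterrupt even if the caller's
    # environment started Python with SIGINT ignored.
    previous = signal.signal(signal.SIGINT, signal.default_int_handler)
    main_id = threading.main_thread().ident
    timer = threading.Timer(
        signal_after_s, signal.pthread_kill, args=(main_id, signal.SIGINT)
    )
    start = time.monotonic()
    timer.start()
    try:
        time.sleep(sleep_s)  # interruptible: wakes up when the signal arrives
    except KeyboardInterrupt:
        pass
    finally:
        timer.cancel()
        signal.signal(signal.SIGINT, previous)
    return time.monotonic() - start


if __name__ == "__main__":
    elapsed = interrupted_after(sleep_s=3.0, signal_after_s=0.2)
    print(f"sleep ended after {elapsed:.2f}s instead of 3.00s")
```

If the same signal arrived while the main thread was blocked in a native call that does not return, the except branch would not be reached until that call finished, which would be consistent with one rank staying stuck at the model forward pass.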

Versions

Python 3.10
NCCL version: 2.20.5
torch version: 2.3.0+cu118

cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @fegin @XilunWu @wanchaol @fduwjj @wz337 @tianyu-l @wconstab @yf225 @chauhang @d4l3k

zhouzaida commented 4 days ago

Hi @malfet, is there any progress?

zhouzaida commented 4 days ago

Hi @awgu, could you take a look at this issue?

awgu commented 4 days ago

Hey @zhouzaida! Sorry, I am going to be out for the next week. Hopefully, my team's next oncall can get to this.

zhouzaida commented 3 days ago

Thanks for your reply.