Open zhouzaida opened 1 week ago
Hi @malfet, is there any progress?
Hi @awgu , could you take a look at this issue?
Hey @zhouzaida! Sorry, I am going to be out for the next week. Hopefully, my team's next oncall can get to this.
Thanks for your reply.
🐛 Describe the bug
I used PyTorch's `multiprocessing` to launch a multi-GPU task like the snippet below:
If I terminate the program with Ctrl-C, some child processes may not be killed, so the GPU memory is not released (the ProcessGroupNCCL is not destructed). Note that the process with rank=0 can enter `do something here`, while another process gets stuck at `model forward` (it often gets stuck on GPU-related operations).
The error message is as follows:
But sometimes the subprocesses can be killed and print the following information:
Additionally, if I sleep for a few seconds before the model forward and press Ctrl-C during that sleep, the process exits normally.
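One possible mitigation is for the parent to catch the `KeyboardInterrupt` and stop the workers explicitly instead of relying on each child to handle the signal. The sketch below uses the standard-library `multiprocessing`, which `torch.multiprocessing` wraps. `shutdown_workers` and `run_workers` are hypothetical helper names, not PyTorch APIs, and this has not been verified against the NCCL hang described above.

```python
import multiprocessing as mp
import time


def shutdown_workers(procs, grace_seconds=5.0):
    """Stop all workers: terminate() first, then kill() any that are still alive."""
    # Ask every live worker to exit (SIGTERM on POSIX).
    for p in procs:
        if p.is_alive():
            p.terminate()

    # Give the workers a shared grace period to exit.
    deadline = time.monotonic() + grace_seconds
    for p in procs:
        p.join(max(0.0, deadline - time.monotonic()))

    # Escalate for workers that did not exit (SIGKILL on POSIX).
    for p in procs:
        if p.is_alive():
            p.kill()
            p.join()


def run_workers(worker_fn, world_size):
    """Start one process per rank and clean up explicitly on Ctrl-C.

    Assumption: worker_fn(rank, world_size) is a module-level function,
    so it can be pickled by the "spawn" start method.
    """
    ctx = mp.get_context("spawn")
    procs = [
        ctx.Process(target=worker_fn, args=(rank, world_size))
        for rank in range(world_size)
    ]
    for p in procs:
        p.start()
    try:
        for p in procs:
            p.join()
    except KeyboardInterrupt:
        # Do not leave children behind holding GPU memory.
        shutdown_workers(procs)
        raise
```

`run_workers` should be called under an `if __name__ == "__main__":` guard. Escalating from `terminate()` to `kill()` is a design choice for workers that are blocked and never get to run a Python-level signal handler.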
Versions
cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @fegin @XilunWu @wanchaol @fduwjj @wz337 @tianyu-l @wconstab @yf225 @chauhang @d4l3k