infwinston closed this issue 2 years ago
Yeah, that happens to me sometimes. Is it possible to create a small reproduction script? I suspect it is caused by Ray's stale actors.
After debugging on our GNN codebase, it seems the job cancel still does not work with PyTorch DataParallel. This is a long-standing problem with PyTorch's DataParallel creating zombie processes. Although #235 mitigates the problem for non-multi-GPU PyTorch code, it is still an issue here.
References for DataParallel creating zombie processes:
See also zombie processes in our Slurm cluster:
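Below is a minimal repro sketch (not from the original thread) of the kind of script asked for above: a DataParallel training loop with multi-process data loading. If the main process is killed with SIGKILL, the DataLoader worker processes can be left behind holding GPU memory, matching the zombie-process behavior described. It assumes a multi-GPU machine with PyTorch and CUDA available.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

def main():
    # Wrap a trivial model in DataParallel so it spans all visible GPUs.
    model = nn.DataParallel(nn.Linear(128, 10)).cuda()
    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    loss_fn = nn.CrossEntropyLoss()

    data = TensorDataset(torch.randn(1024, 128), torch.randint(0, 10, (1024,)))
    # num_workers > 0 spawns child processes; these are the ones that can
    # linger if the parent is force-killed mid-epoch.
    loader = DataLoader(data, batch_size=64, num_workers=4)

    # Long-running loop so the job can be cancelled while it is still training.
    for _ in range(1000):
        for x, y in loader:
            opt.zero_grad()
            loss_fn(model(x.cuda()), y.cuda()).backward()
            opt.step()

if __name__ == "__main__":
    main()
```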
This happens to me pretty much every time. The only solution for me is still to SSH in and kill the process. What method do we use to kill a job? Are we using `ray.cancel`? @Michaelvll
I think this problem also happens sometimes when you run a PyTorch DataParallel program and hit Ctrl+C on the server: the GPU memory is not cleaned up correctly. As mentioned above, this is more of a PyTorch bug than ours. If you use PyTorch DistributedDataParallel instead, the problem disappears.
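For reference, here is a minimal sketch (not from the thread) of the DistributedDataParallel alternative mentioned above, assuming a single node launched with `torchrun --nproc_per_node=<num_gpus> train_ddp.py`. Each GPU gets its own process, which DDP tears down when the script exits.

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets LOCAL_RANK and the rendezvous environment variables.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = DDP(nn.Linear(128, 10).cuda(), device_ids=[local_rank])
    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    loss_fn = nn.CrossEntropyLoss()

    for _ in range(100):
        x = torch.randn(64, 128, device=local_rank)
        y = torch.randint(0, 10, (64,), device=local_rank)
        opt.zero_grad()
        loss_fn(model(x), y).backward()
        opt.step()

    # Explicit teardown of the process group on clean exit.
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```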
Oh, I see, though I personally haven't had issues with Ctrl+C. Does Sky send the kill signal the same way Ctrl+C does?
Not exactly the same: Ctrl+C sends `SIGINT`, but cancellation sends `SIGKILL`. Let me see if I can send `SIGINT` first to kill the task gracefully.
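A minimal sketch of that "SIGINT first, then SIGKILL" approach, purely illustrative and not the actual Sky/Ray cancellation code: give the task a grace period to run its cleanup handlers after `SIGINT`, then force-kill it.

```python
import os
import signal
import time

def cancel(pid: int, grace_seconds: float = 10.0) -> None:
    # First ask nicely: SIGINT lets the task run KeyboardInterrupt handlers
    # (e.g. PyTorch DataLoader worker cleanup).
    os.kill(pid, signal.SIGINT)

    deadline = time.time() + grace_seconds
    while time.time() < deadline:
        try:
            os.kill(pid, 0)  # probe only; raises OSError once the process is gone
        except OSError:
            return           # exited gracefully within the grace period
        time.sleep(0.5)

    # Grace period expired: force-kill the task.
    os.kill(pid, signal.SIGKILL)
```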
Fixed in #276. Closing.
I failed to cancel my job with `sky cancel`. I had to manually SSH in and kill it. @Michaelvll, have you run into this issue before? On the VM, the job on GPU 1 (submitted by Sky) is still running.