skypilot-org / skypilot

SkyPilot: Run AI and batch jobs on any infra (Kubernetes or 12+ clouds). Get unified execution, cost savings, and high GPU availability via a simple interface.
https://skypilot.readthedocs.io
Apache License 2.0

Failed to kill a sky job with sky cancel #233

Closed: infwinston closed this issue 2 years ago

infwinston commented 2 years ago

I failed to cancel my job with sky cancel; I had to manually SSH in and kill it. @Michaelvll have you run into this issue before?

(sky) weichiang@blaze:~/repos/bert-sign/sky-experiments/prototype/examples$ sky cancel -c sky-2920-weichiang 9
Cancelling jobs (9) on cluster sky-2920-weichiang...
(sky) weichiang@blaze:~/repos/bert-sign/sky-experiments/prototype/examples$ sky queue sky-2920-weichiang
Fetching and parsing job queue...

Sky Job Queue of Cluster sky-2920-weichiang
ID  NAME       USER       SUBMITTED   STATUS     LOG
9   bert_sign  weichiang  9 mins ago  CANCELLED  sky_logs/sky-2022-01-22-00-35-33-025250
8   bert_sign  weichiang  4 hrs ago   FAILED     sky_logs/sky-2022-01-21-20-18-28-505572
7   bert_sign  weichiang  8 hrs ago   FAILED     sky_logs/sky-2022-01-21-16-21-57-228872
6   <cmd>      weichiang  8 hrs ago   FAILED     sky_logs/sky-2022-01-21-16-01-15-015994
5   bert_sign  weichiang  8 hrs ago   FAILED     sky_logs/sky-2022-01-21-15-51-10-260086
4   bert_sign  weichiang  9 hrs ago   FAILED     sky_logs/sky-2022-01-21-15-42-01-661227
3   bert_sign  weichiang  9 hrs ago   FAILED     sky_logs/sky-2022-01-21-15-40-42-390030
2   bert_sign  weichiang  9 hrs ago   FAILED     sky_logs/sky-2022-01-21-15-37-32-004498
1   bert_sign  weichiang  9 hrs ago   FAILED     sky_logs/sky-2022-01-21-15-32-41-239102

On the VM, the job on GPU 1 (submitted by Sky) is still running.

(flax) ubuntu@ip-172-31-14-71:~/tensorflow_datasets/lm1b$ gpustat
ip-172-31-14-71          Sat Jan 22 08:51:30 2022  450.142.00
[0] Tesla V100-SXM2-16GB | 49°C,   3 % | 15180 / 16160 MB | ubuntu(15177M)
[1] Tesla V100-SXM2-16GB | 63°C, 100 % | 16022 / 16160 MB | ubuntu(16011M)
[2] Tesla V100-SXM2-16GB | 39°C,   0 % |     0 / 16160 MB |
[3] Tesla V100-SXM2-16GB | 42°C,   0 % |     0 / 16160 MB |
[4] Tesla V100-SXM2-16GB | 41°C,   0 % |     0 / 16160 MB |
[5] Tesla V100-SXM2-16GB | 42°C,   0 % |     0 / 16160 MB |
[6] Tesla V100-SXM2-16GB | 40°C,   0 % |     0 / 16160 MB |
[7] Tesla V100-SXM2-16GB | 41°C,   0 % |     0 / 16160 MB |
106906 ubuntu     20   0  4648   836   760 S  0.0  0.0  0:00.00 ├─ /bin/sh -c /bin/bash /tmp/sky_app_kabqgmyi
106913 ubuntu     20   0 13444  3532  3104 S  0.0  0.0  0:00.00 │  └─ /bin/bash /tmp/sky_app_kabqgmyi                                                                                                              
106935 ubuntu     20   0 66.4G 40.6G 5192M S  0.0  8.4 13:10.59 │     └─ python main.py --configs configs/products/bert-finetune/sbert-large-tmp2-mixup-deep.yml --data_root /home/ubuntu/dataset/
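
For reference, a sketch of the manual cleanup I end up doing over SSH: kill the whole process tree rooted at the sky_app wrapper. The PID is the one from the tree above, and psutil is only used here for illustration, not by Sky itself.

import signal
import psutil  # third-party helper, used only for this illustration

WRAPPER_PID = 106906  # PID of the /bin/sh wrapper from the process tree above

wrapper = psutil.Process(WRAPPER_PID)
# Collect the wrapper and all of its descendants (bash wrapper, python main.py, ...).
procs = [wrapper] + wrapper.children(recursive=True)

# Kill the whole tree so no child is left holding GPU memory.
for p in procs:
    p.kill()
psutil.wait_procs(procs, timeout=10)
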
Michaelvll commented 2 years ago

Yeah, that happens to me sometimes. Is it possible to create a small reproduction? I suspect it is caused by Ray's stale actors.

Michaelvll commented 2 years ago

After debugging on our GNN codebase, it seems that job cancellation still does not work with PyTorch DataParallel. PyTorch's DataParallel creating zombie processes is a long-standing problem. #235 mitigates it for non-multi-GPU PyTorch code, but multi-GPU DataParallel jobs are still affected (a rough illustration of how such processes get left behind is below, after the references).

References on DataParallel creating zombie processes:

  1. https://discuss.pytorch.org/t/pytorch-causes-zombie-processes-on-multi-gpu-system/140097
  2. https://discuss.pytorch.org/t/when-i-shut-down-the-pytorch-program-by-kill-i-encountered-the-problem-with-the-gpu/6315
  3. https://discuss.pytorch.org/t/dataparallel-causing-zombie-process-and-filled-gpu/4702
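
For context, here is a generic illustration of why SIGKILL-ing only the wrapper process is not enough; this is an assumption about the pattern, not PyTorch-specific: the wrapper's children survive unless the whole process group is killed.

import os
import signal
import subprocess
import time

# A stand-in for the sky_app wrapper: a shell that spawns a long-running child,
# launched in its own session/process group (like setsid).
wrapper = subprocess.Popen(["/bin/sh", "-c", "sleep 600 & wait"],
                           start_new_session=True)
time.sleep(1)

# SIGKILL-ing just the wrapper would orphan `sleep 600` -- the analogue of a
# training process left holding GPU memory after `sky cancel`:
#   os.kill(wrapper.pid, signal.SIGKILL)

# Killing the whole process group takes the children down as well.
os.killpg(os.getpgid(wrapper.pid), signal.SIGKILL)
wrapper.wait()
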
concretevitamin commented 2 years ago

We have also seen zombie processes in our Slurm cluster.

infwinston commented 2 years ago

This happens to me pretty much every time. The only solution for me is still to SSH in and kill the process. What method do we use to kill a job? Are we using ray.cancel? @Michaelvll

Michaelvll commented 2 years ago

This happens to me pretty much every time. The only solution for me is still to SSH in and kill the process. What method do we use to kill a job? Are we using ray.cancel? @Michaelvll

I think this problem also happens sometimes when you run a PyTorch DataParallel program and hit Ctrl+C on the server: the GPU memory is not cleaned up correctly. As mentioned above, this is more of a PyTorch bug than ours. If you use PyTorch DistributedDataParallel instead, the problem goes away.
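
For anyone hitting this, a minimal sketch of what switching to DistributedDataParallel looks like; the toy model and torchrun launch below are placeholders, not our actual code.

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets LOCAL_RANK/RANK/WORLD_SIZE; one process per GPU.
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl")

    model = torch.nn.Linear(1024, 1024).cuda(local_rank)  # placeholder model
    model = DDP(model, device_ids=[local_rank])

    try:
        pass  # ... training loop goes here ...
    finally:
        # Tear down NCCL state so the per-GPU workers exit cleanly on interruption.
        dist.destroy_process_group()

if __name__ == "__main__":
    # Launch with: torchrun --nproc_per_node=<num_gpus> this_script.py
    main()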

infwinston commented 2 years ago

Oh I see, though I personally haven't had issues with Ctrl+C. Does Sky send the kill signal the same way Ctrl+C does?

Michaelvll commented 2 years ago

Oh I see, though I personally haven't had issues with Ctrl+C. Does Sky send the kill signal the same way Ctrl+C does?

Not exactly: Ctrl+C sends SIGINT, but cancellation sends SIGKILL. Let me see if I can send SIGINT first to kill the task gracefully.
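
Roughly what I have in mind, as a sketch rather than the actual Sky implementation; the grace period is an arbitrary placeholder:

import os
import signal
import time

def cancel_job(pgid: int, grace_period: float = 10.0) -> None:
    # Try SIGINT first (like Ctrl+C) so the task can clean up, then escalate.
    try:
        os.killpg(pgid, signal.SIGINT)
    except ProcessLookupError:
        return  # already gone

    deadline = time.time() + grace_period
    while time.time() < deadline:
        try:
            os.killpg(pgid, 0)   # signal 0: just check the group still exists
        except ProcessLookupError:
            return               # exited gracefully
        time.sleep(0.5)

    # Still alive after the grace period: force-kill the whole group.
    os.killpg(pgid, signal.SIGKILL)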

Michaelvll commented 2 years ago

Fixed in #276. Closing.