skypilot-org / skypilot

SkyPilot: Run AI and batch jobs on any infra (Kubernetes or 12+ clouds). Get unified execution, cost savings, and high GPU availability via a simple interface.
https://skypilot.readthedocs.io
Apache License 2.0

[Core] Replace ray job submit for 3x/8.5x faster job scheduling for cluster/managed jobs #4318

Closed Michaelvll closed 6 days ago

Michaelvll commented 1 week ago

Fixes #4295 and Fixes #4293

ray job submit introduces a significant delay in our job submission and consumes additional memory. Although Ray jobs may provide some safeguards for abnormally failed jobs, they add little value to our job management when the status is handled carefully in our own job table. In this PR, we replace ray job submit with a subprocess and add a new state, FAILED_DRIVER, to distinguish user program failures from job driver failures (such as OOM).

Scheduling Speed for unmanaged jobs

Job scheduling is much faster: 60 seconds for 23 jobs (#4310) -> 25 seconds for 32 jobs (after reducing the CPU per job, it can be 60 seconds for 67 jobs), i.e. 0.38 jobs/s -> 1.1 jobs/s (~3x faster).
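The arithmetic behind the ~3x figure can be checked with a few lines of Python. This is only a sketch of the calculation using the numbers quoted above; the helper name jobs_per_second is made up for illustration.

```python
def jobs_per_second(num_jobs: int, seconds: float) -> float:
    """Scheduling throughput: jobs scheduled per second of wall time."""
    return num_jobs / seconds


# Numbers quoted in the description above.
before = jobs_per_second(23, 60)   # master (#4310)
after = jobs_per_second(67, 60)    # this PR, after reducing the CPU per job
speedup = after / before

print(f"{before:.2f} jobs/s -> {after:.2f} jobs/s ({speedup:.1f}x)")
```

The ratio comes out slightly below 3, which matches the "~3x" rounding in the description.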

job queue with this PR

```
75  sky-cmd  a few secs ago  -               -     1x[CPU:1+]  PENDING    ~/sky_logs/sky-2024-11-09-21-57-52-131459
74  sky-cmd  a few secs ago  < 1 sec         < 1s  1x[CPU:1+]  RUNNING    ~/sky_logs/sky-2024-11-09-21-57-50-439630
73  sky-cmd  a few secs ago  a few secs ago  2s    1x[CPU:1+]  RUNNING    ~/sky_logs/sky-2024-11-09-21-57-50-398196
72  sky-cmd  a few secs ago  a few secs ago  3s    1x[CPU:1+]  RUNNING    ~/sky_logs/sky-2024-11-09-21-57-46-160367
71  sky-cmd  a few secs ago  a few secs ago  4s    1x[CPU:1+]  RUNNING    ~/sky_logs/sky-2024-11-09-21-57-45-831961
70  sky-cmd  a few secs ago  a few secs ago  5s    1x[CPU:1+]  RUNNING    ~/sky_logs/sky-2024-11-09-21-57-45-193824
69  sky-cmd  a few secs ago  a few secs ago  6s    1x[CPU:1+]  RUNNING    ~/sky_logs/sky-2024-11-09-21-57-44-775659
68  sky-cmd  a few secs ago  a few secs ago  6s    1x[CPU:1+]  RUNNING    ~/sky_logs/sky-2024-11-09-21-57-45-117409
67  sky-cmd  a few secs ago  a few secs ago  7s    1x[CPU:1+]  RUNNING    ~/sky_logs/sky-2024-11-09-21-57-44-608202
66  sky-cmd  a few secs ago  a few secs ago  8s    1x[CPU:1+]  RUNNING    ~/sky_logs/sky-2024-11-09-21-57-43-623994
65  sky-cmd  a few secs ago  a few secs ago  9s    1x[CPU:1+]  RUNNING    ~/sky_logs/sky-2024-11-09-21-57-42-985405
64  sky-cmd  15 secs ago     11 secs ago     11s   1x[CPU:1+]  RUNNING    ~/sky_logs/sky-2024-11-09-21-57-38-759178
63  sky-cmd  15 secs ago     12 secs ago     12s   1x[CPU:1+]  RUNNING    ~/sky_logs/sky-2024-11-09-21-57-38-652951
62  sky-cmd  15 secs ago     13 secs ago     13s   1x[CPU:1+]  RUNNING    ~/sky_logs/sky-2024-11-09-21-57-38-725297
61  sky-cmd  16 secs ago     14 secs ago     14s   1x[CPU:1+]  RUNNING    ~/sky_logs/sky-2024-11-09-21-57-37-587498
60  sky-cmd  16 secs ago     15 secs ago     15s   1x[CPU:1+]  RUNNING    ~/sky_logs/sky-2024-11-09-21-57-37-990283
59  sky-cmd  16 secs ago     14 secs ago     14s   1x[CPU:1+]  RUNNING    ~/sky_logs/sky-2024-11-09-21-57-37-744277
58  sky-cmd  17 secs ago     15 secs ago     15s   1x[CPU:1+]  RUNNING    ~/sky_logs/sky-2024-11-09-21-57-36-753063
57  sky-cmd  17 secs ago     15 secs ago     15s   1x[CPU:1+]  RUNNING    ~/sky_logs/sky-2024-11-09-21-57-36-662268
56  sky-cmd  22 secs ago     17 secs ago     17s   1x[CPU:1+]  RUNNING    ~/sky_logs/sky-2024-11-09-21-57-31-550790
55  sky-cmd  22 secs ago     18 secs ago     18s   1x[CPU:1+]  RUNNING    ~/sky_logs/sky-2024-11-09-21-57-31-190262
54  sky-cmd  23 secs ago     19 secs ago     19s   1x[CPU:1+]  RUNNING    ~/sky_logs/sky-2024-11-09-21-57-31-334542
53  sky-cmd  23 secs ago     20 secs ago     20s   1x[CPU:1+]  RUNNING    ~/sky_logs/sky-2024-11-09-21-57-30-513732
52  sky-cmd  23 secs ago     21 secs ago     21s   1x[CPU:1+]  RUNNING    ~/sky_logs/sky-2024-11-09-21-57-30-478628
51  sky-cmd  23 secs ago     21 secs ago     21s   1x[CPU:1+]  RUNNING    ~/sky_logs/sky-2024-11-09-21-57-30-467094
50  sky-cmd  24 secs ago     22 secs ago     22s   1x[CPU:1+]  RUNNING    ~/sky_logs/sky-2024-11-09-21-57-29-433253
49  sky-cmd  24 secs ago     23 secs ago     23s   1x[CPU:1+]  RUNNING    ~/sky_logs/sky-2024-11-09-21-57-29-319268
48  sky-cmd  29 secs ago     24 secs ago     24s   1x[CPU:1+]  RUNNING    ~/sky_logs/sky-2024-11-09-21-57-25-158885
47  sky-cmd  29 secs ago     25 secs ago     25s   1x[CPU:1+]  RUNNING    ~/sky_logs/sky-2024-11-09-21-57-25-001693
46  sky-cmd  29 secs ago     26 secs ago     26s   1x[CPU:1+]  RUNNING    ~/sky_logs/sky-2024-11-09-21-57-24-605389
45  sky-cmd  30 secs ago     27 secs ago     27s   1x[CPU:1+]  RUNNING    ~/sky_logs/sky-2024-11-09-21-57-24-055194
44  sky-cmd  30 secs ago     27 secs ago     27s   1x[CPU:1+]  RUNNING    ~/sky_logs/sky-2024-11-09-21-57-23-610780
43  sky-cmd  31 secs ago     28 secs ago     28s   1x[CPU:1+]  RUNNING    ~/sky_logs/sky-2024-11-09-21-57-22-962817
42  sky-cmd  31 secs ago     29 secs ago     29s   1x[CPU:1+]  RUNNING    ~/sky_logs/sky-2024-11-09-21-57-22-456464
41  sky-cmd  31 secs ago     30 secs ago     30s   1x[CPU:1+]  RUNNING    ~/sky_logs/sky-2024-11-09-21-57-22-371177
40  sky-cmd  35 secs ago     31 secs ago     31s   1x[CPU:1+]  RUNNING    ~/sky_logs/sky-2024-11-09-21-57-17-681144
39  sky-cmd  35 secs ago     32 secs ago     32s   1x[CPU:1+]  RUNNING    ~/sky_logs/sky-2024-11-09-21-57-17-808512
38  sky-cmd  36 secs ago     33 secs ago     33s   1x[CPU:1+]  RUNNING    ~/sky_logs/sky-2024-11-09-21-57-17-623907
37  sky-cmd  36 secs ago     34 secs ago     34s   1x[CPU:1+]  RUNNING    ~/sky_logs/sky-2024-11-09-21-57-16-894306
36  sky-cmd  37 secs ago     34 secs ago     34s   1x[CPU:1+]  RUNNING    ~/sky_logs/sky-2024-11-09-21-57-16-245718
35  sky-cmd  37 secs ago     35 secs ago     35s   1x[CPU:1+]  RUNNING    ~/sky_logs/sky-2024-11-09-21-57-16-223589
34  sky-cmd  38 secs ago     36 secs ago     36s   1x[CPU:1+]  RUNNING    ~/sky_logs/sky-2024-11-09-21-57-15-132803
33  sky-cmd  38 secs ago     37 secs ago     37s   1x[CPU:1+]  RUNNING    ~/sky_logs/sky-2024-11-09-21-57-15-401731
32  sky-cmd  43 secs ago     38 secs ago     38s   1x[CPU:1+]  RUNNING    ~/sky_logs/sky-2024-11-09-21-57-10-874506
31  sky-cmd  43 secs ago     39 secs ago     39s   1x[CPU:1+]  RUNNING    ~/sky_logs/sky-2024-11-09-21-57-10-813577
30  sky-cmd  43 secs ago     40 secs ago     40s   1x[CPU:1+]  RUNNING    ~/sky_logs/sky-2024-11-09-21-57-10-840665
29  sky-cmd  44 secs ago     41 secs ago     41s   1x[CPU:1+]  RUNNING    ~/sky_logs/sky-2024-11-09-21-57-09-734297
28  sky-cmd  44 secs ago     41 secs ago     41s   1x[CPU:1+]  RUNNING    ~/sky_logs/sky-2024-11-09-21-57-09-642548
27  sky-cmd  44 secs ago     42 secs ago     42s   1x[CPU:1+]  RUNNING    ~/sky_logs/sky-2024-11-09-21-57-09-273863
26  sky-cmd  45 secs ago     43 secs ago     43s   1x[CPU:1+]  RUNNING    ~/sky_logs/sky-2024-11-09-21-57-07-539143
25  sky-cmd  45 secs ago     44 secs ago     44s   1x[CPU:1+]  RUNNING    ~/sky_logs/sky-2024-11-09-21-57-07-453114
24  sky-cmd  50 secs ago     47 secs ago     47s   1x[CPU:1+]  RUNNING    ~/sky_logs/sky-2024-11-09-21-57-02-913495
23  sky-cmd  50 secs ago     47 secs ago     47s   1x[CPU:1+]  RUNNING    ~/sky_logs/sky-2024-11-09-21-57-02-727743
22  sky-cmd  50 secs ago     48 secs ago     48s   1x[CPU:1+]  RUNNING    ~/sky_logs/sky-2024-11-09-21-57-03-117854
21  sky-cmd  51 secs ago     49 secs ago     49s   1x[CPU:1+]  RUNNING    ~/sky_logs/sky-2024-11-09-21-57-02-473544
20  sky-cmd  51 secs ago     50 secs ago     50s   1x[CPU:1+]  RUNNING    ~/sky_logs/sky-2024-11-09-21-57-01-902662
19  sky-cmd  51 secs ago     50 secs ago     50s   1x[CPU:1+]  RUNNING    ~/sky_logs/sky-2024-11-09-21-57-01-557019
18  sky-cmd  52 secs ago     51 secs ago     51s   1x[CPU:1+]  RUNNING    ~/sky_logs/sky-2024-11-09-21-57-00-507900
17  sky-cmd  53 secs ago     52 secs ago     52s   1x[CPU:1+]  RUNNING    ~/sky_logs/sky-2024-11-09-21-56-59-803208
16  sky-cmd  58 secs ago     54 secs ago     54s   1x[CPU:1+]  RUNNING    ~/sky_logs/sky-2024-11-09-21-56-55-498089
15  sky-cmd  58 secs ago     55 secs ago     55s   1x[CPU:1+]  RUNNING    ~/sky_logs/sky-2024-11-09-21-56-55-517846
14  sky-cmd  58 secs ago     56 secs ago     56s   1x[CPU:1+]  RUNNING    ~/sky_logs/sky-2024-11-09-21-56-55-392485
13  sky-cmd  58 secs ago     56 secs ago     56s   1x[CPU:1+]  RUNNING    ~/sky_logs/sky-2024-11-09-21-56-55-580883
12  sky-cmd  58 secs ago     57 secs ago     57s   1x[CPU:1+]  RUNNING    ~/sky_logs/sky-2024-11-09-21-56-55-577158
11  sky-cmd  59 secs ago     57 secs ago     57s   1x[CPU:1+]  RUNNING    ~/sky_logs/sky-2024-11-09-21-56-54-964683
10  sky-cmd  59 secs ago     58 secs ago     58s   1x[CPU:1+]  RUNNING    ~/sky_logs/sky-2024-11-09-21-56-53-712027
9   sky-cmd  1 min ago       59 secs ago     59s   1x[CPU:1+]  RUNNING    ~/sky_logs/sky-2024-11-09-21-56-53-476885
8   sky-cmd  1 min ago       1 min ago       1m    1x[CPU:1+]  RUNNING    ~/sky_logs/sky-2024-11-09-21-56-47-424912
7   sky-cmd  1 min ago       1 min ago       1m    1x[CPU:1+]  SUCCEEDED  ~/sky_logs/sky-2024-11-09-21-56-47-583078
```

Scheduling speed for managed jobs

Fixes #4294. Scheduling can now keep up with job submission: the managed jobs scheduling time drops from 29s to 3.4s, i.e. 8.5x faster.

Memory consumption

The memory consumption issue relates to #4334.

With 32 jobs running in parallel and many more jobs PENDING or being submitted:

Master: 8.0G
This PR: 6.2G

With more jobs:

Master: 58GB / 264 jobs (0.21GB/job)
This PR: 42.3GB / 264 jobs (0.16GB/job)

Correctness/Robustness

A job can get into the following situations:

  1. Job driver is successfully submitted and finishes with the job table status updated. This PR has no effect on the job status, as the driver sets it correctly.
  2. Job driver fails with dependency issues without setting the job status. This PR sets the job to the FAILED_DRIVER state, as the driver process is not running while the job is not in a terminal state. (Current master sets it to FAILED.)
  3. Job is in the RUNNING state but is stale due to a VM restart. This PR sets the job to the FAILED_DRIVER state, as the driver process is not running while the job is not in a terminal state.
  4. Job driver is killed by OOM. This PR sets the job to the FAILED_DRIVER state, as the driver process is not running while the job is not in a terminal state.
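The rule shared by cases 2 to 4 can be sketched roughly as follows. This is an illustrative sketch, not the PR's actual code: the JobStatus enum here is a simplified stand-in limited to states named in this thread, and reconcile_status, driver_is_running, and the "no PID recorded means the driver has not started yet" convention are assumptions made for the example.

```python
import enum
import os
import subprocess
import sys
from typing import Optional


class JobStatus(enum.Enum):
    # Simplified, hypothetical subset of states named in this thread.
    INIT = "INIT"
    PENDING = "PENDING"
    RUNNING = "RUNNING"
    SUCCEEDED = "SUCCEEDED"
    FAILED = "FAILED"
    FAILED_DRIVER = "FAILED_DRIVER"
    CANCELLED = "CANCELLED"


TERMINAL = {JobStatus.SUCCEEDED, JobStatus.FAILED,
            JobStatus.FAILED_DRIVER, JobStatus.CANCELLED}


def driver_is_running(pid: int) -> bool:
    """Probe a PID with signal 0, which checks existence without signalling."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # The process exists but belongs to another user.
    return True


def reconcile_status(status: JobStatus, driver_pid: Optional[int]) -> JobStatus:
    """Mark a job FAILED_DRIVER if its driver died before a terminal state."""
    if status in TERMINAL:
        return status  # Case 1: the driver already recorded the outcome.
    if driver_pid is None:
        return status  # Assumption: the driver has not been started yet.
    if driver_is_running(driver_pid):
        return status
    return JobStatus.FAILED_DRIVER  # Cases 2-4: driver gone, job not terminal.


# Demo: a short-lived child stands in for a driver that has already exited.
dead_driver = subprocess.Popen([sys.executable, "-c", "pass"])
dead_driver.wait()
print(reconcile_status(JobStatus.RUNNING, dead_driver.pid))
print(reconcile_status(JobStatus.RUNNING, os.getpid()))
```

A bare PID check like this cannot tell a dead driver from a reused PID, so a real implementation would likely need a stronger identity check (for example, the process creation time).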

Tested (run the relevant ones):

cg505 commented 1 week ago

which I presume is because of many parallel SSH connections

@romilbhardwaj I think the SSH error should be non-fatal? Is there anything else in the controller log?

Michaelvll commented 1 week ago

Thanks for trying it out @romilbhardwaj! I think this is unrelated to this PR's changes. If we exec without limiting the parallelism (my testing above uses a threadpool of 8 to do exec in parallel), I suspect the same issue may happen on the master branch.
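For reference, bounded-parallel submission along the lines of the threadpool of 8 mentioned above might look like the sketch below. The submit_one function is a hypothetical stand-in for a single sky exec call (it only sleeps and tracks concurrency), so the snippet runs without a cluster; it is not the script used for the tests above.

```python
import concurrent.futures
import threading
import time

MAX_PARALLEL_EXEC = 8  # Pool size mentioned in the comment above.

_lock = threading.Lock()
_active = 0
peak_active = 0  # Highest number of simultaneous submissions observed.


def submit_one(job_index: int) -> int:
    """Hypothetical stand-in for one `sky exec` call; just sleeps briefly."""
    global _active, peak_active
    with _lock:
        _active += 1
        peak_active = max(peak_active, _active)
    try:
        time.sleep(0.01)  # Placeholder for the SSH round trips of a real exec.
        return job_index
    finally:
        with _lock:
            _active -= 1


def submit_all(num_jobs: int, max_parallel: int = MAX_PARALLEL_EXEC) -> list:
    """Submit jobs with at most `max_parallel` submissions in flight."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_parallel) as pool:
        return list(pool.map(submit_one, range(num_jobs)))


results = submit_all(32)
print(f"submitted {len(results)} jobs, peak parallelism {peak_active}")
```

Capping the pool size bounds the number of simultaneous SSH connections, which is the suspected trigger for the dropped sessions.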

The behavior of our sky exec:

  1. Get the job id from the remote cluster with an SSH connection, and set the job state to INIT
  2. Add the actual job to the pending table, and set the job state to PENDING with another SSH connection

If the job state stays in INIT, it is likely because the second SSH connection was dropped due to some connection issue.

We can file an issue for that and leave the fix for unlimited parallel exec to a future PR?

Tested with the master branch, and we also have this issue with unlimited parallel jobs:

27  sky-cmd  58 mins ago  55 mins ago  30s       1x[CPU:1+]  SUCCEEDED  ~/sky_logs/sky-2024-11-11-23-54-34-566146
26  sky-cmd  58 mins ago  55 mins ago  30s       1x[CPU:1+]  SUCCEEDED  ~/sky_logs/sky-2024-11-11-23-54-36-028977
25  sky-cmd  58 mins ago  -            -         1x[CPU:1+]  FAILED     ~/sky_logs/sky-2024-11-11-23-54-34-155831
24  sky-cmd  58 mins ago  56 mins ago  31s       1x[CPU:1+]  SUCCEEDED  ~/sky_logs/sky-2024-11-11-23-54-33-531933
23  sky-cmd  58 mins ago  55 mins ago  31s       1x[CPU:1+]  SUCCEEDED  ~/sky_logs/sky-2024-11-11-23-54-33-835152
22  sky-cmd  58 mins ago  55 mins ago  31s       1x[CPU:1+]  SUCCEEDED  ~/sky_logs/sky-2024-11-11-23-54-31-328397
21  sky-cmd  58 mins ago  56 mins ago  30s       1x[CPU:1+]  SUCCEEDED  ~/sky_logs/sky-2024-11-11-23-54-35-596251
20  sky-cmd  58 mins ago  -            -         1x[CPU:1+]  FAILED     ~/sky_logs/sky-2024-11-11-23-54-34-519459
19  sky-cmd  58 mins ago  -            -         1x[CPU:1+]  FAILED     ~/sky_logs/sky-2024-11-11-23-54-34-357599
18  sky-cmd  58 mins ago  56 mins ago  30s       1x[CPU:1+]  SUCCEEDED  ~/sky_logs/sky-2024-11-11-23-54-31-581051
17  sky-cmd  58 mins ago  -            -         1x[CPU:1+]  FAILED     ~/sky_logs/sky-2024-11-11-23-54-35-661370
16  sky-cmd  58 mins ago  56 mins ago  32s       1x[CPU:1+]  SUCCEEDED  ~/sky_logs/sky-2024-11-11-23-54-34-265409
15  sky-cmd  58 mins ago  56 mins ago  30s       1x[CPU:1+]  SUCCEEDED  ~/sky_logs/sky-2024-11-11-23-54-34-431275
14  sky-cmd  58 mins ago  56 mins ago  31s       1x[CPU:1+]  SUCCEEDED  ~/sky_logs/sky-2024-11-11-23-54-34-712315
13  sky-cmd  58 mins ago  56 mins ago  31s       1x[CPU:1+]  SUCCEEDED  ~/sky_logs/sky-2024-11-11-23-54-35-974798
12  sky-cmd  58 mins ago  -            -         1x[CPU:1+]  FAILED     ~/sky_logs/sky-2024-11-11-23-54-35-871268
11  sky-cmd  58 mins ago  -            -         1x[CPU:1+]  FAILED     ~/sky_logs/sky-2024-11-11-23-54-35-594255
10  sky-cmd  58 mins ago  -            -         1x[CPU:1+]  FAILED     ~/sky_logs/sky-2024-11-11-23-54-31-384822
9   sky-cmd  1 hr ago     57 mins ago  30s       1x[CPU:1+]  SUCCEEDED  ~/sky_logs/sky-2024-11-11-23-54-33-917343
8   sky-cmd  1 hr ago     57 mins ago  30s       1x[CPU:1+]  SUCCEEDED  ~/sky_logs/sky-2024-11-11-23-54-33-488413
7   sky-cmd  1 hr ago     58 mins ago  30s       1x[CPU:1+]  SUCCEEDED  ~/sky_logs/sky-2024-11-11-23-54-36-823567
6   sky-cmd  1 hr ago     58 mins ago  30s       1x[CPU:1+]  SUCCEEDED  ~/sky_logs/sky-2024-11-11-23-54-32-767428
5   sky-cmd  1 hr ago     58 mins ago  30s       1x[CPU:1+]  SUCCEEDED  ~/sky_logs/sky-2024-11-11-23-54-33-170796
4   sky-cmd  1 hr ago     -            -         1x[CPU:1+]  FAILED     ~/sky_logs/sky-2024-11-11-23-54-35-644537
3   sky-cmd  1 hr ago     58 mins ago  30s       1x[CPU:1+]  SUCCEEDED  ~/sky_logs/sky-2024-11-11-23-54-35-298117
2   sky-cmd  1 hr ago     1 hr ago     30s       1x[CPU:1+]  SUCCEEDED  ~/sky_logs/sky-2024-11-11-23-54-33-565838
1   sky-cmd  1 hr ago     1 hr ago     30s       1x[CPU:1+]  SUCCEEDED  ~/sky_logs/sky-2024-11-11-23-54-37-107369

romilbhardwaj commented 1 week ago

I think the SSH error should be non-fatal? Is there anything else in the controller log?

This was regular sky exec (not sky jobs launch), so no controller logs. Only error I saw was a bunch of:

mux_client_request_session: session request failed: Session open refused by peer
ControlSocket /tmp/skypilot_ssh_2ea485ea/f8972454af/d726266e8825cc1d786f54c69678eccb59300ed1 already exists, disabling multiplexing

We can file an issue for that and leave the fix for unlimited parallel exec to a future PR?

sgtm!

Michaelvll commented 1 week ago

Made some major changes in the cancellation. Testing again on 31bce60:

Tested (run the relevant ones):

Michaelvll commented 1 week ago

An issue found: after running sky cancel on several jobs, some jobs close to the cancelled ones may turn into the FAILED_DRIVER state. Hypothesis: the other jobs may be scheduled by the scheduler.schedule_step call in the driver of the cancelled job, which might put their processes in the same process group as the cancelled job's driver process, or make them child processes of it, so that they get killed by the kill_process_daemon. We should fix this.

142  sky-cmd  9 mins ago   20 secs ago     20s       1x[CPU:1+]  RUNNING        ~/sky_logs/sky-2024-11-14-02-05-06-716421  
141  sky-cmd  9 mins ago   20 secs ago     20s       1x[CPU:1+]  RUNNING        ~/sky_logs/sky-2024-11-14-02-05-06-450754  
140  sky-cmd  9 mins ago   21 secs ago     21s       1x[CPU:1+]  RUNNING        ~/sky_logs/sky-2024-11-14-02-05-06-619355  
139  sky-cmd  9 mins ago   22 secs ago     22s       1x[CPU:1+]  RUNNING        ~/sky_logs/sky-2024-11-14-02-05-05-888792  
138  sky-cmd  9 mins ago   23 secs ago     23s       1x[CPU:1+]  RUNNING        ~/sky_logs/sky-2024-11-14-02-05-06-738846  
137  sky-cmd  9 mins ago   23 secs ago     23s       1x[CPU:1+]  RUNNING        ~/sky_logs/sky-2024-11-14-02-05-05-599533  
136  sky-cmd  9 mins ago   24 secs ago     24s       1x[CPU:1+]  RUNNING        ~/sky_logs/sky-2024-11-14-02-05-05-721108  
135  sky-cmd  9 mins ago   25 secs ago     25s       1x[CPU:1+]  RUNNING        ~/sky_logs/sky-2024-11-14-02-05-05-505060  
134  sky-cmd  9 mins ago   -               -         1x[CPU:1+]  FAILED_DRIVER  ~/sky_logs/sky-2024-11-14-02-05-04-679146  
133  sky-cmd  10 mins ago  8 mins ago      7m 46s    1x[CPU:1+]  FAILED_DRIVER  ~/sky_logs/sky-2024-11-14-02-04-52-103688  
132  sky-cmd  10 mins ago  8 mins ago      7m 47s    1x[CPU:1+]  FAILED_DRIVER  ~/sky_logs/sky-2024-11-14-02-04-51-440670  
131  sky-cmd  10 mins ago  8 mins ago      7m 48s    1x[CPU:1+]  FAILED_DRIVER  ~/sky_logs/sky-2024-11-14-02-04-51-589677  
130  sky-cmd  10 mins ago  8 mins ago      7m 49s    1x[CPU:1+]  FAILED_DRIVER  ~/sky_logs/sky-2024-11-14-02-04-51-925204  
129  sky-cmd  10 mins ago  8 mins ago      8m 18s    1x[CPU:1+]  FAILED_DRIVER  ~/sky_logs/sky-2024-11-14-02-04-51-537510  
128  sky-cmd  10 mins ago  8 mins ago      8m 11s    1x[CPU:1+]  CANCELLED      ~/sky_logs/sky-2024-11-14-02-04-50-996221  
127  sky-cmd  10 mins ago  8 mins ago      8m 20s    1x[CPU:1+]  FAILED_DRIVER  ~/sky_logs/sky-2024-11-14-02-04-51-034328  
126  sky-cmd  10 mins ago  8 mins ago      8m 21s    1x[CPU:1+]  FAILED_DRIVER  ~/sky_logs/sky-2024-11-14-02-04-50-943330  
125  sky-cmd  10 mins ago  8 mins ago      8m 22s    1x[CPU:1+]  FAILED_DRIVER  ~/sky_logs/sky-2024-11-14-02-04-51-022167  
124  sky-cmd  10 mins ago  8 mins ago      8m 23s    1x[CPU:1+]  FAILED_DRIVER  ~/sky_logs/sky-2024-11-14-02-04-50-909505  
123  sky-cmd  10 mins ago  8 mins ago      8m 24s    1x[CPU:1+]  FAILED_DRIVER  ~/sky_logs/sky-2024-11-14-02-04-50-789364  
122  sky-cmd  10 mins ago  8 mins ago      8m 25s    1x[CPU:1+]  FAILED_DRIVER  ~/sky_logs/sky-2024-11-14-02-04-50-775809  
121  sky-cmd  10 mins ago  8 mins ago      8m 29s    1x[CPU:1+]  FAILED_DRIVER  ~/sky_logs/sky-2024-11-14-02-04-50-334967  
120  sky-cmd  10 mins ago  8 mins ago      8m 30s    1x[CPU:1+]  FAILED_DRIVER  ~/sky_logs/sky-2024-11-14-02-04-49-284130  
119  sky-cmd  10 mins ago  8 mins ago      8m 31s    1x[CPU:1+]  FAILED_DRIVER  ~/sky_logs/sky-2024-11-14-02-04-49-842914  
118  sky-cmd  10 mins ago  8 mins ago      8m 32s    1x[CPU:1+]  FAILED_DRIVER  ~/sky_logs/sky-2024-11-14-02-04-49-201166  
117  sky-cmd  10 mins ago  8 mins ago      8m 33s    1x[CPU:1+]  FAILED_DRIVER  ~/sky_logs/sky-2024-11-14-02-04-37-037409  
116  sky-cmd  10 mins ago  9 mins ago      8m 34s    1x[CPU:1+]  FAILED_DRIVER  ~/sky_logs/sky-2024-11-14-02-04-37-084538  
115  sky-cmd  10 mins ago  9 mins ago      8m 35s    1x[CPU:1+]  FAILED_DRIVER  ~/sky_logs/sky-2024-11-14-02-04-36-744599  
114  sky-cmd  10 mins ago  9 mins ago      8m 36s    1x[CPU:1+]  FAILED_DRIVER  ~/sky_logs/sky-2024-11-14-02-04-37-003736  
113  sky-cmd  10 mins ago  9 mins ago      8m 37s    1x[CPU:1+]  FAILED_DRIVER  ~/sky_logs/sky-2024-11-14-02-04-36-864517  
112  sky-cmd  10 mins ago  9 mins ago      8m 38s    1x[CPU:1+]  FAILED_DRIVER  ~/sky_logs/sky-2024-11-14-02-04-36-710345  
111  sky-cmd  10 mins ago  9 mins ago      8m 31s    1x[CPU:1+]  CANCELLED      ~/sky_logs/sky-2024-11-14-02-04-37-177211  

The problem is that start_new_session, which we set when starting the job process, only creates a new session; the process itself is still a child of the original process that called subprocess. As a result, killing the child processes of a cancelled job destroys the other jobs as well.
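This behaviour of start_new_session can be seen in isolation with Python's standard subprocess module on a POSIX system. The snippet below is only a demonstration of the OS semantics, not SkyPilot code; inspect_child and CHILD_CODE are names made up for the example.

```python
import os
import subprocess
import sys

# The child reports its parent PID, session ID, process group ID and own PID.
CHILD_CODE = (
    "import os; "
    "print(os.getppid(), os.getsid(0), os.getpgid(0), os.getpid())"
)


def inspect_child(start_new_session: bool) -> dict:
    """Run a short-lived child and return the IDs it observed about itself."""
    out = subprocess.run(
        [sys.executable, "-c", CHILD_CODE],
        start_new_session=start_new_session,
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    ppid, sid, pgid, pid = (int(field) for field in out.split())
    return {"ppid": ppid, "sid": sid, "pgid": pgid, "pid": pid}


detached = inspect_child(start_new_session=True)
print("our pid:", os.getpid(), "our session:", os.getsid(0))
print("child:", detached)
```

The child becomes the leader of a new session and process group, yet its parent PID is still the launching process, so anything that walks and kills the launcher's descendants will reach it.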

To reproduce:

  1. Start 32 jobs with 60 minutes runtime
  2. Start another 200 jobs with 600000 minutes runtime
  3. Wait until some of the 32 jobs finish, which schedules new jobs
  4. Cancel some of the jobs

    See the ps faux output on the cluster below:

    ubuntu     17888  0.0  0.0   7764  3456 ?        Ss   03:26   0:00 /bin/bash -c echo "SKYPILOT_JOB_ID <95>" && cd ~/sky_workdir && $([ -s ~/.sky/python_path ] && cat ~/.sky/python_path 2> /dev/null || which python3) -u ~/.sky/sky_app/sky_job_95> ~/sky_logs/sky-2024-11-14-03-25-43-592146/r
    ubuntu     17891  0.3  0.1 23941308 115528 ?     Sl   03:26   0:02  \_ /home/ubuntu/skypilot-runtime/bin/python -u /home/ubuntu/.sky/sky_app/sky_job_95
    ubuntu     18021  0.0  0.0   7764  3456 ?        Ss   03:26   0:00      \_ /bin/bash -c echo "SKYPILOT_JOB_ID <96>" && cd ~/sky_workdir && $([ -s ~/.sky/python_path ] && cat ~/.sky/python_path 2> /dev/null || which python3) -u ~/.sky/sky_app/sky_job_96> ~/sky_logs/sky-2024-11-14-03-25-43-
    ubuntu     18024  0.3  0.1 23941308 115592 ?     Sl   03:26   0:02          \_ /home/ubuntu/skypilot-runtime/bin/python -u /home/ubuntu/.sky/sky_app/sky_job_96
    ubuntu     18155  0.0  0.0   7764  3328 ?        Ss   03:26   0:00              \_ /bin/bash -c echo "SKYPILOT_JOB_ID <97>" && cd ~/sky_workdir && $([ -s ~/.sky/python_path ] && cat ~/.sky/python_path 2> /dev/null || which python3) -u ~/.sky/sky_app/sky_job_97> ~/sky_logs/sky-2024-11-14-0
    ubuntu     18158  0.3  0.1 23942320 115460 ?     Sl   03:26   0:02                  \_ /home/ubuntu/skypilot-runtime/bin/python -u /home/ubuntu/.sky/sky_app/sky_job_97
    ubuntu     18311  0.0  0.0   7764  3456 ?        Ss   03:26   0:00                      \_ /bin/bash -c echo "SKYPILOT_JOB_ID <98>" && cd ~/sky_workdir && $([ -s ~/.sky/python_path ] && cat ~/.sky/python_path 2> /dev/null || which python3) -u ~/.sky/sky_app/sky_job_98> ~/sky_logs/sky-2024
    ubuntu     18314  0.3  0.1 23942320 115840 ?     Sl   03:26   0:02                          \_ /home/ubuntu/skypilot-runtime/bin/python -u /home/ubuntu/.sky/sky_app/sky_job_98

Michaelvll commented 1 week ago

After investigating the comment here: https://github.com/skypilot-org/skypilot/pull/4318#issuecomment-2475229178

It seems that all the job driver processes we run are under control, and sending SIGTERM to the process group is enough, as the driver processes will correctly clean up the underlying tasks. For example, take the job process in the process group above:

ubuntu     18155  0.0  0.0   7764  3328 ?        Ss   03:26   0:00              \_ /bin/bash -c echo "SKYPILOT_JOB_ID <97>" && cd ~/sky_workdir && $([ -s ~/.sky/python_path ] && cat ~/.sky/python_path 2> /dev/null || which python3) -u ~/.sky/sky_app/sky_job_97> ~/sky_logs/sky-2024-11-14-0
ubuntu     18158  0.3  0.1 23942320 115460 ?     Sl   03:26   0:02                  \_ /home/ubuntu/skypilot-runtime/bin/python -u /home/ubuntu/.sky/sky_app/sky_job_97

The job driver process starts the actual user job as a ray task under raylet, which is not a descendant of the driver process, and killing the driver process makes the raylet clean up the specific ray task:

ubuntu     20760  0.2  0.1 23643804 112152 ?     SNl  03:27   0:04  \_ ray::sky-cmd,
ubuntu     20957  0.0  0.0   2892  1664 ?        SNs  03:27   0:00  |   \_ /bin/sh -c /bin/bash -i /tmp/sky_app_cx2gd3_p
ubuntu     20958  0.0  0.0   9188  4992 ?        SN   03:27   0:00  |   |   \_ /bin/bash -i /tmp/sky_app_cx2gd3_p
ubuntu     21086  0.0  0.0   6192  1920 ?        SN   03:27   0:00  |   |       \_ sleep 3600
ubuntu     20966  0.0  0.0      0     0 ?        ZN   03:27   0:00  |   \_ [python] <defunct>

Hence, we don't need to start a daemon to forcefully kill the process group during cancellation, which significantly reduces the time sky cancel takes.
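A minimal sketch of cancelling by sending SIGTERM to the driver's process group, assuming a POSIX system, is shown below. The terminate_process_group helper and the sleeping stand-in driver are hypothetical and only illustrate the signal flow; they are not the code in this PR.

```python
import os
import signal
import subprocess
import sys


def terminate_process_group(pid: int) -> None:
    """Send SIGTERM to the whole process group that `pid` belongs to."""
    try:
        os.killpg(os.getpgid(pid), signal.SIGTERM)
    except ProcessLookupError:
        pass  # The process already exited; nothing left to cancel.


# Stand-in for a job driver, started in its own session and process group.
driver = subprocess.Popen(
    [sys.executable, "-c", "import time; time.sleep(60)"],
    start_new_session=True,
)

terminate_process_group(driver.pid)
exit_code = driver.wait(timeout=10)
print("driver exit code:", exit_code)
```

Because the stand-in driver runs in its own process group, the signal does not reach the launching process. A negative exit code from Popen.wait means the child was terminated by that signal number.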

Michaelvll commented 1 week ago

Our current process tree for the driver processes is not ideal, as everything is chained in a single tree, and canceling a job splits the tree into two. For example, for the tree above, if we cancel 97, the process tree becomes:

ubuntu     17891  0.2  0.1 23941308 115656 ?     Sl   03:26   0:05  \_ /home/ubuntu/skypilot-runtime/bin/python -u /home/ubuntu/.sky/sky_app/sky_job_95
ubuntu     18021  0.0  0.0   7764  3456 ?        Ss   03:26   0:00      \_ /bin/bash -c echo "SKYPILOT_JOB_ID <96>" && cd ~/sky_workdir && $([ -s ~/.sky/python_path ] && cat ~/.sky/python_path 2> /dev/null || which python3) -u ~/.sky/sky_app/sky_job_96> ~/sky_logs/sky-2024-11-14-03-25-43-523420
ubuntu     18024  0.2  0.1 23941308 115976 ?     Sl   03:26   0:05          \_ /home/ubuntu/skypilot-runtime/bin/python -u /home/ubuntu/.sky/sky_app/sky_job_96
ubuntu     18155  0.0  0.0      0     0 ?        Zs   03:26   0:00              \_ [bash] <defunct>
ubuntu     18311  0.0  0.0   7764  3456 ?        Ss   03:26   0:00 /bin/bash -c echo "SKYPILOT_JOB_ID <98>" && cd ~/sky_workdir && $([ -s ~/.sky/python_path ] && cat ~/.sky/python_path 2> /dev/null || which python3) -u ~/.sky/sky_app/sky_job_98> ~/sky_logs/sky-2024-11-14-03-25-55-373930/run.log
ubuntu     18314  0.2  0.1 23942320 115840 ?     Sl   03:26   0:05  \_ /home/ubuntu/skypilot-runtime/bin/python -u /home/ubuntu/.sky/sky_app/sky_job_98
ubuntu     18444  0.0  0.0   7764  3456 ?        Ss   03:26   0:00      \_ /bin/bash -c echo "SKYPILOT_JOB_ID <99>" && cd ~/sky_workdir && $([ -s ~/.sky/python_path ] && cat ~/.sky/python_path 2> /dev/null || which python3) -u ~/.sky/sky_app/sky_job_99> ~/sky_logs/sky-2024-11-14-03-25-56-846300
ubuntu     18447  0.2  0.1 23942332 115840 ?     Sl   03:26   0:05          \_ /home/ubuntu/skypilot-runtime/bin/python -u /home/ubuntu/.sky/sky_app/sky_job_99    

This works at the moment, but we should move to a more elegant solution using skylet as the single source that starts the job driver processes.

romilbhardwaj commented 1 week ago

Thanks for investigating @Michaelvll. I just confirmed correctness of sky cancel with this misbehaving script:

run: |
  # Trap SIGTERM and ignore it
  trap "" SIGTERM

  for ((i=1; i<=3600; i++)); do
    echo "Count: $i"
    echo "Count: $i" >> /tmp/count.txt
    sleep 1
  done

sky cancel indeed kills the process.

Michaelvll commented 1 week ago

With 5650d26, we are now able to avoid the chain of processes. : )

Michaelvll commented 1 week ago