> …which I presume is because of many parallel SSH connections
@romilbhardwaj I think the SSH error should be non-fatal? Is there anything else in the controller log?
Thanks for trying it out @romilbhardwaj! I think this is unrelated to this PR's changes. If we `exec` without limiting the parallelism (my testing above uses a thread pool of 8 to run `exec` in parallel), I suspect the same issue would also happen on the master branch.
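For reference, the bounded-parallelism setup mentioned above looks roughly like the following (a hypothetical test driver, not part of this PR; it assumes an existing cluster named test-queue):

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

CLUSTER = 'test-queue'  # assumed cluster name
NUM_JOBS = 32           # number of exec jobs to submit

def submit(i: int) -> int:
    # Each call runs one `sky exec`; -d detaches so the CLI returns after
    # submission instead of tailing logs.
    proc = subprocess.run(
        ['sky', 'exec', CLUSTER, f'echo job {i}; sleep 1000', '-d'],
        capture_output=True, text=True)
    return proc.returncode

# Cap concurrent submissions at 8, matching the thread pool size mentioned
# above; removing the cap is what surfaces the SSH errors discussed here.
with ThreadPoolExecutor(max_workers=8) as pool:
    return_codes = list(pool.map(submit, range(NUM_JOBS)))
print('failed submissions:', sum(code != 0 for code in return_codes))
```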
The behavior of our `sky exec`: if the job state stays in INIT, it is likely because the second SSH connection was dropped due to some connection issue.

We can file an issue for that and leave the fix for unlimited parallel `exec` to a future PR?
Tested with the master branch, and we also have this issue with unlimited parallel jobs:
27 sky-cmd 58 mins ago 55 mins ago 30s 1x[CPU:1+] SUCCEEDED ~/sky_logs/sky-2024-11-11-23-54-34-566146
26 sky-cmd 58 mins ago 55 mins ago 30s 1x[CPU:1+] SUCCEEDED ~/sky_logs/sky-2024-11-11-23-54-36-028977
25 sky-cmd 58 mins ago - - 1x[CPU:1+] FAILED ~/sky_logs/sky-2024-11-11-23-54-34-155831
24 sky-cmd 58 mins ago 56 mins ago 31s 1x[CPU:1+] SUCCEEDED ~/sky_logs/sky-2024-11-11-23-54-33-531933
23 sky-cmd 58 mins ago 55 mins ago 31s 1x[CPU:1+] SUCCEEDED ~/sky_logs/sky-2024-11-11-23-54-33-835152
22 sky-cmd 58 mins ago 55 mins ago 31s 1x[CPU:1+] SUCCEEDED ~/sky_logs/sky-2024-11-11-23-54-31-328397
21 sky-cmd 58 mins ago 56 mins ago 30s 1x[CPU:1+] SUCCEEDED ~/sky_logs/sky-2024-11-11-23-54-35-596251
20 sky-cmd 58 mins ago - - 1x[CPU:1+] FAILED ~/sky_logs/sky-2024-11-11-23-54-34-519459
19 sky-cmd 58 mins ago - - 1x[CPU:1+] FAILED ~/sky_logs/sky-2024-11-11-23-54-34-357599
18 sky-cmd 58 mins ago 56 mins ago 30s 1x[CPU:1+] SUCCEEDED ~/sky_logs/sky-2024-11-11-23-54-31-581051
17 sky-cmd 58 mins ago - - 1x[CPU:1+] FAILED ~/sky_logs/sky-2024-11-11-23-54-35-661370
16 sky-cmd 58 mins ago 56 mins ago 32s 1x[CPU:1+] SUCCEEDED ~/sky_logs/sky-2024-11-11-23-54-34-265409
15 sky-cmd 58 mins ago 56 mins ago 30s 1x[CPU:1+] SUCCEEDED ~/sky_logs/sky-2024-11-11-23-54-34-431275
14 sky-cmd 58 mins ago 56 mins ago 31s 1x[CPU:1+] SUCCEEDED ~/sky_logs/sky-2024-11-11-23-54-34-712315
13 sky-cmd 58 mins ago 56 mins ago 31s 1x[CPU:1+] SUCCEEDED ~/sky_logs/sky-2024-11-11-23-54-35-974798
12 sky-cmd 58 mins ago - - 1x[CPU:1+] FAILED ~/sky_logs/sky-2024-11-11-23-54-35-871268
11 sky-cmd 58 mins ago - - 1x[CPU:1+] FAILED ~/sky_logs/sky-2024-11-11-23-54-35-594255
10 sky-cmd 58 mins ago - - 1x[CPU:1+] FAILED ~/sky_logs/sky-2024-11-11-23-54-31-384822
9 sky-cmd 1 hr ago 57 mins ago 30s 1x[CPU:1+] SUCCEEDED ~/sky_logs/sky-2024-11-11-23-54-33-917343
8 sky-cmd 1 hr ago 57 mins ago 30s 1x[CPU:1+] SUCCEEDED ~/sky_logs/sky-2024-11-11-23-54-33-488413
7 sky-cmd 1 hr ago 58 mins ago 30s 1x[CPU:1+] SUCCEEDED ~/sky_logs/sky-2024-11-11-23-54-36-823567
6 sky-cmd 1 hr ago 58 mins ago 30s 1x[CPU:1+] SUCCEEDED ~/sky_logs/sky-2024-11-11-23-54-32-767428
5 sky-cmd 1 hr ago 58 mins ago 30s 1x[CPU:1+] SUCCEEDED ~/sky_logs/sky-2024-11-11-23-54-33-170796
4 sky-cmd 1 hr ago - - 1x[CPU:1+] FAILED ~/sky_logs/sky-2024-11-11-23-54-35-644537
3 sky-cmd 1 hr ago 58 mins ago 30s 1x[CPU:1+] SUCCEEDED ~/sky_logs/sky-2024-11-11-23-54-35-298117
2 sky-cmd 1 hr ago 1 hr ago 30s 1x[CPU:1+] SUCCEEDED ~/sky_logs/sky-2024-11-11-23-54-33-565838
1 sky-cmd 1 hr ago 1 hr ago 30s 1x[CPU:1+] SUCCEEDED ~/sky_logs/sky-2024-11-11-23-54-37-107369
> I think the SSH error should be non-fatal? Is there anything else in the controller log?
This was regular `sky exec` (not `sky jobs launch`), so no controller logs. The only error I saw was a bunch of:
mux_client_request_session: session request failed: Session open refused by peer
ControlSocket /tmp/skypilot_ssh_2ea485ea/f8972454af/d726266e8825cc1d786f54c69678eccb59300ed1 already exists, disabling multiplexing
> We can file an issue for that and leave the fix for unlimited parallel exec to a future PR?
sgtm!
Made some major changes to the cancellation logic. Testing again on 31bce60:
Tested (run the relevant ones):

- sky jobs launch on a small jobs controller to manually trigger OOM and see if the jobs queue can handle it correctly.
- sky launch -c test-queue --gpus L4; for i in `seq 1 10`; do sky exec test-queue --gpus L4:0.25 sleep 10000; done; sky cancel test-queue 1 2; sky queue test-queue
- pytest tests/test_smoke.py --aws (except three tests in test_sky_bench for subprocess.CalledProcessError: Command '['aws', 's3', 'rm', '--recursive', 's3://sky-bench-c174-gcpuser/t-sky-bench-0c']' returned non-zero exit status 1.)
- conda deactivate; bash -i tests/backward_compatibility_tests.sh 1
- sky launch -c test-queue --cloud aws --cpus 2 "echo hi"; for i in `seq 1 7`; do sky exec test-queue "echo hi; sleep 1000" -d; done
- sky exec test-queue "echo hi; sleep 1000" -d should fail for runtime version
- sky queue; sky logs test-queue 2 should correctly run
- sky launch -c test-queue echo hi
- sky cancel test-queue 2; the old pending job scheduled correctly
- sky cancel test-queue 3 4 5; the new pending job scheduled correctly

An issue found:
After `sky cancel` of several jobs, some jobs close to the cancelled ones may turn into the FAILED_DRIVER state. Hypothesis: those other jobs may have been scheduled by the `scheduler.schedule_step` call running inside the driver of a cancelled job, which can make their processes members of the same process group as, or children of, that cancelled job's driver process, so they get killed by the `kill_process_daemon`. We should fix this.
142 sky-cmd 9 mins ago 20 secs ago 20s 1x[CPU:1+] RUNNING ~/sky_logs/sky-2024-11-14-02-05-06-716421
141 sky-cmd 9 mins ago 20 secs ago 20s 1x[CPU:1+] RUNNING ~/sky_logs/sky-2024-11-14-02-05-06-450754
140 sky-cmd 9 mins ago 21 secs ago 21s 1x[CPU:1+] RUNNING ~/sky_logs/sky-2024-11-14-02-05-06-619355
139 sky-cmd 9 mins ago 22 secs ago 22s 1x[CPU:1+] RUNNING ~/sky_logs/sky-2024-11-14-02-05-05-888792
138 sky-cmd 9 mins ago 23 secs ago 23s 1x[CPU:1+] RUNNING ~/sky_logs/sky-2024-11-14-02-05-06-738846
137 sky-cmd 9 mins ago 23 secs ago 23s 1x[CPU:1+] RUNNING ~/sky_logs/sky-2024-11-14-02-05-05-599533
136 sky-cmd 9 mins ago 24 secs ago 24s 1x[CPU:1+] RUNNING ~/sky_logs/sky-2024-11-14-02-05-05-721108
135 sky-cmd 9 mins ago 25 secs ago 25s 1x[CPU:1+] RUNNING ~/sky_logs/sky-2024-11-14-02-05-05-505060
134 sky-cmd 9 mins ago - - 1x[CPU:1+] FAILED_DRIVER ~/sky_logs/sky-2024-11-14-02-05-04-679146
133 sky-cmd 10 mins ago 8 mins ago 7m 46s 1x[CPU:1+] FAILED_DRIVER ~/sky_logs/sky-2024-11-14-02-04-52-103688
132 sky-cmd 10 mins ago 8 mins ago 7m 47s 1x[CPU:1+] FAILED_DRIVER ~/sky_logs/sky-2024-11-14-02-04-51-440670
131 sky-cmd 10 mins ago 8 mins ago 7m 48s 1x[CPU:1+] FAILED_DRIVER ~/sky_logs/sky-2024-11-14-02-04-51-589677
130 sky-cmd 10 mins ago 8 mins ago 7m 49s 1x[CPU:1+] FAILED_DRIVER ~/sky_logs/sky-2024-11-14-02-04-51-925204
129 sky-cmd 10 mins ago 8 mins ago 8m 18s 1x[CPU:1+] FAILED_DRIVER ~/sky_logs/sky-2024-11-14-02-04-51-537510
128 sky-cmd 10 mins ago 8 mins ago 8m 11s 1x[CPU:1+] CANCELLED ~/sky_logs/sky-2024-11-14-02-04-50-996221
127 sky-cmd 10 mins ago 8 mins ago 8m 20s 1x[CPU:1+] FAILED_DRIVER ~/sky_logs/sky-2024-11-14-02-04-51-034328
126 sky-cmd 10 mins ago 8 mins ago 8m 21s 1x[CPU:1+] FAILED_DRIVER ~/sky_logs/sky-2024-11-14-02-04-50-943330
125 sky-cmd 10 mins ago 8 mins ago 8m 22s 1x[CPU:1+] FAILED_DRIVER ~/sky_logs/sky-2024-11-14-02-04-51-022167
124 sky-cmd 10 mins ago 8 mins ago 8m 23s 1x[CPU:1+] FAILED_DRIVER ~/sky_logs/sky-2024-11-14-02-04-50-909505
123 sky-cmd 10 mins ago 8 mins ago 8m 24s 1x[CPU:1+] FAILED_DRIVER ~/sky_logs/sky-2024-11-14-02-04-50-789364
122 sky-cmd 10 mins ago 8 mins ago 8m 25s 1x[CPU:1+] FAILED_DRIVER ~/sky_logs/sky-2024-11-14-02-04-50-775809
121 sky-cmd 10 mins ago 8 mins ago 8m 29s 1x[CPU:1+] FAILED_DRIVER ~/sky_logs/sky-2024-11-14-02-04-50-334967
120 sky-cmd 10 mins ago 8 mins ago 8m 30s 1x[CPU:1+] FAILED_DRIVER ~/sky_logs/sky-2024-11-14-02-04-49-284130
119 sky-cmd 10 mins ago 8 mins ago 8m 31s 1x[CPU:1+] FAILED_DRIVER ~/sky_logs/sky-2024-11-14-02-04-49-842914
118 sky-cmd 10 mins ago 8 mins ago 8m 32s 1x[CPU:1+] FAILED_DRIVER ~/sky_logs/sky-2024-11-14-02-04-49-201166
117 sky-cmd 10 mins ago 8 mins ago 8m 33s 1x[CPU:1+] FAILED_DRIVER ~/sky_logs/sky-2024-11-14-02-04-37-037409
116 sky-cmd 10 mins ago 9 mins ago 8m 34s 1x[CPU:1+] FAILED_DRIVER ~/sky_logs/sky-2024-11-14-02-04-37-084538
115 sky-cmd 10 mins ago 9 mins ago 8m 35s 1x[CPU:1+] FAILED_DRIVER ~/sky_logs/sky-2024-11-14-02-04-36-744599
114 sky-cmd 10 mins ago 9 mins ago 8m 36s 1x[CPU:1+] FAILED_DRIVER ~/sky_logs/sky-2024-11-14-02-04-37-003736
113 sky-cmd 10 mins ago 9 mins ago 8m 37s 1x[CPU:1+] FAILED_DRIVER ~/sky_logs/sky-2024-11-14-02-04-36-864517
112 sky-cmd 10 mins ago 9 mins ago 8m 38s 1x[CPU:1+] FAILED_DRIVER ~/sky_logs/sky-2024-11-14-02-04-36-710345
111 sky-cmd 10 mins ago 9 mins ago 8m 31s 1x[CPU:1+] CANCELLED ~/sky_logs/sky-2024-11-14-02-04-37-177211
The problem is that the `start_new_session` flag we set when starting the job process only creates a new session; the process itself is still a child of the process that spawned it via `subprocess`. As a result, killing a cancelled job's child processes destroys the other jobs as well.
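A minimal standalone sketch of that behavior (illustrative only, not SkyPilot code): `start_new_session=True` gives the child its own session and process group, but it remains our child in the process tree, so a cleanup that recursively terminates descendants still takes it down.

```python
import os
import subprocess
import psutil  # process-tree helper; SkyPilot already depends on psutil

# start_new_session=True makes the child call setsid(): it gets a fresh
# session and process group, so signals sent to our group no longer reach it.
child = subprocess.Popen(['sleep', '60'], start_new_session=True)

print('our sid :', os.getsid(0), ' child sid :', os.getsid(child.pid))             # differ
print('our pid :', os.getpid(), ' child ppid:', psutil.Process(child.pid).ppid())  # equal

# ...but the parent/child relationship is unchanged, so a cleanup that walks
# and kills all descendants (as hypothesized for kill_process_daemon above)
# still catches it, along with anything else forked underneath us.
for proc in psutil.Process(os.getpid()).children(recursive=True):
    proc.terminate()
```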
To reproduce: cancel some of the jobs, then look at the `ps faux` output on the cluster below:
ubuntu 17888 0.0 0.0 7764 3456 ? Ss 03:26 0:00 /bin/bash -c echo "SKYPILOT_JOB_ID <95>" && cd ~/sky_workdir && $([ -s ~/.sky/python_path ] && cat ~/.sky/python_path 2> /dev/null || which python3) -u ~/.sky/sky_app/sky_job_95> ~/sky_logs/sky-2024-11-14-03-25-43-592146/r
ubuntu 17891 0.3 0.1 23941308 115528 ? Sl 03:26 0:02 \_ /home/ubuntu/skypilot-runtime/bin/python -u /home/ubuntu/.sky/sky_app/sky_job_95
ubuntu 18021 0.0 0.0 7764 3456 ? Ss 03:26 0:00 \_ /bin/bash -c echo "SKYPILOT_JOB_ID <96>" && cd ~/sky_workdir && $([ -s ~/.sky/python_path ] && cat ~/.sky/python_path 2> /dev/null || which python3) -u ~/.sky/sky_app/sky_job_96> ~/sky_logs/sky-2024-11-14-03-25-43-
ubuntu 18024 0.3 0.1 23941308 115592 ? Sl 03:26 0:02 \_ /home/ubuntu/skypilot-runtime/bin/python -u /home/ubuntu/.sky/sky_app/sky_job_96
ubuntu 18155 0.0 0.0 7764 3328 ? Ss 03:26 0:00 \_ /bin/bash -c echo "SKYPILOT_JOB_ID <97>" && cd ~/sky_workdir && $([ -s ~/.sky/python_path ] && cat ~/.sky/python_path 2> /dev/null || which python3) -u ~/.sky/sky_app/sky_job_97> ~/sky_logs/sky-2024-11-14-0
ubuntu 18158 0.3 0.1 23942320 115460 ? Sl 03:26 0:02 \_ /home/ubuntu/skypilot-runtime/bin/python -u /home/ubuntu/.sky/sky_app/sky_job_97
ubuntu 18311 0.0 0.0 7764 3456 ? Ss 03:26 0:00 \_ /bin/bash -c echo "SKYPILOT_JOB_ID <98>" && cd ~/sky_workdir && $([ -s ~/.sky/python_path ] && cat ~/.sky/python_path 2> /dev/null || which python3) -u ~/.sky/sky_app/sky_job_98> ~/sky_logs/sky-2024
ubuntu 18314 0.3 0.1 23942320 115840 ? Sl 03:26 0:02 \_ /home/ubuntu/skypilot-runtime/bin/python -u /home/ubuntu/.sky/sky_app/sky_job_98
After investigating the comment here: https://github.com/skypilot-org/skypilot/pull/4318#issuecomment-2475229178
It seems that all the job driver processes we run are under our control, and sending SIGTERM to the process group is enough, as the driver processes will correctly clean up their underlying tasks. For example, take this job process from the process tree above:
ubuntu 18155 0.0 0.0 7764 3328 ? Ss 03:26 0:00 \_ /bin/bash -c echo "SKYPILOT_JOB_ID <97>" && cd ~/sky_workdir && $([ -s ~/.sky/python_path ] && cat ~/.sky/python_path 2> /dev/null || which python3) -u ~/.sky/sky_app/sky_job_97> ~/sky_logs/sky-2024-11-14-0
ubuntu 18158 0.3 0.1 23942320 115460 ? Sl 03:26 0:02 \_ /home/ubuntu/skypilot-runtime/bin/python -u /home/ubuntu/.sky/sky_app/sky_job_97
The job driver process starts the actual user job as a ray task under the raylet, which does not inherit from the driver process, so killing the driver process causes the raylet to clean up the corresponding ray task:
ubuntu 20760 0.2 0.1 23643804 112152 ? SNl 03:27 0:04 \_ ray::sky-cmd,
ubuntu 20957 0.0 0.0 2892 1664 ? SNs 03:27 0:00 | \_ /bin/sh -c /bin/bash -i /tmp/sky_app_cx2gd3_p
ubuntu 20958 0.0 0.0 9188 4992 ? SN 03:27 0:00 | | \_ /bin/bash -i /tmp/sky_app_cx2gd3_p
ubuntu 21086 0.0 0.0 6192 1920 ? SN 03:27 0:00 | | \_ sleep 3600
ubuntu 20966 0.0 0.0 0 0 ? ZN 03:27 0:00 | \_ [python] <defunct>
Hence, we don't need to start a daemon to forcefully kill the process group during cancellation, which significantly reduces the time `sky cancel` takes.
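For reference, "send SIGTERM to the driver's process group" boils down to something like the sketch below (a hypothetical helper, not the PR's actual code; driver_pid stands for the PID recorded for the job driver):

```python
import os
import signal

def terminate_driver_group(driver_pid: int) -> None:
    """Send SIGTERM to a job driver's whole process group.

    Each driver is started with start_new_session=True, so it leads its own
    process group and this signal does not reach other jobs' drivers; the
    raylet then cleans up the ray task that this driver had submitted.
    """
    try:
        os.killpg(os.getpgid(driver_pid), signal.SIGTERM)
    except ProcessLookupError:
        # The driver has already exited; nothing to clean up.
        pass
```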
Our current process tree for the driver processes is not ideal: everything is chained in a single tree, and cancelling a job splits the tree into two. For example, in the tree above, if we cancel job 97, the process tree becomes:
ubuntu 17891 0.2 0.1 23941308 115656 ? Sl 03:26 0:05 \_ /home/ubuntu/skypilot-runtime/bin/python -u /home/ubuntu/.sky/sky_app/sky_job_95
ubuntu 18021 0.0 0.0 7764 3456 ? Ss 03:26 0:00 \_ /bin/bash -c echo "SKYPILOT_JOB_ID <96>" && cd ~/sky_workdir && $([ -s ~/.sky/python_path ] && cat ~/.sky/python_path 2> /dev/null || which python3) -u ~/.sky/sky_app/sky_job_96> ~/sky_logs/sky-2024-11-14-03-25-43-523420
ubuntu 18024 0.2 0.1 23941308 115976 ? Sl 03:26 0:05 \_ /home/ubuntu/skypilot-runtime/bin/python -u /home/ubuntu/.sky/sky_app/sky_job_96
ubuntu 18155 0.0 0.0 0 0 ? Zs 03:26 0:00 \_ [bash] <defunct>
ubuntu 18311 0.0 0.0 7764 3456 ? Ss 03:26 0:00 /bin/bash -c echo "SKYPILOT_JOB_ID <98>" && cd ~/sky_workdir && $([ -s ~/.sky/python_path ] && cat ~/.sky/python_path 2> /dev/null || which python3) -u ~/.sky/sky_app/sky_job_98> ~/sky_logs/sky-2024-11-14-03-25-55-373930/run.log
ubuntu 18314 0.2 0.1 23942320 115840 ? Sl 03:26 0:05 \_ /home/ubuntu/skypilot-runtime/bin/python -u /home/ubuntu/.sky/sky_app/sky_job_98
ubuntu 18444 0.0 0.0 7764 3456 ? Ss 03:26 0:00 \_ /bin/bash -c echo "SKYPILOT_JOB_ID <99>" && cd ~/sky_workdir && $([ -s ~/.sky/python_path ] && cat ~/.sky/python_path 2> /dev/null || which python3) -u ~/.sky/sky_app/sky_job_99> ~/sky_logs/sky-2024-11-14-03-25-56-846300
ubuntu 18447 0.2 0.1 23942332 115840 ? Sl 03:26 0:05 \_ /home/ubuntu/skypilot-runtime/bin/python -u /home/ubuntu/.sky/sky_app/sky_job_99
This works at the moment, but we should move to a more elegant solution that uses skylet as the single parent that starts the job driver processes.
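One possible shape for that (purely illustrative; the names below are hypothetical, not SkyPilot APIs): a single long-lived parent, playing the role of skylet, is the only place drivers are forked from, so every driver is a direct sibling of the others and cancelling one job never touches another job's driver.

```python
import subprocess
from typing import Dict

_drivers: Dict[int, subprocess.Popen] = {}  # job_id -> driver process

def start_driver(job_id: int, driver_cmd: str) -> None:
    # All drivers are direct children of this one parent; each still gets its
    # own session/process group so it can be signalled independently.
    _drivers[job_id] = subprocess.Popen(
        driver_cmd, shell=True, start_new_session=True)

def cancel_driver(job_id: int) -> None:
    proc = _drivers.pop(job_id, None)
    if proc is not None and proc.poll() is None:
        proc.terminate()  # SIGTERM; the raylet cleans up the submitted ray task
```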
Thanks for investigating @Michaelvll. I just confirmed the correctness of `sky cancel` with this misbehaving script:
run: |
  # Trap SIGTERM and ignore it
  trap "" SIGTERM
  for ((i=1; i<=3600; i++)); do
    echo "Count: $i"
    echo "Count: $i" >> /tmp/count.txt
    sleep 1
  done
`sky cancel` indeed kills the process.
With 5650d26, we are now able to avoid the chain of processes. : )
- sky launch -c t-d examples/resnet_distributed_torch.yaml; ssh t-d-worker1 nvidia-smi; sky cancel t-d 1; ssh t-d-worker1 nvidia-smi
- pytest tests/test_smoke.py --aws (except three tests in test_sky_bench for subprocess.CalledProcessError: Command '['aws', 's3', 'rm', '--recursive', 's3://sky-bench-c174-gcpuser/t-sky-bench-0c']' returned non-zero exit status 1.)
- conda deactivate; bash -i tests/backward_compatibility_tests.sh 1
- sky launch -c test-queue --cloud aws --cpus 2 "echo hi"; for i in `seq 1 7`; do sky exec test-queue "echo hi; sleep 1000" -d; done
- sky exec test-queue "echo hi; sleep 1000" -d should fail for runtime version
- sky queue; sky logs test-queue 2 should correctly run
- sky launch -c test-queue echo hi
- sky cancel test-queue 2; the old pending job scheduled correctly
- sky cancel test-queue 3 4 5; the new pending job scheduled correctly
Fixes #4295 and Fixes #4293
`ray job` has introduced a significant delay in our job submission and additional memory consumption. Although Ray job may provide some safeguards for abnormally failed jobs, it does not provide much value for our job management when the status is handled carefully in our own job table. In this PR, we replace `ray job submit` with `subprocess` and add a new state FAILED_DRIVER for jobs, to distinguish user program failures from job driver failures (such as OOM).

Scheduling Speed for unmanaged jobs
The job scheduling is much faster: 60 seconds for 23 jobs (#4310) -> 25 seconds for 32 jobs (after reducing the CPU/job, it can be 60 seconds for 67 jobs), i.e. 0.38 jobs/s -> 1.1 jobs/s (~3x faster).
Job queue with this PR:
```
75 sky-cmd a few secs ago - - 1x[CPU:1+] PENDING ~/sky_logs/sky-2024-11-09-21-57-52-131459
74 sky-cmd a few secs ago < 1 sec < 1s 1x[CPU:1+] RUNNING ~/sky_logs/sky-2024-11-09-21-57-50-439630
73 sky-cmd a few secs ago a few secs ago 2s 1x[CPU:1+] RUNNING ~/sky_logs/sky-2024-11-09-21-57-50-398196
72 sky-cmd a few secs ago a few secs ago 3s 1x[CPU:1+] RUNNING ~/sky_logs/sky-2024-11-09-21-57-46-160367
71 sky-cmd a few secs ago a few secs ago 4s 1x[CPU:1+] RUNNING ~/sky_logs/sky-2024-11-09-21-57-45-831961
70 sky-cmd a few secs ago a few secs ago 5s 1x[CPU:1+] RUNNING ~/sky_logs/sky-2024-11-09-21-57-45-193824
69 sky-cmd a few secs ago a few secs ago 6s 1x[CPU:1+] RUNNING ~/sky_logs/sky-2024-11-09-21-57-44-775659
68 sky-cmd a few secs ago a few secs ago 6s 1x[CPU:1+] RUNNING ~/sky_logs/sky-2024-11-09-21-57-45-117409
67 sky-cmd a few secs ago a few secs ago 7s 1x[CPU:1+] RUNNING ~/sky_logs/sky-2024-11-09-21-57-44-608202
66 sky-cmd a few secs ago a few secs ago 8s 1x[CPU:1+] RUNNING ~/sky_logs/sky-2024-11-09-21-57-43-623994
65 sky-cmd a few secs ago a few secs ago 9s 1x[CPU:1+] RUNNING ~/sky_logs/sky-2024-11-09-21-57-42-985405
64 sky-cmd 15 secs ago 11 secs ago 11s 1x[CPU:1+] RUNNING ~/sky_logs/sky-2024-11-09-21-57-38-759178
63 sky-cmd 15 secs ago 12 secs ago 12s 1x[CPU:1+] RUNNING ~/sky_logs/sky-2024-11-09-21-57-38-652951
62 sky-cmd 15 secs ago 13 secs ago 13s 1x[CPU:1+] RUNNING ~/sky_logs/sky-2024-11-09-21-57-38-725297
61 sky-cmd 16 secs ago 14 secs ago 14s 1x[CPU:1+] RUNNING ~/sky_logs/sky-2024-11-09-21-57-37-587498
60 sky-cmd 16 secs ago 15 secs ago 15s 1x[CPU:1+] RUNNING ~/sky_logs/sky-2024-11-09-21-57-37-990283
59 sky-cmd 16 secs ago 14 secs ago 14s 1x[CPU:1+] RUNNING ~/sky_logs/sky-2024-11-09-21-57-37-744277
58 sky-cmd 17 secs ago 15 secs ago 15s 1x[CPU:1+] RUNNING ~/sky_logs/sky-2024-11-09-21-57-36-753063
57 sky-cmd 17 secs ago 15 secs ago 15s 1x[CPU:1+] RUNNING ~/sky_logs/sky-2024-11-09-21-57-36-662268
56 sky-cmd 22 secs ago 17 secs ago 17s 1x[CPU:1+] RUNNING ~/sky_logs/sky-2024-11-09-21-57-31-550790
55 sky-cmd 22 secs ago 18 secs ago 18s 1x[CPU:1+] RUNNING ~/sky_logs/sky-2024-11-09-21-57-31-190262
54 sky-cmd 23 secs ago 19 secs ago 19s 1x[CPU:1+] RUNNING ~/sky_logs/sky-2024-11-09-21-57-31-334542
53 sky-cmd 23 secs ago 20 secs ago 20s 1x[CPU:1+] RUNNING ~/sky_logs/sky-2024-11-09-21-57-30-513732
52 sky-cmd 23 secs ago 21 secs ago 21s 1x[CPU:1+] RUNNING ~/sky_logs/sky-2024-11-09-21-57-30-478628
51 sky-cmd 23 secs ago 21 secs ago 21s 1x[CPU:1+] RUNNING ~/sky_logs/sky-2024-11-09-21-57-30-467094
50 sky-cmd 24 secs ago 22 secs ago 22s 1x[CPU:1+] RUNNING ~/sky_logs/sky-2024-11-09-21-57-29-433253
49 sky-cmd 24 secs ago 23 secs ago 23s 1x[CPU:1+] RUNNING ~/sky_logs/sky-2024-11-09-21-57-29-319268
48 sky-cmd 29 secs ago 24 secs ago 24s 1x[CPU:1+] RUNNING ~/sky_logs/sky-2024-11-09-21-57-25-158885
47 sky-cmd 29 secs ago 25 secs ago 25s 1x[CPU:1+] RUNNING ~/sky_logs/sky-2024-11-09-21-57-25-001693
46 sky-cmd 29 secs ago 26 secs ago 26s 1x[CPU:1+] RUNNING ~/sky_logs/sky-2024-11-09-21-57-24-605389
45 sky-cmd 30 secs ago 27 secs ago 27s 1x[CPU:1+] RUNNING ~/sky_logs/sky-2024-11-09-21-57-24-055194
44 sky-cmd 30 secs ago 27 secs ago 27s 1x[CPU:1+] RUNNING ~/sky_logs/sky-2024-11-09-21-57-23-610780
43 sky-cmd 31 secs ago 28 secs ago 28s 1x[CPU:1+] RUNNING ~/sky_logs/sky-2024-11-09-21-57-22-962817
42 sky-cmd 31 secs ago 29 secs ago 29s 1x[CPU:1+] RUNNING ~/sky_logs/sky-2024-11-09-21-57-22-456464
41 sky-cmd 31 secs ago 30 secs ago 30s 1x[CPU:1+] RUNNING ~/sky_logs/sky-2024-11-09-21-57-22-371177
40 sky-cmd 35 secs ago 31 secs ago 31s 1x[CPU:1+] RUNNING ~/sky_logs/sky-2024-11-09-21-57-17-681144
39 sky-cmd 35 secs ago 32 secs ago 32s 1x[CPU:1+] RUNNING ~/sky_logs/sky-2024-11-09-21-57-17-808512
38 sky-cmd 36 secs ago 33 secs ago 33s 1x[CPU:1+] RUNNING ~/sky_logs/sky-2024-11-09-21-57-17-623907
37 sky-cmd 36 secs ago 34 secs ago 34s 1x[CPU:1+] RUNNING ~/sky_logs/sky-2024-11-09-21-57-16-894306
36 sky-cmd 37 secs ago 34 secs ago 34s 1x[CPU:1+] RUNNING ~/sky_logs/sky-2024-11-09-21-57-16-245718
35 sky-cmd 37 secs ago 35 secs ago 35s 1x[CPU:1+] RUNNING ~/sky_logs/sky-2024-11-09-21-57-16-223589
34 sky-cmd 38 secs ago 36 secs ago 36s 1x[CPU:1+] RUNNING ~/sky_logs/sky-2024-11-09-21-57-15-132803
33 sky-cmd 38 secs ago 37 secs ago 37s 1x[CPU:1+] RUNNING ~/sky_logs/sky-2024-11-09-21-57-15-401731
32 sky-cmd 43 secs ago 38 secs ago 38s 1x[CPU:1+] RUNNING ~/sky_logs/sky-2024-11-09-21-57-10-874506
31 sky-cmd 43 secs ago 39 secs ago 39s 1x[CPU:1+] RUNNING ~/sky_logs/sky-2024-11-09-21-57-10-813577
30 sky-cmd 43 secs ago 40 secs ago 40s 1x[CPU:1+] RUNNING ~/sky_logs/sky-2024-11-09-21-57-10-840665
29 sky-cmd 44 secs ago 41 secs ago 41s 1x[CPU:1+] RUNNING ~/sky_logs/sky-2024-11-09-21-57-09-734297
28 sky-cmd 44 secs ago 41 secs ago 41s 1x[CPU:1+] RUNNING ~/sky_logs/sky-2024-11-09-21-57-09-642548
27 sky-cmd 44 secs ago 42 secs ago 42s 1x[CPU:1+] RUNNING ~/sky_logs/sky-2024-11-09-21-57-09-273863
26 sky-cmd 45 secs ago 43 secs ago 43s 1x[CPU:1+] RUNNING ~/sky_logs/sky-2024-11-09-21-57-07-539143
25 sky-cmd 45 secs ago 44 secs ago 44s 1x[CPU:1+] RUNNING ~/sky_logs/sky-2024-11-09-21-57-07-453114
24 sky-cmd 50 secs ago 47 secs ago 47s 1x[CPU:1+] RUNNING ~/sky_logs/sky-2024-11-09-21-57-02-913495
23 sky-cmd 50 secs ago 47 secs ago 47s 1x[CPU:1+] RUNNING ~/sky_logs/sky-2024-11-09-21-57-02-727743
22 sky-cmd 50 secs ago 48 secs ago 48s 1x[CPU:1+] RUNNING ~/sky_logs/sky-2024-11-09-21-57-03-117854
21 sky-cmd 51 secs ago 49 secs ago 49s 1x[CPU:1+] RUNNING ~/sky_logs/sky-2024-11-09-21-57-02-473544
20 sky-cmd 51 secs ago 50 secs ago 50s 1x[CPU:1+] RUNNING ~/sky_logs/sky-2024-11-09-21-57-01-902662
19 sky-cmd 51 secs ago 50 secs ago 50s 1x[CPU:1+] RUNNING ~/sky_logs/sky-2024-11-09-21-57-01-557019
18 sky-cmd 52 secs ago 51 secs ago 51s 1x[CPU:1+] RUNNING ~/sky_logs/sky-2024-11-09-21-57-00-507900
17 sky-cmd 53 secs ago 52 secs ago 52s 1x[CPU:1+] RUNNING ~/sky_logs/sky-2024-11-09-21-56-59-803208
16 sky-cmd 58 secs ago 54 secs ago 54s 1x[CPU:1+] RUNNING ~/sky_logs/sky-2024-11-09-21-56-55-498089
15 sky-cmd 58 secs ago 55 secs ago 55s 1x[CPU:1+] RUNNING ~/sky_logs/sky-2024-11-09-21-56-55-517846
14 sky-cmd 58 secs ago 56 secs ago 56s 1x[CPU:1+] RUNNING ~/sky_logs/sky-2024-11-09-21-56-55-392485
13 sky-cmd 58 secs ago 56 secs ago 56s 1x[CPU:1+] RUNNING ~/sky_logs/sky-2024-11-09-21-56-55-580883
12 sky-cmd 58 secs ago 57 secs ago 57s 1x[CPU:1+] RUNNING ~/sky_logs/sky-2024-11-09-21-56-55-577158
11 sky-cmd 59 secs ago 57 secs ago 57s 1x[CPU:1+] RUNNING ~/sky_logs/sky-2024-11-09-21-56-54-964683
10 sky-cmd 59 secs ago 58 secs ago 58s 1x[CPU:1+] RUNNING ~/sky_logs/sky-2024-11-09-21-56-53-712027
9 sky-cmd 1 min ago 59 secs ago 59s 1x[CPU:1+] RUNNING ~/sky_logs/sky-2024-11-09-21-56-53-476885
8 sky-cmd 1 min ago 1 min ago 1m 1x[CPU:1+] RUNNING ~/sky_logs/sky-2024-11-09-21-56-47-424912
7 sky-cmd 1 min ago 1 min ago 1m 1x[CPU:1+] SUCCEEDED ~/sky_logs/sky-2024-11-09-21-56-47-583078
```
Scheduling speed for managed jobs
Fixes #4294: scheduling can now keep up with job submission, with the managed job scheduling time going from 29s to 3.4s, i.e. 8.5x faster.
Memory consumption
The memory consumption issue relates to #4334.

| Scenario | Master | This PR |
| --- | --- | --- |
| 32 jobs running in parallel, many jobs PENDING and being submitted | 8.0 GB | 6.2 GB |
| More jobs (264 in total) | 58 GB (0.21 GB/job) | 42.3 GB (0.16 GB/job) |
Correctness/Robustness
Jobs can get into the following situations:
Tested (run the relevant ones):

- bash format.sh
- sky jobs launch on a small jobs controller to manually trigger OOM and see if the jobs queue can handle it correctly.
- pytest tests/test_smoke.py --aws (except three tests in https://github.com/skypilot-org/skypilot/pull/4198#issuecomment-2466531331 and test_sky_bench for subprocess.CalledProcessError: Command '['aws', 's3', 'rm', '--recursive', 's3://sky-bench-c174-gcpuser/t-sky-bench-0c']' returned non-zero exit status 1.)
- pytest tests/test_smoke.py::test_fill_in_the_name
- conda deactivate; bash -i tests/backward_compatibility_tests.sh 1
- sky launch -c test-queue --cloud aws --cpus 2 "echo hi"; for i in `seq 1 7`; do sky exec test-queue "echo hi; sleep 1000" -d; done
- sky exec test-queue "echo hi; sleep 1000" -d should fail for runtime version
- sky queue; sky logs test-queue 2 should correctly run
- sky launch -c test-queue echo hi
- sky cancel test-queue 2; the old pending job scheduled correctly
- sky cancel test-queue 3 4 5; the new pending job scheduled correctly