skypilot-org / skypilot

SkyPilot: Run AI and batch jobs on any infra (Kubernetes or 12+ clouds). Get unified execution, cost savings, and high GPU availability via a simple interface.
https://skypilot.readthedocs.io
Apache License 2.0

[ux] add sky jobs launch --fast #4231

Closed (cg505 closed 3 weeks ago)

cg505 commented 3 weeks ago

This flag will make the jobs controller launch use sky launch --fast. There are a few known situations where this can cause misbehavior in the jobs controller:

However, this does speed up jobs launch significantly, so this PR provides it as a dangerous, opt-in option. Soon we will add robustness checks to sky launch --fast that will fix the above caveats, at which point we can remove this flag and enable the behavior by default.

Tested (run the relevant ones):

romilbhardwaj commented 3 weeks ago

Submitted 10 jobs in ~40s - nice!

for i in {1..10}; do
  sky jobs launch -d -y --fast --cpus 2+ -- echo hi2 &
done
wait
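
To put a number on the ~40s figure, the loop above can be wrapped in a small timing helper. This is only a sketch: submit_batch is a hypothetical name, not part of SkyPilot, and it measures wall-clock time with date +%s, so the resolution is one second.

```shell
# Hypothetical helper (not part of SkyPilot): run a command N times in
# parallel, wait for all of them, and report the elapsed wall-clock time.
submit_batch() {
  sb_n="$1"
  shift
  sb_start=$(date +%s)
  sb_i=1
  while [ "$sb_i" -le "$sb_n" ]; do
    "$@" &              # launch each submission in the background
    sb_i=$((sb_i + 1))
  done
  wait                  # block until every background submission returns
  sb_end=$(date +%s)
  echo "submitted $sb_n jobs in $((sb_end - sb_start))s"
}
```

Usage, with the same command as the loop above: submit_batch 10 sky jobs launch -d -y --fast --cpus 2+ -- echo hi2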

However, the controller runs only the first few jobs and then fails. Probably unrelated to this PR:

Managed jobs
No in-progress managed jobs.
ID  TASK  NAME     RESOURCES   SUBMITTED    TOT. DURATION  JOB DURATION  #RECOVERIES  STATUS
17  -     sky-cmd  1x[CPU:2+]  5 mins ago   4m 50s         -             0            FAILED_CONTROLLER
16  -     sky-cmd  1x[CPU:2+]  5 mins ago   4m 54s         -             0            FAILED_CONTROLLER
15  -     sky-cmd  1x[CPU:2+]  5 mins ago   5m 5s          -             0            FAILED_CONTROLLER
14  -     sky-cmd  1x[CPU:2+]  6 mins ago   5m 9s          -             0            FAILED_CONTROLLER
13  -     sky-cmd  1x[CPU:2+]  6 mins ago   5m 15s         -             0            FAILED_CONTROLLER
12  -     sky-cmd  1x[CPU:2+]  6 mins ago   5m 24s         -             0            FAILED_CONTROLLER
11  -     sky-cmd  1x[CPU:2+]  6 mins ago   1m 1s          5s            0            SUCCEEDED
10  -     sky-cmd  1x[CPU:2+]  6 mins ago   1m 2s          5s            0            SUCCEEDED
9   -     sky-cmd  1x[CPU:2+]  6 mins ago   1m 4s          6s            0            SUCCEEDED
8   -     sky-cmd  1x[CPU:2+]  6 mins ago   1m 13s         6s            0            SUCCEEDED
7   -     sky-cmd  1x[CPU:2+]  13 mins ago  1m 47s         -             0            FAILED_CONTROLLER
6   -     sky-cmd  1x[CPU:2+]  13 mins ago  1m 1s          5s            0            SUCCEEDED
5   -     sky-cmd  1x[CPU:2+]  13 mins ago  1m 2s          5s            0            SUCCEEDED
4   -     sky-cmd  1x[CPU:2+]  13 mins ago  1m 3s          5s            0            SUCCEEDED
3   -     sky-cmd  1x[CPU:2+]  13 mins ago  1m 5s          5s            0            SUCCEEDED
2   -     sky-cmd  1x[CPU:1+]  18 mins ago  58s            4s            0            SUCCEEDED
1   -     sky-cmd  1x[CPU:1+]  21 mins ago  1m 12s         4s            0            SUCCEEDED
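
A quick way to summarize a table like this is to tally the last column. The sketch below assumes the table is plain text on stdin (for example piped from sky jobs queue), that job rows start with a numeric ID, and that STATUS is the last whitespace-separated field; tally_statuses is a hypothetical helper name.

```shell
# Hypothetical helper: count managed jobs per STATUS from a queue table on stdin.
# Assumes job rows start with a numeric ID and STATUS is the last field;
# header and banner lines are skipped because their first field is not numeric.
tally_statuses() {
  awk '$1 ~ /^[0-9]+$/ { count[$NF]++ }
       END { for (s in count) print s, count[s] }' | sort
}
```

Usage: sky jobs queue | tally_statuses. For the 17 rows above this gives 7 FAILED_CONTROLLER and 10 SUCCEEDED.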

sky jobs logs --controller isn't very helpful:

(base) ➜  ~ sky jobs logs --controller 16
D 10-31 12:29:36 skypilot_config.py:228] Using config path: /Users/romilb/.sky/config.yaml
D 10-31 12:29:36 skypilot_config.py:233] Config loaded:
D 10-31 12:29:36 skypilot_config.py:233] {'allowed_clouds': ['aws', 'gcp'],
D 10-31 12:29:36 skypilot_config.py:233]  'jobs': {'controller': {'resources': {'cpus': '4+', 'memory': '4+'}}},
D 10-31 12:29:36 skypilot_config.py:233]  'kubernetes': {'pod_config': {'spec': {'containers': [{'env': [{'name': 'MY_ENV_VAR',
D 10-31 12:29:36 skypilot_config.py:233]                                                                  'value': 'my_value'}]}]}}}}
D 10-31 12:29:36 skypilot_config.py:245] Config syntax check passed.
D 10-31 12:29:37 backend_utils.py:1937] Refreshing status: Failed get the lock for cluster 'sky-jobs-controller-2ea485ea'. Using the cached status.
├── Waiting for task resources on 1 node.
└── Job started. Streaming logs... (Ctrl-C to exit log streaming; job will not be killed)
(sky-cmd, pid=15103) D 10-31 19:22:01 skypilot_config.py:228] Using config path: /home/gcpuser/.sky/managed_jobs/sky-cmd-3008.config_yaml
(sky-cmd, pid=15103) D 10-31 19:22:01 skypilot_config.py:233] Config loaded:
(sky-cmd, pid=15103) D 10-31 19:22:01 skypilot_config.py:233] {'allowed_clouds': ['aws', 'gcp'],
(sky-cmd, pid=15103) D 10-31 19:22:01 skypilot_config.py:233]  'jobs': {'controller': {'resources': {'cpus': '4+', 'memory': '4+'}}},
(sky-cmd, pid=15103) D 10-31 19:22:01 skypilot_config.py:233]  'kubernetes': {'pod_config': {'spec': {'containers': [{'env': [{'name': 'MY_ENV_VAR',
(sky-cmd, pid=15103) D 10-31 19:22:01 skypilot_config.py:233]                                                                  'value': 'my_value'}]}]}}}}
(sky-cmd, pid=15103) D 10-31 19:22:01 skypilot_config.py:245] Config syntax check passed.
(sky-cmd, pid=15103) D 10-31 19:22:02 skypilot_config.py:228] Using config path: /home/gcpuser/.sky/managed_jobs/sky-cmd-3008.config_yaml
(sky-cmd, pid=15103) D 10-31 19:22:02 skypilot_config.py:233] Config loaded:
(sky-cmd, pid=15103) D 10-31 19:22:02 skypilot_config.py:233] {'allowed_clouds': ['aws', 'gcp'],
(sky-cmd, pid=15103) D 10-31 19:22:02 skypilot_config.py:233]  'jobs': {'controller': {'resources': {'cpus': '4+', 'memory': '4+'}}},
(sky-cmd, pid=15103) D 10-31 19:22:02 skypilot_config.py:233]  'kubernetes': {'pod_config': {'spec': {'containers': [{'env': [{'name': 'MY_ENV_VAR',
(sky-cmd, pid=15103) D 10-31 19:22:02 skypilot_config.py:233]                                                                  'value': 'my_value'}]}]}}}}
(sky-cmd, pid=15103) D 10-31 19:22:02 skypilot_config.py:245] Config syntax check passed.
(sky-cmd, pid=15103) I 10-31 19:22:02 controller.py:53] DAG:
(sky-cmd, pid=15103) I 10-31 19:22:02 controller.py:53] [Task<name=sky-cmd>(run='echo hi2')
(sky-cmd, pid=15103) I 10-31 19:22:02 controller.py:53]   resources: <Cloud>(cpus=2+)]
(sky-cmd, pid=15103) I 10-31 19:22:02 controller.py:180] Submitted managed job 16 (task: 0, name: 'sky-cmd'); SKYPILOT_TASK_ID: sky-managed-2024-10-31-19-22-02-711936_sky-cmd_16-0
(sky-cmd, pid=15103) I 10-31 19:22:02 controller.py:184] Started monitoring.
(sky-cmd, pid=15103) I 10-31 19:22:02 state.py:337] Launching the spot cluster...
(sky-cmd, pid=15103) D 10-31 19:22:02 skypilot_config.py:146] User config: allowed_clouds -> ['aws', 'gcp']
(sky-cmd, pid=15103) D 10-31 19:22:02 optimizer.py:292] #### Task<name=sky-cmd>(run='echo hi2')
(sky-cmd, pid=15103) D 10-31 19:22:02 optimizer.py:292]   resources: <Cloud>(cpus=2+) ####