Closed cg505 closed 3 weeks ago
Submitted 10 jobs in ~40s - nice!
for i in {1..10}; do
sky jobs launch -d -y --fast --cpus 2+ -- echo hi2 &
done
wait
However, the controller runs only the first few jobs and then fails. This is probably unrelated to this PR:
Managed jobs
No in-progress managed jobs.
ID  TASK  NAME     RESOURCES   SUBMITTED    TOT. DURATION  JOB DURATION  #RECOVERIES  STATUS
17  -     sky-cmd  1x[CPU:2+]  5 mins ago   4m 50s         -             0            FAILED_CONTROLLER
16  -     sky-cmd  1x[CPU:2+]  5 mins ago   4m 54s         -             0            FAILED_CONTROLLER
15  -     sky-cmd  1x[CPU:2+]  5 mins ago   5m 5s          -             0            FAILED_CONTROLLER
14  -     sky-cmd  1x[CPU:2+]  6 mins ago   5m 9s          -             0            FAILED_CONTROLLER
13  -     sky-cmd  1x[CPU:2+]  6 mins ago   5m 15s         -             0            FAILED_CONTROLLER
12  -     sky-cmd  1x[CPU:2+]  6 mins ago   5m 24s         -             0            FAILED_CONTROLLER
11  -     sky-cmd  1x[CPU:2+]  6 mins ago   1m 1s          5s            0            SUCCEEDED
10  -     sky-cmd  1x[CPU:2+]  6 mins ago   1m 2s          5s            0            SUCCEEDED
9   -     sky-cmd  1x[CPU:2+]  6 mins ago   1m 4s          6s            0            SUCCEEDED
8   -     sky-cmd  1x[CPU:2+]  6 mins ago   1m 13s         6s            0            SUCCEEDED
7   -     sky-cmd  1x[CPU:2+]  13 mins ago  1m 47s         -             0            FAILED_CONTROLLER
6   -     sky-cmd  1x[CPU:2+]  13 mins ago  1m 1s          5s            0            SUCCEEDED
5   -     sky-cmd  1x[CPU:2+]  13 mins ago  1m 2s          5s            0            SUCCEEDED
4   -     sky-cmd  1x[CPU:2+]  13 mins ago  1m 3s          5s            0            SUCCEEDED
3   -     sky-cmd  1x[CPU:2+]  13 mins ago  1m 5s          5s            0            SUCCEEDED
2   -     sky-cmd  1x[CPU:1+]  18 mins ago  58s            4s            0            SUCCEEDED
1   -     sky-cmd  1x[CPU:1+]  21 mins ago  1m 12s         4s            0            SUCCEEDED
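For a quick summary of a queue like the one above, a small helper can tally the STATUS column. This is only a sketch: the function name count_job_statuses is made up for illustration, and it assumes the status is the last whitespace-separated field of each row and that the output is plain text with no color codes.

```shell
# Tally the STATUS column (last field) of managed-job rows read from stdin.
# Only rows whose first field is a numeric job ID are counted, so the
# "Managed jobs" banner and the column header are skipped.
count_job_statuses() {
  awk '$1 ~ /^[0-9]+$/ { counts[$NF]++ }
       END { for (s in counts) print s, counts[s] }' | sort
}
```

Piping the job queue output into count_job_statuses prints one line per status with its count. For the table above that would be 7 FAILED_CONTROLLER and 10 SUCCEEDED.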
sky jobs logs --controller isn't very helpful:
(base) ➜ ~ sky jobs logs --controller 16
D 10-31 12:29:36 skypilot_config.py:228] Using config path: /Users/romilb/.sky/config.yaml
D 10-31 12:29:36 skypilot_config.py:233] Config loaded:
D 10-31 12:29:36 skypilot_config.py:233] {'allowed_clouds': ['aws', 'gcp'],
D 10-31 12:29:36 skypilot_config.py:233] 'jobs': {'controller': {'resources': {'cpus': '4+', 'memory': '4+'}}},
D 10-31 12:29:36 skypilot_config.py:233] 'kubernetes': {'pod_config': {'spec': {'containers': [{'env': [{'name': 'MY_ENV_VAR',
D 10-31 12:29:36 skypilot_config.py:233] 'value': 'my_value'}]}]}}}}
D 10-31 12:29:36 skypilot_config.py:245] Config syntax check passed.
D 10-31 12:29:37 backend_utils.py:1937] Refreshing status: Failed get the lock for cluster 'sky-jobs-controller-2ea485ea'. Using the cached status.
├── Waiting for task resources on 1 node.
└── Job started. Streaming logs... (Ctrl-C to exit log streaming; job will not be killed)
(sky-cmd, pid=15103) D 10-31 19:22:01 skypilot_config.py:228] Using config path: /home/gcpuser/.sky/managed_jobs/sky-cmd-3008.config_yaml
(sky-cmd, pid=15103) D 10-31 19:22:01 skypilot_config.py:233] Config loaded:
(sky-cmd, pid=15103) D 10-31 19:22:01 skypilot_config.py:233] {'allowed_clouds': ['aws', 'gcp'],
(sky-cmd, pid=15103) D 10-31 19:22:01 skypilot_config.py:233] 'jobs': {'controller': {'resources': {'cpus': '4+', 'memory': '4+'}}},
(sky-cmd, pid=15103) D 10-31 19:22:01 skypilot_config.py:233] 'kubernetes': {'pod_config': {'spec': {'containers': [{'env': [{'name': 'MY_ENV_VAR',
(sky-cmd, pid=15103) D 10-31 19:22:01 skypilot_config.py:233] 'value': 'my_value'}]}]}}}}
(sky-cmd, pid=15103) D 10-31 19:22:01 skypilot_config.py:245] Config syntax check passed.
(sky-cmd, pid=15103) D 10-31 19:22:02 skypilot_config.py:228] Using config path: /home/gcpuser/.sky/managed_jobs/sky-cmd-3008.config_yaml
(sky-cmd, pid=15103) D 10-31 19:22:02 skypilot_config.py:233] Config loaded:
(sky-cmd, pid=15103) D 10-31 19:22:02 skypilot_config.py:233] {'allowed_clouds': ['aws', 'gcp'],
(sky-cmd, pid=15103) D 10-31 19:22:02 skypilot_config.py:233] 'jobs': {'controller': {'resources': {'cpus': '4+', 'memory': '4+'}}},
(sky-cmd, pid=15103) D 10-31 19:22:02 skypilot_config.py:233] 'kubernetes': {'pod_config': {'spec': {'containers': [{'env': [{'name': 'MY_ENV_VAR',
(sky-cmd, pid=15103) D 10-31 19:22:02 skypilot_config.py:233] 'value': 'my_value'}]}]}}}}
(sky-cmd, pid=15103) D 10-31 19:22:02 skypilot_config.py:245] Config syntax check passed.
(sky-cmd, pid=15103) I 10-31 19:22:02 controller.py:53] DAG:
(sky-cmd, pid=15103) I 10-31 19:22:02 controller.py:53] [Task<name=sky-cmd>(run='echo hi2')
(sky-cmd, pid=15103) I 10-31 19:22:02 controller.py:53] resources: <Cloud>(cpus=2+)]
(sky-cmd, pid=15103) I 10-31 19:22:02 controller.py:180] Submitted managed job 16 (task: 0, name: 'sky-cmd'); SKYPILOT_TASK_ID: sky-managed-2024-10-31-19-22-02-711936_sky-cmd_16-0
(sky-cmd, pid=15103) I 10-31 19:22:02 controller.py:184] Started monitoring.
(sky-cmd, pid=15103) I 10-31 19:22:02 state.py:337] Launching the spot cluster...
(sky-cmd, pid=15103) D 10-31 19:22:02 skypilot_config.py:146] User config: allowed_clouds -> ['aws', 'gcp']
(sky-cmd, pid=15103) D 10-31 19:22:02 optimizer.py:292] #### Task<name=sky-cmd>(run='echo hi2')
(sky-cmd, pid=15103) D 10-31 19:22:02 optimizer.py:292] resources: <Cloud>(cpus=2+) ####
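Most of the log above is debug-level config dumps. One way to cut the noise when scanning controller logs is to drop those lines. The sketch below assumes the format seen above, where the level letter D sits at the start of the line or right after a "(name, pid=N) " prefix and is followed by an MM-DD date. The function name drop_debug_lines is hypothetical. It will not surface the failure itself, it only makes the remaining lines easier to read.

```shell
# Drop debug-level lines from logs read on stdin.
# Assumes the level letter "D" appears either at the start of the line or
# right after a "(name, pid=N) " prefix, followed by an MM-DD date.
drop_debug_lines() {
  grep -Ev '^(\([^)]*\) )?D [0-9]{2}-[0-9]{2} '
}
```

Piping the output of sky jobs logs --controller 16 through drop_debug_lines would leave only the non-debug lines, such as the controller.py and state.py messages.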
This flag will make the jobs controller launch use sky launch --fast. There are a few known situations where this can cause misbehavior in the jobs controller:
[...] sky check, the cloud dependencies may not be correctly installed.
However, this does speed up jobs launch significantly, so we provide it as a dangerous option. Soon we will add robustness checks to sky launch --fast that will fix the above caveats, and we can remove this flag and just enable the behavior by default.
Tested (run the relevant ones):
bash format.sh
conda deactivate; bash -i tests/backward_compatibility_tests.sh