SkyPilot: Run AI and batch jobs on any infra (Kubernetes or 12+ clouds). Get unified execution, cost savings, and high GPU availability via a simple interface.
Previously, global optimization was only performed on the user's machine for display purposes, and its results were not propagated to the controller. This PR adds global optimization on the controller side, which is particularly important when data transfer costs are involved (#4320).
To illustrate the effect of global optimization, we test it together with #4320. Without data transfer costs, global optimization produces the same plan as optimizing each task independently; only when tasks share a data dependency can the placement of one task affect the placement of another.
import sky
from sky import Resources
from sky import clouds
import sky.jobs

with sky.Dag() as dag:
    # task1 is pinned to GCP.
    task1 = sky.Task(name="task1", run="echo 'Hello, world!'")
    task1.set_resources(Resources(cpus=8, cloud=clouds.GCP()))
    # task2 may run on any cloud that offers 8 CPUs.
    task2 = sky.Task(name="task2", run="echo 'Hello, world!'")
    task2.set_resources(Resources(cpus=8))
    # Declare a data dependency (API from #4320): task1 writes /tmp/data, task2 reads it.
    (task1 >> task2).with_data('/tmp/data', '/tmp/data', 30)

# Launch the whole DAG as a managed job.
sky.jobs.launch(dag)
Running the code above with the old and the new implementation produces different behavior:
Old Behavior
andyl@DESKTOP-7FP6SMO ~/skypilot (advanced-dag)> sky jobs logs 6 --controller
├── Waiting for task resources on 1 node.
└── Job started. Streaming logs... (Ctrl-C to exit log streaming; job will not be killed)
(sky-a412-andyl, pid=635293) I 11-14 22:14:13 controller.py:576] Controller process 635841 started.
(sky-a412-andyl, pid=635293) I 11-14 22:14:14 controller.py:62] DAG(sky-a412-andyl: task1(task2) task2(-))
(sky-a412-andyl, pid=635293) I 11-14 22:14:14 controller.py:462] Task 0 is submitted to run. To see logs: sky jobs logs 6 --task-id 0
(sky-a412-andyl, pid=635293) I 11-14 22:14:14 controller.py:467] Redirecting output to /home/sky/sky_logs/managed_jobs_6/task_0_launch.log.
(sky-a412-andyl, pid=635293) I 11-14 22:14:14 controller.py:204] Submitted managed job 6 (task: 0, name: 'task1'); SKYPILOT_TASK_ID: sky-managed-2024-11-14-22-14-14-432299_sky-a412-andyl_task1_6-0
(sky-a412-andyl, pid=635293) I 11-14 22:14:14 controller.py:208] Started monitoring.
(sky-a412-andyl, pid=635293) I 11-14 22:14:14 state.py:352] Launching the spot cluster...
(sky-a412-andyl, pid=635293) I 11-14 22:14:17 optimizer.py:830] Target: minimizing cost
(sky-a412-andyl, pid=635293) I 11-14 22:14:17 optimizer.py:843] Estimated cost: $0.4 / hour
(sky-a412-andyl, pid=635293) I 11-14 22:14:17 optimizer.py:843]
(sky-a412-andyl, pid=635293) I 11-14 22:14:17 optimizer.py:776] Print egress plan
(sky-a412-andyl, pid=635293) I 11-14 22:14:17 optimizer.py:978] Considered resources (1 node):
(sky-a412-andyl, pid=635293) I 11-14 22:14:17 optimizer.py:1048] ----------------------------------------------------------------------------------------------
(sky-a412-andyl, pid=635293) I 11-14 22:14:17 optimizer.py:1048] CLOUD INSTANCE vCPUs Mem(GB) ACCELERATORS REGION/ZONE COST ($) CHOSEN
(sky-a412-andyl, pid=635293) I 11-14 22:14:17 optimizer.py:1048] ----------------------------------------------------------------------------------------------
(sky-a412-andyl, pid=635293) I 11-14 22:14:17 optimizer.py:1048] GCP n2-standard-8 8 32 - us-central1-a 0.39 ✔
(sky-a412-andyl, pid=635293) I 11-14 22:14:17 optimizer.py:1048] ----------------------------------------------------------------------------------------------
(sky-a412-andyl, pid=635293) I 11-14 22:14:24 cloud_vm_ray_backend.py:1505] ⚙︎ Launching on GCP us-central1 (us-central1-a).
(sky-a412-andyl, pid=635293) I 11-14 22:15:06 provisioner.py:445] └── Instance is up.
(sky-a412-andyl, pid=635293) I 11-14 22:15:36 provisioner.py:550] ✓ Cluster launched: task1-6. View logs at: ~/sky_logs/sky-2024-11-14-22-14-14-458031/provision.log
(sky-a412-andyl, pid=635293) I 11-14 22:15:36 execution.py:303] ⚙︎ Mounting files.
(sky-a412-andyl, pid=635293) I 11-14 22:15:41 cloud_vm_ray_backend.py:3360] ⚙︎ Job submitted, ID: 1
(sky-a412-andyl, pid=635293) I 11-14 22:15:41 cloud_vm_ray_backend.py:3418]
(sky-a412-andyl, pid=635293) I 11-14 22:15:41 cloud_vm_ray_backend.py:3418] Job ID: 1
(sky-a412-andyl, pid=635293) I 11-14 22:15:41 cloud_vm_ray_backend.py:3418] 📋 Useful Commands
(sky-a412-andyl, pid=635293) I 11-14 22:15:41 cloud_vm_ray_backend.py:3418] ├── To cancel the job: sky cancel task1-6 1
(sky-a412-andyl, pid=635293) I 11-14 22:15:41 cloud_vm_ray_backend.py:3418] ├── To stream job logs: sky logs task1-6 1
(sky-a412-andyl, pid=635293) I 11-14 22:15:41 cloud_vm_ray_backend.py:3418] └── To view job queue: sky queue task1-6
(sky-a412-andyl, pid=635293) I 11-14 22:15:41 cloud_vm_ray_backend.py:3511]
(sky-a412-andyl, pid=635293) I 11-14 22:15:41 cloud_vm_ray_backend.py:3511] Cluster name: task1-6
(sky-a412-andyl, pid=635293) I 11-14 22:15:41 cloud_vm_ray_backend.py:3511] ├── To log into the head VM: ssh task1-6
(sky-a412-andyl, pid=635293) I 11-14 22:15:41 cloud_vm_ray_backend.py:3511] ├── To submit a job: sky exec task1-6 yaml_file
(sky-a412-andyl, pid=635293) I 11-14 22:15:41 cloud_vm_ray_backend.py:3511] ├── To stop the cluster: sky stop task1-6
(sky-a412-andyl, pid=635293) I 11-14 22:15:41 cloud_vm_ray_backend.py:3511] └── To teardown the cluster: sky down task1-6
(sky-a412-andyl, pid=635293)
(sky-a412-andyl, pid=635293) I 11-14 22:15:41 recovery_strategy.py:321] Managed job cluster launched.
(sky-a412-andyl, pid=635293) I 11-14 22:15:45 utils.py:94] === Checking the job status... ===
(sky-a412-andyl, pid=635293) I 11-14 22:15:45 utils.py:100] Job status: JobStatus.SUCCEEDED
(sky-a412-andyl, pid=635293) I 11-14 22:15:45 utils.py:103] ==================================
(sky-a412-andyl, pid=635293) I 11-14 22:15:46 state.py:365] Job started.
(sky-a412-andyl, pid=635293) I 11-14 22:16:07 utils.py:94] === Checking the job status... ===
(sky-a412-andyl, pid=635293) I 11-14 22:16:08 utils.py:100] Job status: JobStatus.SUCCEEDED
(sky-a412-andyl, pid=635293) I 11-14 22:16:08 utils.py:103] ==================================
(sky-a412-andyl, pid=635293) I 11-14 22:16:09 state.py:425] Job succeeded.
(sky-a412-andyl, pid=635293) I 11-14 22:16:09 controller.py:245] Managed job 6 (task: 0) SUCCEEDED. Cleaning up the cluster task1-6.
(sky-a412-andyl, pid=635293) I 11-14 22:16:45 controller.py:479] Task 0 completed.
(sky-a412-andyl, pid=635293) I 11-14 22:16:45 controller.py:419] Task 0 completed with result: True
(sky-a412-andyl, pid=635293) I 11-14 22:16:45 controller.py:462] Task 1 is submitted to run. To see logs: sky jobs logs 6 --task-id 1
(sky-a412-andyl, pid=635293) I 11-14 22:16:45 controller.py:467] Redirecting output to /home/sky/sky_logs/managed_jobs_6/task_1_launch.log.
(sky-a412-andyl, pid=635293) I 11-14 22:16:45 controller.py:204] Submitted managed job 6 (task: 1, name: 'task2'); SKYPILOT_TASK_ID: sky-managed-2024-11-14-22-14-14-432299_sky-a412-andyl_task2_6-1
(sky-a412-andyl, pid=635293) I 11-14 22:16:45 controller.py:208] Started monitoring.
(sky-a412-andyl, pid=635293) I 11-14 22:16:45 state.py:352] Launching the spot cluster...
(sky-a412-andyl, pid=635293) I 11-14 22:16:50 optimizer.py:830] Target: minimizing cost
(sky-a412-andyl, pid=635293) I 11-14 22:16:50 optimizer.py:843] Estimated cost: $0.0 / hour
(sky-a412-andyl, pid=635293) I 11-14 22:16:50 optimizer.py:843]
(sky-a412-andyl, pid=635293) I 11-14 22:16:50 optimizer.py:776] Print egress plan
(sky-a412-andyl, pid=635293) I 11-14 22:16:50 optimizer.py:978] Considered resources (1 node):
(sky-a412-andyl, pid=635293) I 11-14 22:16:50 optimizer.py:1048] ---------------------------------------------------------------------------------------------------
(sky-a412-andyl, pid=635293) I 11-14 22:16:50 optimizer.py:1048] CLOUD INSTANCE vCPUs Mem(GB) ACCELERATORS REGION/ZONE COST ($) CHOSEN
(sky-a412-andyl, pid=635293) I 11-14 22:16:50 optimizer.py:1048] ---------------------------------------------------------------------------------------------------
(sky-a412-andyl, pid=635293) I 11-14 22:16:50 optimizer.py:1048] Kubernetes 8CPU--8GB 8 8 - in-cluster 0.00 ✔
(sky-a412-andyl, pid=635293) I 11-14 22:16:50 optimizer.py:1048] AWS m6i.2xlarge 8 32 - us-east-1 0.38
(sky-a412-andyl, pid=635293) I 11-14 22:16:50 optimizer.py:1048] GCP n2-standard-8 8 32 - us-central1-a 0.39
(sky-a412-andyl, pid=635293) I 11-14 22:16:50 optimizer.py:1048] ---------------------------------------------------------------------------------------------------
(sky-a412-andyl, pid=635293) I 11-14 22:16:51 cloud_vm_ray_backend.py:1500] ⚙︎ Launching on Kubernetes.
(sky-a412-andyl, pid=635293) W 11-14 22:16:51 cloud_vm_ray_backend.py:2017] sky.exceptions.ResourcesUnavailableError: Failed to acquire resources in all zones in in-cluster for {(cpus=8)}.
(sky-a412-andyl, pid=635293) W 11-14 22:16:51 cloud_vm_ray_backend.py:2051]
(sky-a412-andyl, pid=635293) W 11-14 22:16:51 cloud_vm_ray_backend.py:2051] ↺ Trying other potential resources.
(sky-a412-andyl, pid=635293) I 11-14 22:16:54 optimizer.py:830] Target: minimizing cost
(sky-a412-andyl, pid=635293) I 11-14 22:16:54 optimizer.py:843] Estimated cost: $0.4 / hour
(sky-a412-andyl, pid=635293) I 11-14 22:16:54 optimizer.py:843]
(sky-a412-andyl, pid=635293) I 11-14 22:16:54 optimizer.py:776] Print egress plan
(sky-a412-andyl, pid=635293) I 11-14 22:16:54 optimizer.py:978] Considered resources (1 node):
(sky-a412-andyl, pid=635293) I 11-14 22:16:54 optimizer.py:1048] ----------------------------------------------------------------------------------------------
(sky-a412-andyl, pid=635293) I 11-14 22:16:54 optimizer.py:1048] CLOUD INSTANCE vCPUs Mem(GB) ACCELERATORS REGION/ZONE COST ($) CHOSEN
(sky-a412-andyl, pid=635293) I 11-14 22:16:54 optimizer.py:1048] ----------------------------------------------------------------------------------------------
(sky-a412-andyl, pid=635293) I 11-14 22:16:54 optimizer.py:1048] AWS m6i.2xlarge 8 32 - us-east-1 0.38 ✔
(sky-a412-andyl, pid=635293) I 11-14 22:16:54 optimizer.py:1048] GCP n2-standard-8 8 32 - us-central1-a 0.39
(sky-a412-andyl, pid=635293) I 11-14 22:16:54 optimizer.py:1048] ----------------------------------------------------------------------------------------------
(sky-a412-andyl, pid=635293) I 11-14 22:16:55 cloud_vm_ray_backend.py:1505] ⚙︎ Launching on AWS us-east-1 (us-east-1a,us-east-1b,us-east-1c,us-east-1d,us-east-1f).
(sky-a412-andyl, pid=635293) I 11-14 22:17:25 provisioner.py:445] └── Instance is up.
(sky-a412-andyl, pid=635293) I 11-14 22:17:59 provisioner.py:550] ✓ Cluster launched: task2-6. View logs at: ~/sky_logs/sky-2024-11-14-22-16-45-879209/provision.log
(sky-a412-andyl, pid=635293) I 11-14 22:17:59 execution.py:303] ⚙︎ Mounting files.
(sky-a412-andyl, pid=635293) I 11-14 22:18:03 cloud_vm_ray_backend.py:3360] ⚙︎ Job submitted, ID: 1
(sky-a412-andyl, pid=635293) I 11-14 22:18:03 cloud_vm_ray_backend.py:3418]
(sky-a412-andyl, pid=635293) I 11-14 22:18:03 cloud_vm_ray_backend.py:3418] Job ID: 1
(sky-a412-andyl, pid=635293) I 11-14 22:18:03 cloud_vm_ray_backend.py:3418] 📋 Useful Commands
(sky-a412-andyl, pid=635293) I 11-14 22:18:03 cloud_vm_ray_backend.py:3418] ├── To cancel the job: sky cancel task2-6 1
(sky-a412-andyl, pid=635293) I 11-14 22:18:03 cloud_vm_ray_backend.py:3418] ├── To stream job logs: sky logs task2-6 1
(sky-a412-andyl, pid=635293) I 11-14 22:18:03 cloud_vm_ray_backend.py:3418] └── To view job queue: sky queue task2-6
(sky-a412-andyl, pid=635293) I 11-14 22:18:03 cloud_vm_ray_backend.py:3511]
(sky-a412-andyl, pid=635293) I 11-14 22:18:03 cloud_vm_ray_backend.py:3511] Cluster name: task2-6
(sky-a412-andyl, pid=635293) I 11-14 22:18:03 cloud_vm_ray_backend.py:3511] ├── To log into the head VM: ssh task2-6
(sky-a412-andyl, pid=635293) I 11-14 22:18:03 cloud_vm_ray_backend.py:3511] ├── To submit a job: sky exec task2-6 yaml_file
(sky-a412-andyl, pid=635293) I 11-14 22:18:03 cloud_vm_ray_backend.py:3511] ├── To stop the cluster: sky stop task2-6
(sky-a412-andyl, pid=635293) I 11-14 22:18:03 cloud_vm_ray_backend.py:3511] └── To teardown the cluster: sky down task2-6
(sky-a412-andyl, pid=635293)
(sky-a412-andyl, pid=635293) I 11-14 22:18:04 recovery_strategy.py:321] Managed job cluster launched.
(sky-a412-andyl, pid=635293) I 11-14 22:18:06 utils.py:94] === Checking the job status... ===
(sky-a412-andyl, pid=635293) I 11-14 22:18:07 utils.py:100] Job status: JobStatus.SUCCEEDED
(sky-a412-andyl, pid=635293) I 11-14 22:18:07 utils.py:103] ==================================
(sky-a412-andyl, pid=635293) I 11-14 22:18:07 state.py:365] Job started.
(sky-a412-andyl, pid=635293) I 11-14 22:18:28 utils.py:94] === Checking the job status... ===
(sky-a412-andyl, pid=635293) I 11-14 22:18:29 utils.py:100] Job status: JobStatus.SUCCEEDED
(sky-a412-andyl, pid=635293) I 11-14 22:18:29 utils.py:103] ==================================
(sky-a412-andyl, pid=635293) I 11-14 22:18:29 state.py:425] Job succeeded.
(sky-a412-andyl, pid=635293) I 11-14 22:18:29 controller.py:245] Managed job 6 (task: 1) SUCCEEDED. Cleaning up the cluster task2-6.
(sky-a412-andyl, pid=635293) I 11-14 22:18:34 controller.py:479] Task 1 completed.
(sky-a412-andyl, pid=635293) I 11-14 22:18:34 controller.py:419] Task 1 completed with result: True
(sky-a412-andyl, pid=635293) I 11-14 22:18:35 controller.py:593] Killing controller process 635841.
(sky-a412-andyl, pid=635293) I 11-14 22:18:35 controller.py:601] Controller process 635841 killed.
(sky-a412-andyl, pid=635293) I 11-14 22:18:35 controller.py:603] Cleaning up any cluster for job 6.
(sky-a412-andyl, pid=635293) I 11-14 22:18:35 controller.py:612] Cluster of managed job 6 has been cleaned up.
✓ Job finished (status: SUCCEEDED).
New Behavior
andyl@DESKTOP-7FP6SMO ~/skypilot (advanced-dag)> sky jobs logs 7 --controller
├── Waiting for task resources on 1 node.
└── Job started. Streaming logs... (Ctrl-C to exit log streaming; job will not be killed)
(sky-f74c-andyl, pid=635289) I 11-14 22:24:24 controller.py:576] Controller process 711571 started.
(sky-f74c-andyl, pid=635289) I 11-14 22:24:25 controller.py:62] DAG(sky-f74c-andyl: task1(task2) task2(-))
(sky-f74c-andyl, pid=635289) I 11-14 22:24:25 optimizer.py:231] Adding storage node between task1 and task2
(sky-f74c-andyl, pid=635289) I 11-14 22:24:33 optimizer.py:830] Target: minimizing cost
(sky-f74c-andyl, pid=635289) I 11-14 22:24:33 optimizer.py:847] Estimated total runtime: 2.0 hours
(sky-f74c-andyl, pid=635289) I 11-14 22:24:33 optimizer.py:847] Estimated total cost: $0.8
(sky-f74c-andyl, pid=635289) I 11-14 22:24:33 optimizer.py:847]
(sky-f74c-andyl, pid=635289) I 11-14 22:24:33 optimizer.py:950] Best plan:
(sky-f74c-andyl, pid=635289) I 11-14 22:24:33 optimizer.py:955] ------------------------------------------------------------------------------------------------------------
(sky-f74c-andyl, pid=635289) I 11-14 22:24:33 optimizer.py:955] TASK #NODES CLOUD INSTANCE vCPUs Mem(GB) ACCELERATORS REGION/ZONE
(sky-f74c-andyl, pid=635289) I 11-14 22:24:33 optimizer.py:955] ------------------------------------------------------------------------------------------------------------
(sky-f74c-andyl, pid=635289) I 11-14 22:24:33 optimizer.py:955] task1 1 GCP n2-standard-8 8 32 - us-central1-a
(sky-f74c-andyl, pid=635289) I 11-14 22:24:33 optimizer.py:955] task1_to_task2_storage 1 GCP n2-standard-8 8 32 - us-central1-a
(sky-f74c-andyl, pid=635289) I 11-14 22:24:33 optimizer.py:955] task2 1 GCP n2-standard-8 8 32 - us-central1-a
(sky-f74c-andyl, pid=635289) I 11-14 22:24:33 optimizer.py:955] ------------------------------------------------------------------------------------------------------------
(sky-f74c-andyl, pid=635289) I 11-14 22:24:33 optimizer.py:776] Print egress plan
(sky-f74c-andyl, pid=635289) I 11-14 22:24:33 optimizer.py:977]
(sky-f74c-andyl, pid=635289) I 11-14 22:24:33 optimizer.py:978] Considered resources for task 'task1' (1 node):
(sky-f74c-andyl, pid=635289) I 11-14 22:24:33 optimizer.py:1048] ----------------------------------------------------------------------------------------------
(sky-f74c-andyl, pid=635289) I 11-14 22:24:33 optimizer.py:1048] CLOUD INSTANCE vCPUs Mem(GB) ACCELERATORS REGION/ZONE COST ($) CHOSEN
(sky-f74c-andyl, pid=635289) I 11-14 22:24:33 optimizer.py:1048] ----------------------------------------------------------------------------------------------
(sky-f74c-andyl, pid=635289) I 11-14 22:24:33 optimizer.py:1048] GCP n2-standard-8 8 32 - us-central1-a 0.39 ✔
(sky-f74c-andyl, pid=635289) I 11-14 22:24:33 optimizer.py:1048] ----------------------------------------------------------------------------------------------
(sky-f74c-andyl, pid=635289) I 11-14 22:24:33 optimizer.py:977]
(sky-f74c-andyl, pid=635289) I 11-14 22:24:33 optimizer.py:978] Considered resources for task 'task1_to_task2_storage' (1 node):
(sky-f74c-andyl, pid=635289) I 11-14 22:24:33 optimizer.py:1048] ---------------------------------------------------------------------------------------------------
(sky-f74c-andyl, pid=635289) I 11-14 22:24:33 optimizer.py:1048] CLOUD INSTANCE vCPUs Mem(GB) ACCELERATORS REGION/ZONE COST ($) CHOSEN
(sky-f74c-andyl, pid=635289) I 11-14 22:24:33 optimizer.py:1048] ---------------------------------------------------------------------------------------------------
(sky-f74c-andyl, pid=635289) I 11-14 22:24:33 optimizer.py:1048] GCP n2-standard-8 8 32 - us-central1-a 0.00 ✔
(sky-f74c-andyl, pid=635289) I 11-14 22:24:33 optimizer.py:1048] AWS m6i.2xlarge 8 32 - us-east-1 0.00
(sky-f74c-andyl, pid=635289) I 11-14 22:24:33 optimizer.py:1048] Kubernetes 2CPU--2GB 2 2 - in-cluster 0.00
(sky-f74c-andyl, pid=635289) I 11-14 22:24:33 optimizer.py:1048] ---------------------------------------------------------------------------------------------------
(sky-f74c-andyl, pid=635289) I 11-14 22:24:33 optimizer.py:977]
(sky-f74c-andyl, pid=635289) I 11-14 22:24:33 optimizer.py:978] Considered resources for task 'task2' (1 node):
(sky-f74c-andyl, pid=635289) I 11-14 22:24:34 optimizer.py:1048] ---------------------------------------------------------------------------------------------------
(sky-f74c-andyl, pid=635289) I 11-14 22:24:34 optimizer.py:1048] CLOUD INSTANCE vCPUs Mem(GB) ACCELERATORS REGION/ZONE COST ($) CHOSEN
(sky-f74c-andyl, pid=635289) I 11-14 22:24:34 optimizer.py:1048] ---------------------------------------------------------------------------------------------------
(sky-f74c-andyl, pid=635289) I 11-14 22:24:34 optimizer.py:1048] Kubernetes 8CPU--8GB 8 8 - in-cluster 0.00
(sky-f74c-andyl, pid=635289) I 11-14 22:24:34 optimizer.py:1048] AWS m6i.2xlarge 8 32 - us-east-1 0.38
(sky-f74c-andyl, pid=635289) I 11-14 22:24:34 optimizer.py:1048] GCP n2-standard-8 8 32 - us-central1-a 0.39 ✔
(sky-f74c-andyl, pid=635289) I 11-14 22:24:34 optimizer.py:1048] ---------------------------------------------------------------------------------------------------
(sky-f74c-andyl, pid=635289) I 11-14 22:24:37 storage.py:1923] Created GCS bucket 'bucket-for-task1-to-task2-6eabc0cb-5b69b573-1ba2-4513-bc74-405e' in US-CENTRAL1 with storage class STANDARD
(sky-f74c-andyl, pid=635289) I 11-14 22:24:37 controller.py:462] Task 0 is submitted to run. To see logs: sky jobs logs 7 --task-id 0
(sky-f74c-andyl, pid=635289) I 11-14 22:24:37 controller.py:467] Redirecting output to /home/sky/sky_logs/managed_jobs_7/task_0_launch.log.
(sky-f74c-andyl, pid=635289) I 11-14 22:24:37 controller.py:204] Submitted managed job 7 (task: 0, name: 'task1'); SKYPILOT_TASK_ID: sky-managed-2024-11-14-22-24-25-186083_sky-f74c-andyl_task1_7-0
(sky-f74c-andyl, pid=635289) I 11-14 22:24:37 controller.py:208] Started monitoring.
(sky-f74c-andyl, pid=635289) I 11-14 22:24:37 state.py:352] Launching the spot cluster...
(sky-f74c-andyl, pid=635289) I 11-14 22:24:43 cloud_vm_ray_backend.py:1505] ⚙︎ Launching on GCP us-central1 (us-central1-a).
(sky-f74c-andyl, pid=635289) I 11-14 22:25:27 provisioner.py:445] └── Instance is up.
(sky-f74c-andyl, pid=635289) I 11-14 22:25:55 provisioner.py:550] ✓ Cluster launched: task1-7. View logs at: ~/sky_logs/sky-2024-11-14-22-24-37-249479/provision.log
(sky-f74c-andyl, pid=635289) I 11-14 22:25:55 execution.py:303] ⚙︎ Mounting files.
(sky-f74c-andyl, pid=635289) I 11-14 22:25:59 cloud_vm_ray_backend.py:3360] ⚙︎ Job submitted, ID: 1
(sky-f74c-andyl, pid=635289) I 11-14 22:25:59 cloud_vm_ray_backend.py:3418]
(sky-f74c-andyl, pid=635289) I 11-14 22:25:59 cloud_vm_ray_backend.py:3418] Job ID: 1
(sky-f74c-andyl, pid=635289) I 11-14 22:25:59 cloud_vm_ray_backend.py:3418] 📋 Useful Commands
(sky-f74c-andyl, pid=635289) I 11-14 22:25:59 cloud_vm_ray_backend.py:3418] ├── To cancel the job: sky cancel task1-7 1
(sky-f74c-andyl, pid=635289) I 11-14 22:25:59 cloud_vm_ray_backend.py:3418] ├── To stream job logs: sky logs task1-7 1
(sky-f74c-andyl, pid=635289) I 11-14 22:25:59 cloud_vm_ray_backend.py:3418] └── To view job queue: sky queue task1-7
(sky-f74c-andyl, pid=635289) I 11-14 22:25:59 cloud_vm_ray_backend.py:3511]
(sky-f74c-andyl, pid=635289) I 11-14 22:25:59 cloud_vm_ray_backend.py:3511] Cluster name: task1-7
(sky-f74c-andyl, pid=635289) I 11-14 22:25:59 cloud_vm_ray_backend.py:3511] ├── To log into the head VM: ssh task1-7
(sky-f74c-andyl, pid=635289) I 11-14 22:25:59 cloud_vm_ray_backend.py:3511] ├── To submit a job: sky exec task1-7 yaml_file
(sky-f74c-andyl, pid=635289) I 11-14 22:25:59 cloud_vm_ray_backend.py:3511] ├── To stop the cluster: sky stop task1-7
(sky-f74c-andyl, pid=635289) I 11-14 22:25:59 cloud_vm_ray_backend.py:3511] └── To teardown the cluster: sky down task1-7
(sky-f74c-andyl, pid=635289)
(sky-f74c-andyl, pid=635289) I 11-14 22:25:59 recovery_strategy.py:321] Managed job cluster launched.
(sky-f74c-andyl, pid=635289) I 11-14 22:26:03 utils.py:94] === Checking the job status... ===
(sky-f74c-andyl, pid=635289) I 11-14 22:26:03 utils.py:100] Job status: JobStatus.SUCCEEDED
(sky-f74c-andyl, pid=635289) I 11-14 22:26:03 utils.py:103] ==================================
(sky-f74c-andyl, pid=635289) I 11-14 22:26:04 state.py:365] Job started.
(sky-f74c-andyl, pid=635289) I 11-14 22:26:24 utils.py:94] === Checking the job status... ===
(sky-f74c-andyl, pid=635289) I 11-14 22:26:25 utils.py:100] Job status: JobStatus.SUCCEEDED
(sky-f74c-andyl, pid=635289) I 11-14 22:26:25 utils.py:103] ==================================
(sky-f74c-andyl, pid=635289) I 11-14 22:26:26 state.py:425] Job succeeded.
(sky-f74c-andyl, pid=635289) I 11-14 22:26:26 controller.py:245] Managed job 7 (task: 0) SUCCEEDED. Cleaning up the cluster task1-7.
(sky-f74c-andyl, pid=635289) I 11-14 22:26:59 controller.py:479] Task 0 completed.
(sky-f74c-andyl, pid=635289) I 11-14 22:26:59 controller.py:419] Task 0 completed with result: True
(sky-f74c-andyl, pid=635289) I 11-14 22:26:59 controller.py:462] Task 1 is submitted to run. To see logs: sky jobs logs 7 --task-id 1
(sky-f74c-andyl, pid=635289) I 11-14 22:26:59 controller.py:467] Redirecting output to /home/sky/sky_logs/managed_jobs_7/task_1_launch.log.
(sky-f74c-andyl, pid=635289) I 11-14 22:26:59 controller.py:204] Submitted managed job 7 (task: 1, name: 'task2'); SKYPILOT_TASK_ID: sky-managed-2024-11-14-22-24-25-186083_sky-f74c-andyl_task2_7-1
(sky-f74c-andyl, pid=635289) I 11-14 22:26:59 controller.py:208] Started monitoring.
(sky-f74c-andyl, pid=635289) I 11-14 22:26:59 state.py:352] Launching the spot cluster...
(sky-f74c-andyl, pid=635289) I 11-14 22:27:02 cloud_vm_ray_backend.py:1505] ⚙︎ Launching on GCP us-central1 (us-central1-a).
(sky-f74c-andyl, pid=635289) I 11-14 22:27:46 provisioner.py:445] └── Instance is up.
(sky-f74c-andyl, pid=635289) I 11-14 22:28:16 provisioner.py:550] ✓ Cluster launched: task2-7. View logs at: ~/sky_logs/sky-2024-11-14-22-26-59-696270/provision.log
(sky-f74c-andyl, pid=635289) I 11-14 22:28:16 execution.py:303] ⚙︎ Mounting files.
(sky-f74c-andyl, pid=635289) I 11-14 22:28:20 cloud_vm_ray_backend.py:3360] ⚙︎ Job submitted, ID: 1
(sky-f74c-andyl, pid=635289) I 11-14 22:28:20 cloud_vm_ray_backend.py:3418]
(sky-f74c-andyl, pid=635289) I 11-14 22:28:20 cloud_vm_ray_backend.py:3418] Job ID: 1
(sky-f74c-andyl, pid=635289) I 11-14 22:28:20 cloud_vm_ray_backend.py:3418] 📋 Useful Commands
(sky-f74c-andyl, pid=635289) I 11-14 22:28:20 cloud_vm_ray_backend.py:3418] ├── To cancel the job: sky cancel task2-7 1
(sky-f74c-andyl, pid=635289) I 11-14 22:28:20 cloud_vm_ray_backend.py:3418] ├── To stream job logs: sky logs task2-7 1
(sky-f74c-andyl, pid=635289) I 11-14 22:28:20 cloud_vm_ray_backend.py:3418] └── To view job queue: sky queue task2-7
(sky-f74c-andyl, pid=635289) I 11-14 22:28:20 cloud_vm_ray_backend.py:3511]
(sky-f74c-andyl, pid=635289) I 11-14 22:28:20 cloud_vm_ray_backend.py:3511] Cluster name: task2-7
(sky-f74c-andyl, pid=635289) I 11-14 22:28:20 cloud_vm_ray_backend.py:3511] ├── To log into the head VM: ssh task2-7
(sky-f74c-andyl, pid=635289) I 11-14 22:28:20 cloud_vm_ray_backend.py:3511] ├── To submit a job: sky exec task2-7 yaml_file
(sky-f74c-andyl, pid=635289) I 11-14 22:28:20 cloud_vm_ray_backend.py:3511] ├── To stop the cluster: sky stop task2-7
(sky-f74c-andyl, pid=635289) I 11-14 22:28:20 cloud_vm_ray_backend.py:3511] └── To teardown the cluster: sky down task2-7
(sky-f74c-andyl, pid=635289)
(sky-f74c-andyl, pid=635289) I 11-14 22:28:21 recovery_strategy.py:321] Managed job cluster launched.
(sky-f74c-andyl, pid=635289) I 11-14 22:28:24 utils.py:94] === Checking the job status... ===
(sky-f74c-andyl, pid=635289) I 11-14 22:28:25 utils.py:100] Job status: JobStatus.SUCCEEDED
(sky-f74c-andyl, pid=635289) I 11-14 22:28:25 utils.py:103] ==================================
(sky-f74c-andyl, pid=635289) I 11-14 22:28:26 state.py:365] Job started.
(sky-f74c-andyl, pid=635289) I 11-14 22:28:46 utils.py:94] === Checking the job status... ===
(sky-f74c-andyl, pid=635289) I 11-14 22:28:47 utils.py:100] Job status: JobStatus.SUCCEEDED
(sky-f74c-andyl, pid=635289) I 11-14 22:28:47 utils.py:103] ==================================
(sky-f74c-andyl, pid=635289) I 11-14 22:28:48 state.py:425] Job succeeded.
(sky-f74c-andyl, pid=635289) I 11-14 22:28:48 controller.py:245] Managed job 7 (task: 1) SUCCEEDED. Cleaning up the cluster task2-7.
(sky-f74c-andyl, pid=635289) I 11-14 22:29:18 controller.py:479] Task 1 completed.
(sky-f74c-andyl, pid=635289) I 11-14 22:29:18 controller.py:419] Task 1 completed with result: True
(sky-f74c-andyl, pid=635289) I 11-14 22:29:19 controller.py:593] Killing controller process 711571.
(sky-f74c-andyl, pid=635289) I 11-14 22:29:19 controller.py:601] Controller process 711571 killed.
(sky-f74c-andyl, pid=635289) I 11-14 22:29:19 controller.py:603] Cleaning up any cluster for job 7.
(sky-f74c-andyl, pid=635289) I 11-14 22:29:19 controller.py:612] Cluster of managed job 7 has been cleaned up.
✓ Job finished (status: SUCCEEDED).
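We can see that although AWS is the cheaper choice for the second task in isolation, the overall cost including the cross-cloud data transfer fee is higher. The new implementation therefore keeps task2 on GCP, next to task1 and the intermediate GCS bucket, which is the correct behavior.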
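The trade-off can be checked with a small back-of-the-envelope sketch. It is not the optimizer code from this PR; it only enumerates placements for a two-task chain. The hourly prices are taken from the optimizer tables in the logs. The one-hour task duration, the reading of `30` as 30 GB, and the flat $0.12/GB cross-cloud egress rate are assumptions made for illustration. Kubernetes is left out because it could not be provisioned in the old run.

```python
from itertools import product

# Hourly prices from the optimizer tables in the logs.
HOURLY_PRICE = {"GCP": 0.39, "AWS": 0.38}

# Assumptions for illustration only (not taken from the PR):
TASK_HOURS = 1.0      # each task runs for one hour
DATA_GB = 30          # the edge task1 -> task2 carries 30 GB
EGRESS_PER_GB = 0.12  # hypothetical flat cross-cloud egress rate in $/GB


def egress_cost(src: str, dst: str) -> float:
    """Transfer cost of the edge; free when both tasks share a cloud."""
    return 0.0 if src == dst else DATA_GB * EGRESS_PER_GB


def plan_cost(plan: tuple) -> float:
    """Compute cost of every task plus transfer cost of every edge in the chain."""
    compute = sum(HOURLY_PRICE[cloud] * TASK_HOURS for cloud in plan)
    transfer = sum(egress_cost(a, b) for a, b in zip(plan, plan[1:]))
    return compute + transfer


# task1 is pinned to GCP; task2 may run on AWS or GCP.
candidates = [["GCP"], ["AWS", "GCP"]]

# Per-task optimization: pick the cheapest cloud for each task in isolation.
greedy_plan = tuple(min(clouds, key=HOURLY_PRICE.get) for clouds in candidates)

# Global optimization: pick the placement with the lowest total cost.
global_plan = min(product(*candidates), key=plan_cost)

print("per-task:", greedy_plan, round(plan_cost(greedy_plan), 2))
print("global:  ", global_plan, round(plan_cost(global_plan), 2))
```

Under these assumptions the per-task choice sends task2 to AWS and pays for the transfer, while the global choice keeps both tasks on GCP at roughly $0.78, which matches the "Estimated total cost: $0.8" line in the new controller log.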
Tested (run the relevant ones):
bash format.sh
pytest tests/test_smoke.py
pytest tests/test_smoke.py::test_fill_in_the_name
conda deactivate; bash -i tests/backward_compatibility_tests.sh