skypilot-org / skypilot

SkyPilot: Run AI and batch jobs on any infra (Kubernetes or 12+ clouds). Get unified execution, cost savings, and high GPU availability via a simple interface.
https://skypilot.readthedocs.io
Apache License 2.0

[Spot][Serve] Cudo Compute cannot host controllers #3871

Open Stealthwriter opened 2 weeks ago

Stealthwriter commented 2 weeks ago

When I try to run SkyServe on Cudo instances, I get this error:

`I 08-26 12:01:51 cloud_vm_ray_backend.py:4354] Tip: to reuse an existing cluster, specify --cluster (-c). Run `sky status` to see existing clusters.
W 08-26 12:01:51 cloud_vm_ray_backend.py:1986] sky.exceptions.NotSupportedError: The following features are not supported by Cudo:
W 08-26 12:01:51 cloud_vm_ray_backend.py:1986] Feature           Reason
W 08-26 12:01:51 cloud_vm_ray_backend.py:1986] stop              Stopping not supported.
W 08-26 12:01:51 cloud_vm_ray_backend.py:1986] host_controllers  Cudo Compute cannot host a controller as it does not autostopping, which will leave the controller to run indefinitely.
W 08-26 12:01:51 cloud_vm_ray_backend.py:2012]
W 08-26 12:01:51 cloud_vm_ray_backend.py:2012] Provision failed for 1x Cudo(intel-broadwell-rtx-3080_8x1v4gb, {'RTX3080': 1}, disk_size=200, ports=['30001-30020']) in ca-montreal-1. Trying other locations (if any).
E 08-26 12:01:51 cloud_vm_ray_backend.py:2790] Failed to provision all possible launchable resources. Relax the task's resource requirements: 1x <Cloud>(cpus=4+, disk_size=200, ports=['30001-30020'])
I 08-26 12:01:51 cloud_vm_ray_backend.py:2794] === Retry until up ===
I 08-26 12:01:51 cloud_vm_ray_backend.py:2794] Retrying provisioning after 41s (backoff with random jittering). Already tried 1 attempt.
`

How can I fix that?

Michaelvll commented 2 weeks ago

Hi @Stealthwriter, since Cudo does not support stopping an instance, we cannot launch our controller on Cudo at the moment. Would it be possible for you to customize the controller resources, so that we launch the controller on another cloud and your replicas on Cudo? See https://skypilot.readthedocs.io/en/latest/serving/sky-serve.html#customizing-skyserve-controller-resources
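To make the constraint concrete, here is a minimal Python sketch of the reasoning: a controller is long-lived, so it depends on autostop, which in turn needs a cloud that can stop instances. `CloudCapabilities`, `can_host_controller`, and `pick_controller_cloud` are hypothetical names for illustration and are not SkyPilot's API. The Cudo value reflects the log above; the GCP value is an assumption.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class CloudCapabilities:
    """Toy capability record for illustration; not SkyPilot's data model."""
    name: str
    supports_stop: bool  # can instances be stopped (required for autostop)?


def can_host_controller(cloud: CloudCapabilities) -> bool:
    # Without stop support there is no autostop, so a controller placed
    # on that cloud would keep running indefinitely.
    return cloud.supports_stop


def pick_controller_cloud(
        candidates: Iterable[CloudCapabilities]) -> Optional[str]:
    """Return the first candidate that can host a controller, else None."""
    for cloud in candidates:
        if can_host_controller(cloud):
            return cloud.name
    return None


# Cudo's value comes from the error log; GCP's value is an assumption.
CUDO = CloudCapabilities(name="cudo", supports_stop=False)
GCP = CloudCapabilities(name="gcp", supports_stop=True)

print(pick_controller_cloud([CUDO, GCP]))
```

In this toy model the replicas can still target Cudo; only the controller placement is filtered.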

Stealthwriter commented 2 weeks ago

To do this, do I have to install from source? Otherwise, how do I access this config file if I installed with pip?

Stealthwriter commented 2 weeks ago

I tried to open config.yaml to edit it but it's empty

Michaelvll commented 2 weeks ago

You don't have to install from source. SkyPilot will read from ~/.sky/config.yaml.

Yes, it will be empty if you did not put anything in there. Please add a section there such as:

serve:
  # NOTE: these settings only take effect for a new SkyServe controller, not if
  # you have an existing one.
  controller:
    resources:
      # All configs below are optional.
      # Specify the location of the SkyServe controller.
      cloud: gcp
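
If you would rather script this than edit the file by hand, the following Python sketch writes the section above only when the config file is missing or blank, which is the case described in this thread. `render_controller_config` and `write_if_empty` are hypothetical helper names, not part of SkyPilot. The demo writes to a temporary directory; the real file lives at ~/.sky/config.yaml.

```python
import tempfile
from pathlib import Path


def render_controller_config(cloud: str) -> str:
    """Return the serve.controller.resources section as YAML text."""
    return (
        "serve:\n"
        "  controller:\n"
        "    resources:\n"
        f"      cloud: {cloud}\n"
    )


def write_if_empty(config_path: Path, cloud: str) -> bool:
    """Write the section only if the file is missing or blank.

    Returns True if the file was written, False if existing content
    was left untouched (merging into it is out of scope here).
    """
    if config_path.exists() and config_path.read_text().strip():
        return False
    config_path.parent.mkdir(parents=True, exist_ok=True)
    config_path.write_text(render_controller_config(cloud))
    return True


# Demo against a temporary directory instead of the real ~/.sky/config.yaml.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / ".sky" / "config.yaml"
    first_write = write_if_empty(demo_path, "gcp")
    second_write = write_if_empty(demo_path, "aws")
    written_text = demo_path.read_text()

print(written_text)
```

The second call is skipped on purpose, so an existing config is never overwritten. As noted in the YAML comment, the setting only takes effect for a new SkyServe controller.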

Stealthwriter commented 2 weeks ago

Ohh, that makes sense now. I will try it.

cblmemo commented 2 weeks ago

From a DM with @Stealthwriter: the workaround (using another cloud for the controller) works. We still need to think about how to support hosting controllers on Cudo, though.

Stealthwriter commented 2 weeks ago

Worked, thanks! However, Cudo didn't spin up the instances; it got stuck at provisioning:

08-26 21:12:55 provisioner.py:65] Launching on Cudo no-luster-1 (all zones)
W 08-26 21:12:55 instance.py:89] run_instances: 'VirtualMachinesApi' object has no attribute 'list_vm_machine_types'
D 08-26 21:12:55 provisioner.py:171] Failed to provision 'humanizer-1' on Cudo (all zones).
D 08-26 21:12:55 provisioner.py:173] bulk_provision for 'humanizer-1' failed. Stacktrace:
D 08-26 21:12:55 provisioner.py:173] AttributeError: 'VirtualMachinesApi' object has no attribute 'list_vm_machine_types'. Did you mean: 'list_vm_machine_types2'?
D 08-26 21:12:55 provisioner.py:173]
D 08-26 21:12:55 provisioner.py:178] Terminating the failed cluster.
D 08-26 21:12:56 metadata_utils.py:115] Remove metadata of cluster humanizer-1-b10f.
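
The traceback suggests that the installed SDK object exposes `list_vm_machine_types2` rather than `list_vm_machine_types`, which could point to a version mismatch between the installed Cudo SDK and the one SkyPilot expects (an assumption; the thread does not confirm the cause). A defensive pattern for this kind of rename is sketched below in Python. `resolve_method` and `FakeVirtualMachinesApi` are hypothetical stand-ins; only the two method names come from the traceback, and the real methods' arguments and return values are not shown in this thread.

```python
from typing import Any, Callable, Sequence


def resolve_method(obj: Any,
                   candidate_names: Sequence[str]) -> Callable[..., Any]:
    """Return the first callable attribute of obj found among the names."""
    for name in candidate_names:
        method = getattr(obj, name, None)
        if callable(method):
            return method
    raise AttributeError(
        f"{type(obj).__name__} has none of: {', '.join(candidate_names)}")


class FakeVirtualMachinesApi:
    """Stand-in for the SDK class; it mimics only the renamed method."""

    def list_vm_machine_types2(self):
        # Placeholder return value; the real response shape is unknown here.
        return ["placeholder-machine-type"]


api = FakeVirtualMachinesApi()
# Try the name from the failing call first, then the suggested name.
list_types = resolve_method(
    api, ["list_vm_machine_types", "list_vm_machine_types2"])
print(list_types())
```

This only illustrates the lookup; it does not patch SkyPilot, and the two SDK methods may differ in more than their names.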