skypilot-org / skypilot

SkyPilot: Run AI and batch jobs on any infra (Kubernetes or 12+ clouds). Get unified execution, cost savings, and high GPU availability via a simple interface.
https://skypilot.readthedocs.io
Apache License 2.0

[cudo] Provisioning failing with Invalid value for `count_vm_available` #3829

Open romilbhardwaj opened 3 months ago

romilbhardwaj commented 3 months ago

`sky launch` is failing with an error from the Cudo API. This has also been reported by users on Slack.

(base) ➜  ~ sky launch -c vm --cloud cudo --gpus A6000
I 08-13 07:49:33 optimizer.py:691] == Optimizer ==
I 08-13 07:49:33 optimizer.py:714] Estimated cost: $0.8 / hour
I 08-13 07:49:33 optimizer.py:714]
I 08-13 07:49:33 optimizer.py:839] Considered resources (1 node):
I 08-13 07:49:33 optimizer.py:909] ----------------------------------------------------------------------------------------------------------
I 08-13 07:49:33 optimizer.py:909]  CLOUD   INSTANCE                      vCPUs   Mem(GB)   ACCELERATORS   REGION/ZONE   COST ($)   CHOSEN
I 08-13 07:49:33 optimizer.py:909] ----------------------------------------------------------------------------------------------------------
I 08-13 07:49:33 optimizer.py:909]  Cudo    epyc-rome-rtx-a6000_4x1v2gb   2       4         RTXA6000:1     no-luster-1   0.80          ✔
I 08-13 07:49:33 optimizer.py:909] ----------------------------------------------------------------------------------------------------------
I 08-13 07:49:33 optimizer.py:909]
I 08-13 07:49:33 optimizer.py:927] Multiple Cudo instances satisfy RTXA6000:1. The cheapest Cudo(epyc-rome-rtx-a6000_4x1v2gb, {'RTXA6000': 1}) is considered among:
I 08-13 07:49:33 optimizer.py:927] ['epyc-rome-rtx-a6000_4x1v2gb', 'epyc-rome-rtx-a6000_8x1v4gb', 'epyc-rome-rtx-a6000_24x1v12gb', 'epyc-rome-rtx-a6000_48x1v24gb'].
I 08-13 07:49:33 optimizer.py:927]
I 08-13 07:49:33 optimizer.py:933] To list more details, run 'sky show-gpus RTXA6000'.
Launching a new cluster 'vm'. Proceed? [Y/n]:
I 08-13 07:49:34 cloud_vm_ray_backend.py:4354] Creating a new cluster: 'vm' [1x Cudo(epyc-rome-rtx-a6000_4x1v2gb, {'RTXA6000': 1})].
I 08-13 07:49:34 cloud_vm_ray_backend.py:4354] Tip: to reuse an existing cluster, specify --cluster (-c). Run `sky status` to see existing clusters.
I 08-13 07:49:34 cloud_vm_ray_backend.py:1314] To view detailed progress: tail -n100 -f /Users/romilb/sky_logs/sky-2024-08-13-07-49-33-642961/provision.log
I 08-13 07:49:34 provisioner.py:65] Launching on Cudo no-luster-1 (all zones)
W 08-13 07:49:36 instance.py:89] run_instances: Invalid value for `count_vm_available`, must not be `None`

Full stack trace:

D 08-13 07:52:38 provisioner.py:171] Failed to provision 'vm' on Cudo (all zones).
D 08-13 07:52:38 provisioner.py:173] bulk_provision for 'vm' failed. Stacktrace:
D 08-13 07:52:38 provisioner.py:173] Traceback (most recent call last):
D 08-13 07:52:38 provisioner.py:173]   File "/Users/romilb/Romil/Berkeley/Research/sky-experiments/sky/provision/provisioner.py", line 165, in bulk_provision
D 08-13 07:52:38 provisioner.py:173]     return _bulk_provision(cloud, region, zones, cluster_name,
D 08-13 07:52:38 provisioner.py:173]   File "/Users/romilb/Romil/Berkeley/Research/sky-experiments/sky/provision/provisioner.py", line 87, in _bulk_provision
D 08-13 07:52:38 provisioner.py:173]     provision_record = provision.run_instances(provider_name,
D 08-13 07:52:38 provisioner.py:173]   File "/Users/romilb/Romil/Berkeley/Research/sky-experiments/sky/provision/__init__.py", line 47, in _wrapper
D 08-13 07:52:38 provisioner.py:173]     return impl(*args, **kwargs)
D 08-13 07:52:38 provisioner.py:173]   File "/Users/romilb/Romil/Berkeley/Research/sky-experiments/sky/provision/cudo/instance.py", line 86, in run_instances
D 08-13 07:52:38 provisioner.py:173]     cudo_wrapper.vm_available(to_start_count, gpu_count, gpu_model, region,
D 08-13 07:52:38 provisioner.py:173]   File "/Users/romilb/Romil/Berkeley/Research/sky-experiments/sky/provision/cudo/cudo_wrapper.py", line 133, in vm_available
D 08-13 07:52:38 provisioner.py:173]     types = api.list_vm_machine_types(mem,
D 08-13 07:52:38 provisioner.py:173]   File "/Users/romilb/tools/anaconda3/lib/python3.9/site-packages/cudo_compute/api/virtual_machines_api.py", line 1423, in list_vm_machine_types
D 08-13 07:52:38 provisioner.py:173]     (data) = self.list_vm_machine_types_with_http_info(memory_gib, vcpu, **kwargs)  # noqa: E501
D 08-13 07:52:38 provisioner.py:173]   File "/Users/romilb/tools/anaconda3/lib/python3.9/site-packages/cudo_compute/api/virtual_machines_api.py", line 1523, in list_vm_machine_types_with_http_info
D 08-13 07:52:38 provisioner.py:173]     return self.api_client.call_api(
D 08-13 07:52:38 provisioner.py:173]   File "/Users/romilb/tools/anaconda3/lib/python3.9/site-packages/cudo_compute/api_client.py", line 326, in call_api
D 08-13 07:52:38 provisioner.py:173]     return self.__call_api(resource_path, method,
D 08-13 07:52:38 provisioner.py:173]   File "/Users/romilb/tools/anaconda3/lib/python3.9/site-packages/cudo_compute/api_client.py", line 170, in __call_api
D 08-13 07:52:38 provisioner.py:173]     return_data = self.deserialize(response_data, response_type)
D 08-13 07:52:38 provisioner.py:173]   File "/Users/romilb/tools/anaconda3/lib/python3.9/site-packages/cudo_compute/api_client.py", line 242, in deserialize
D 08-13 07:52:38 provisioner.py:173]     return self.__deserialize(data, response_type)
D 08-13 07:52:38 provisioner.py:173]   File "/Users/romilb/tools/anaconda3/lib/python3.9/site-packages/cudo_compute/api_client.py", line 281, in __deserialize
D 08-13 07:52:38 provisioner.py:173]     return self.__deserialize_model(data, klass)
D 08-13 07:52:38 provisioner.py:173]   File "/Users/romilb/tools/anaconda3/lib/python3.9/site-packages/cudo_compute/api_client.py", line 627, in __deserialize_model
D 08-13 07:52:38 provisioner.py:173]     instance = klass(**kwargs)
D 08-13 07:52:38 provisioner.py:173]   File "/Users/romilb/tools/anaconda3/lib/python3.9/site-packages/cudo_compute/models/list_vm_machine_types_response.py", line 76, in __init__
D 08-13 07:52:38 provisioner.py:173]     self.count_vm_available = count_vm_available
D 08-13 07:52:38 provisioner.py:173]   File "/Users/romilb/tools/anaconda3/lib/python3.9/site-packages/cudo_compute/models/list_vm_machine_types_response.py", line 105, in count_vm_available
D 08-13 07:52:38 provisioner.py:173]     raise ValueError("Invalid value for `count_vm_available`, must not be `None`")  # noqa: E501
D 08-13 07:52:38 provisioner.py:173] ValueError: Invalid value for `count_vm_available`, must not be `None`
D 08-13 07:52:38 provisioner.py:173] 
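For readers following the trace: the last two frames show the generated response model assigning `count_vm_available` in `__init__` and its setter rejecting `None`. The sketch below is a simplified stand-in for that pattern, not the actual `cudo_compute` code; the `deserialize` helper and the `countVmAvailable` JSON key are assumptions made for illustration. It suggests the failure would occur whenever the API response omits that field.

```python
class ListVmMachineTypesResponse:
    """Simplified stand-in for the generated model (not the real cudo_compute class)."""

    def __init__(self, count_vm_available=None):
        self._count_vm_available = None
        # Assignment goes through the property setter below, as in the trace.
        self.count_vm_available = count_vm_available

    @property
    def count_vm_available(self):
        return self._count_vm_available

    @count_vm_available.setter
    def count_vm_available(self, value):
        # The setter treats the field as required and rejects None.
        if value is None:
            raise ValueError(
                "Invalid value for `count_vm_available`, must not be `None`")
        self._count_vm_available = value


def deserialize(payload):
    """Hypothetical helper: build the model from a decoded JSON dict.

    The key name `countVmAvailable` is an assumption for this sketch.
    """
    return ListVmMachineTypesResponse(
        count_vm_available=payload.get("countVmAvailable"))


if __name__ == "__main__":
    print(deserialize({"countVmAvailable": 3}).count_vm_available)
    try:
        deserialize({})  # field missing from the response
    except ValueError as e:
        print(f"ValueError: {e}")
```

If this is indeed the mechanism, the error is raised inside the client library during deserialization, before SkyPilot's own code ever sees the response.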

romilbhardwaj commented 3 months ago

cc @JungleCatSW Were there any recent changes to the Cudo API that may be causing this?

sarath7974 commented 1 week ago

I’m also experiencing this issue when launching VMs on Cudo Cloud. Is there any progress on resolving it?

Michaelvll commented 1 week ago

@sarath7974 Would you be able to test with this branch: https://github.com/skypilot-org/skypilot/pull/3841

If it works for you, we will merge it soon : )

sarath7974 commented 1 week ago

@Michaelvll Tested PR #3841, but encountered a different error.

ValueError: Invalid value for `network_id`, must not be `None`

During handling of the above exception, another exception occurred:

TypeError: NetworksApi.delete_network() missing 1 required positional argument: 'network_id'

The above exception was the direct cause of the following exception:

sky.provision.common.StopFailoverError: During provisioner's failover, terminating 'vm' failed. This can cause resource leakage. Please check the failure and the cluster status on the cloud, and manually terminate the cluster. Details: [TypeError] NetworksApi.delete_network() missing 1 required positional argument: 'network_id'
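The three messages above form a Python exception chain: "During handling of the above exception" marks an implicit context, while "The above exception was the direct cause" marks an explicit `raise ... from`. Below is a minimal sketch of one way such a chain can arise, assuming the cleanup path calls the delete function without an ID after network creation has failed. All names here (`create_network`, `delete_network`, `provision_unsafe`, `teardown`, and the local `StopFailoverError`) are hypothetical stand-ins, not the SkyPilot or Cudo implementations.

```python
class StopFailoverError(Exception):
    """Local stand-in for sky.provision.common.StopFailoverError."""


def create_network():
    # Hypothetical: mimics the model rejecting a response without network_id.
    raise ValueError("Invalid value for `network_id`, must not be `None`")


def delete_network(network_id):
    # Hypothetical: requires the ID of the network to delete.
    return f"deleted {network_id}"


def provision_unsafe():
    """Reproduces the shape of the reported chain."""
    try:
        create_network()
    except ValueError:
        try:
            # Deliberate bug: no ID is available, so the call fails.
            delete_network()
        except TypeError as e:
            # Explicit chaining produces "direct cause" in the traceback.
            raise StopFailoverError(f"[TypeError] {e}") from e


def teardown(network_id=None):
    """Guarded cleanup: skip deletion when no network was ever created."""
    if network_id is None:
        return "nothing to delete"
    return delete_network(network_id)


if __name__ == "__main__":
    try:
        provision_unsafe()
    except StopFailoverError as err:
        print(type(err.__cause__).__name__)              # TypeError
        print(type(err.__cause__.__context__).__name__)  # ValueError
    print(teardown())
```

Under this assumption, guarding the cleanup call as in `teardown` would avoid the secondary `TypeError`, leaving only the original `network_id` error to investigate.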