skypilot-org / skypilot

SkyPilot: Run AI and batch jobs on any infra (Kubernetes or 12+ clouds). Get unified execution, cost savings, and high GPU availability via a simple interface.
https://skypilot.readthedocs.io
Apache License 2.0
6.78k stars 509 forks

[Lambda] V100:8 multi-node provisioning fails #3880

Open romilbhardwaj opened 2 months ago

romilbhardwaj commented 2 months ago
$ sky launch -c lamb2 --num-nodes 2 --cloud lambda --gpus V100:8 -- echo hi

I 08-27 11:03:34 cloud_vm_ray_backend.py:1721] Launching on Lambda us-south-1
D 08-27 11:03:34 cloud_vm_ray_backend.py:207] `ray up` script: /var/folders/98/hhq8wrtx6y13196q61xphjsm0000gn/T/skypilot_ray_up_bq0lv6xt.py
D 08-27 11:07:04 cloud_vm_ray_backend.py:1792] `ray up` takes 210.3 seconds with 1 retries.
I 08-27 11:07:04 cloud_vm_ray_backend.py:729] ====== stdout ======
2024-08-27 11:03:34,446 INFO commands.py:311 -- Cluster: lamb2-2ea4
D 08-27 11:03:34 skypilot_config.py:194] Using config path: /Users/romilb/.sky/config.yaml
D 08-27 11:03:34 skypilot_config.py:198] Config loaded:
D 08-27 11:03:34 skypilot_config.py:198] {'allowed_clouds': ['kubernetes', 'aws', 'gcp', 'lambda', 'cudo']}
D 08-27 11:03:34 skypilot_config.py:210] Config syntax check passed.
2024-08-27 11:03:34,629 INFO commands.py:388 -- Checking External environment settings
2024-08-27 11:03:36,674 INFO commands.py:688 -- No head node found. Launching a new cluster. Confirm [y/N]: y [automatic, due to --yes]
2024-08-27 11:03:36,675 INFO usage_lib.py:441 -- Usage stats collection is disabled.
No head node exists, need to create it.
2024-08-27 11:03:36,675 INFO commands.py:747 -- Acquiring an up-to-date head node
2024-08-27 11:07:01,113 INFO commands.py:763 -- Launched a new head node
2024-08-27 11:07:01,113 INFO commands.py:767 -- Fetching the new head node
E 08-27 11:07:04 subprocess_utils.py:84]

I 08-27 11:07:04 cloud_vm_ray_backend.py:731] ====== stderr ======
2024-08-27 11:03:34,446 WARNING util.py:259 -- Dropping the empty legacy field head_node. head_nodeis not supported for ray>=2.0.0. It is recommended to removehead_node from the cluster config.
2024-08-27 11:03:34,446 WARNING util.py:259 -- Dropping the empty legacy field worker_nodes. worker_nodesis not supported for ray>=2.0.0. It is recommended to removeworker_nodes from the cluster config.
2024-08-27 11:03:34,446 INFO util.py:382 -- setting max workers for head node type to 0
Traceback (most recent call last):
  File "/var/folders/98/hhq8wrtx6y13196q61xphjsm0000gn/T/skypilot_ray_up_bq0lv6xt.py", line 77, in <module>
    sdk.create_or_update_cluster('/Users/romilb/.sky/generated/lamb2.yml', **{'no_restart': True})
  File "/Users/romilb/tools/anaconda3/lib/python3.9/site-packages/ray/autoscaler/sdk/sdk.py", line 38, in create_or_update_cluster
    return commands.create_or_update_cluster(
  File "/Users/romilb/tools/anaconda3/lib/python3.9/site-packages/ray/autoscaler/_private/commands.py", line 317, in create_or_update_cluster
    get_or_create_head_node(
  File "/Users/romilb/tools/anaconda3/lib/python3.9/site-packages/ray/autoscaler/_private/commands.py", line 773, in get_or_create_head_node
    nodes = provider.non_terminated_nodes(head_node_tags)
  File "/Users/romilb/Romil/Berkeley/Research/sky-experiments/sky/skylet/providers/lambda_cloud/node_provider.py", line 197, in non_terminated_nodes
    nodes = self._get_filtered_nodes(tag_filters=tag_filters)
  File "/Users/romilb/Romil/Berkeley/Research/sky-experiments/sky/skylet/providers/lambda_cloud/node_provider.py", line 38, in wrapper
    return f(self, *args, **kwargs)
  File "/Users/romilb/Romil/Berkeley/Research/sky-experiments/sky/skylet/providers/lambda_cloud/node_provider.py", line 180, in _get_filtered_nodes
    subprocess_utils.run_in_parallel(_get_internal_ip, nodes)
  File "/Users/romilb/Romil/Berkeley/Research/sky-experiments/sky/utils/subprocess_utils.py", line 65, in run_in_parallel
    return list(p.imap(func, args))
  File "/Users/romilb/tools/anaconda3/lib/python3.9/multiprocessing/pool.py", line 870, in next
    raise value
  File "/Users/romilb/tools/anaconda3/lib/python3.9/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/Users/romilb/Romil/Berkeley/Research/sky-experiments/sky/skylet/providers/lambda_cloud/node_provider.py", line 165, in _get_internal_ip
    subprocess_utils.handle_returncode(
  File "/Users/romilb/Romil/Berkeley/Research/sky-experiments/sky/utils/subprocess_utils.py", line 91, in handle_returncode
    raise exceptions.CommandError(returncode, command, format_err_msg,
sky.exceptions.CommandError: Command ip -4 -br addr show | grep UP | grep -Eo "(10\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)|172\.(1[6-9]|2[0-9]|3[0-1]))\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)" failed with return code 1.
Failed get obtain private IP from node

This also leaves SkyPilot in a bad state: sky down fails with the same error, so I have to manually terminate the instance and clean up the local SkyPilot state.db.
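For readers following the traceback, the failing command pipes `ip -4 -br addr show` through `grep UP` and a `grep -Eo` pattern that only accepts addresses in 10.0.0.0/8 and 172.16.0.0/12. grep exits with status 1 when nothing matches, so a return code of 1 is consistent with the node reporting no such address (or with the command producing no output at all). The following is a minimal Python sketch of that filter, not SkyPilot's implementation: `find_private_ips` is a hypothetical helper, the sample interface lines are made up, and Python's `re` is not identical to POSIX ERE.

```python
import re

# Same octet and range alternations as the grep -Eo pattern in the error message.
OCTET = r"(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)"
PRIVATE_IP = re.compile(
    r"(10\." + OCTET + r"|172\.(1[6-9]|2[0-9]|3[0-1]))\." + OCTET + r"\." + OCTET
)


def find_private_ips(ip_addr_output):
    """Hypothetical helper: mimic `grep UP | grep -Eo <pattern>` on text."""
    ips = []
    for line in ip_addr_output.splitlines():
        if "UP" not in line:  # equivalent of `grep UP`
            continue
        ips.extend(m.group(0) for m in PRIVATE_IP.finditer(line))
    return ips


# Made-up sample output in the brief format of `ip -4 -br addr show`.
with_private = (
    "lo               UNKNOWN        127.0.0.1/8\n"
    "eth0             UP             10.19.60.24/20\n"
)
public_only = "eth0             UP             129.146.1.2/24\n"

print(find_private_ips(with_private))
print(find_private_ips(public_only))
```

An empty result here corresponds to the pipeline exiting with status 1. Note that the pattern does not cover 192.168.0.0/16.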

sfrolich commented 2 months ago

I'm seeing the same thing with skypilot==0.6.1 when trying to start a cluster in Lambda Labs. After the error, the node(s) do exist in Lambda. Also, running "sky down" does not tear down the node(s).

romilbhardwaj commented 2 months ago

@sfrolich - What GPU are you using? Do you see this for GPUs other than V100?

sfrolich commented 2 months ago

> @sfrolich - What GPU are you using? Do you see this for GPUs other than V100?

I was using A10:1 in this particular test. Judging from where the error occurs, I presume it affects all Lambda VM types, but I could be wrong.

sfrolich commented 1 month ago

I found in the provision.log that there was a connection reset error before the "Failed get obtain private IP from node" error. I looked at the Lambda Labs Firewall page and found that someone in my org had removed the SSH:22 rule. Once I put it back, the error went away.
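If anyone else hits this, a quick way to rule out the firewall cause described above is to check whether TCP port 22 on the node accepts connections from your machine. This is only a rough sketch: `ssh_port_reachable` is a hypothetical helper, not part of SkyPilot, and a successful TCP connect does not prove that SSH authentication will work.

```python
import socket


def ssh_port_reachable(host, port=22, timeout=5.0):
    """Hypothetical helper: return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, resets, timeouts, and DNS failures.
        return False
```

Call it with the instance's public IP as shown in the Lambda console, for example `ssh_port_reachable("<instance-public-ip>")`. A False result suggests checking the firewall rules before retrying the launch.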