skypilot-org / skypilot

SkyPilot: Run AI and batch jobs on any infra (Kubernetes or 12+ clouds). Get unified execution, cost savings, and high GPU availability via a simple interface.
https://skypilot.readthedocs.io
Apache License 2.0
6.82k stars 513 forks

Custom image "Failed to launch the sky serve replica cluster with error: RuntimeError: Failed to SSH to 213.181.111.2 after timeout 600s, with Error: ConnectionRefusedError: [Errno 111] Connection refused)" #4282

Open alita-moore opened 2 weeks ago

alita-moore commented 2 weeks ago

I am trying to use a custom image_id to create a SkyPilot service running on RunPod. I am using a standard Docker image, but I am getting the following error.

I 11-07 00:28:36 cloud_vm_ray_backend.py:1505] ⚙︎ Launching on RunPod IS.
E 11-07 07:37:21 provisioner.py:582] ⨯ Failed to set up SkyPilot runtime on cluster.  View logs at: ~/sky_logs/sky-2024-11-06-22-50-06-084977/provision.log

I 11-07 07:37:21 replica_managers.py:121] Failed to launch the sky serve replica cluster with error: RuntimeError: Failed to SSH to 213.181.111.2 after timeout 600s, with Error: ConnectionRefusedError: [Errno 111] Connection refused)
I 11-07 07:37:21 replica_managers.py:124]   Traceback: Traceback (most recent call last):
I 11-07 07:37:21 replica_managers.py:124]   File "/home/ubuntu/skypilot-runtime/lib/python3.10/site-packages/sky/serve/replica_managers.py", line 98, in launch_cluster
I 11-07 07:37:21 replica_managers.py:124]     sky.launch(task,
I 11-07 07:37:21 replica_managers.py:124]   File "/home/ubuntu/skypilot-runtime/lib/python3.10/site-packages/sky/utils/common_utils.py", line 386, in _record
I 11-07 07:37:21 replica_managers.py:124]     return f(*args, **kwargs)
I 11-07 07:37:21 replica_managers.py:124]   File "/home/ubuntu/skypilot-runtime/lib/python3.10/site-packages/sky/utils/common_utils.py", line 386, in _record
I 11-07 07:37:21 replica_managers.py:124]     return f(*args, **kwargs)
I 11-07 07:37:21 replica_managers.py:124]   File "/home/ubuntu/skypilot-runtime/lib/python3.10/site-packages/sky/execution.py", line 455, in launch
I 11-07 07:37:21 replica_managers.py:124]     return _execute(
I 11-07 07:37:21 replica_managers.py:124]   File "/home/ubuntu/skypilot-runtime/lib/python3.10/site-packages/sky/execution.py", line 281, in _execute
I 11-07 07:37:21 replica_managers.py:124]     handle = backend.provision(task,
I 11-07 07:37:21 replica_managers.py:124]   File "/home/ubuntu/skypilot-runtime/lib/python3.10/site-packages/sky/utils/common_utils.py", line 386, in _record
I 11-07 07:37:21 replica_managers.py:124]     return f(*args, **kwargs)
I 11-07 07:37:21 replica_managers.py:124]   File "/home/ubuntu/skypilot-runtime/lib/python3.10/site-packages/sky/utils/common_utils.py", line 366, in _record
I 11-07 07:37:21 replica_managers.py:124]     return f(*args, **kwargs)
I 11-07 07:37:21 replica_managers.py:124]   File "/home/ubuntu/skypilot-runtime/lib/python3.10/site-packages/sky/backends/backend.py", line 60, in provision
I 11-07 07:37:21 replica_managers.py:124]     return self._provision(task, to_provision, dryrun, stream_logs,
I 11-07 07:37:21 replica_managers.py:124]   File "/home/ubuntu/skypilot-runtime/lib/python3.10/site-packages/sky/backends/cloud_vm_ray_backend.py", line 2841, in _provision
I 11-07 07:37:21 replica_managers.py:124]     cluster_info = provisioner.post_provision_runtime_setup(
I 11-07 07:37:21 replica_managers.py:124]   File "/home/ubuntu/skypilot-runtime/lib/python3.10/site-packages/sky/provision/provisioner.py", line 576, in post_provision_runtime_setup
I 11-07 07:37:21 replica_managers.py:124]     return _post_provision_setup(cloud_name,
I 11-07 07:37:21 replica_managers.py:124]   File "/home/ubuntu/skypilot-runtime/lib/python3.10/site-packages/sky/provision/provisioner.py", line 438, in _post_provision_setup
I 11-07 07:37:21 replica_managers.py:124]     wait_for_ssh(cluster_info, ssh_credentials)
I 11-07 07:37:21 replica_managers.py:124]   File "/home/ubuntu/skypilot-runtime/lib/python3.10/site-packages/sky/provision/provisioner.py", line 387, in wait_for_ssh
I 11-07 07:37:21 replica_managers.py:124]     _retry_ssh_thread((ip, ssh_port))
I 11-07 07:37:21 replica_managers.py:124]   File "/home/ubuntu/skypilot-runtime/lib/python3.10/site-packages/sky/provision/provisioner.py", line 377, in _retry_ssh_thread
I 11-07 07:37:21 replica_managers.py:124]     raise RuntimeError(
I 11-07 07:37:21 replica_managers.py:124] RuntimeError: Failed to SSH to 213.181.111.2 after timeout 600s, with Error: ConnectionRefusedError: [Errno 111] Connection refused
I 11-07 07:37:21 replica_managers.py:124] 
E 11-07 07:37:22 ux_utils.py:117] Failed to run launch_cluster. Details: RuntimeError: Failed to launch the sky serve replica cluster sky-service-200c-3 after 3 retries.
E 11-07 07:37:22 ux_utils.py:120]   Traceback:
E 11-07 07:37:22 ux_utils.py:120] Traceback (most recent call last):
E 11-07 07:37:22 ux_utils.py:120]   File "/home/ubuntu/skypilot-runtime/lib/python3.10/site-packages/sky/utils/ux_utils.py", line 115, in run
E 11-07 07:37:22 ux_utils.py:120]     self.func(*args, **kwargs)
E 11-07 07:37:22 ux_utils.py:120]   File "/home/ubuntu/skypilot-runtime/lib/python3.10/site-packages/sky/serve/replica_managers.py", line 130, in launch_cluster
E 11-07 07:37:22 ux_utils.py:120]     raise RuntimeError('Failed to launch the sky serve replica cluster '
E 11-07 07:37:22 ux_utils.py:120] RuntimeError: Failed to launch the sky serve replica cluster sky-service-200c-3 after 3 retries.
E 11-07 07:37:22 ux_utils.py:120] 
I 11-07 07:37:30 replica_managers.py:155] Replica cluster sky-service-200c-3 is already terminated.

Should I be installing some dependencies or running a server on the replica / image? I noticed that ports 8266 and 6380 are exposed, but I don't have any services running on those ports, and I didn't update the SSH public keys or anything similar.
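For anyone debugging the same "Connection refused" error, here is a minimal sketch for checking whether anything is accepting TCP connections on a given host and port. The helper name `port_accepts_connections` is made up for illustration and is not part of SkyPilot; it only uses Python's standard `socket` module.

```python
import socket


def port_accepts_connections(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds.

    Hypothetical diagnostic helper, not a SkyPilot API.
    """
    try:
        # create_connection raises OSError subclasses on failure,
        # including ConnectionRefusedError and timeouts.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Calling it as `port_accepts_connections("213.181.111.2", 22)` would probe the replica from the log above. Port 22 is an assumption here; the SSH port that RunPod actually maps for the pod may differ. A `False` result only says that no connection was accepted; it does not say why.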

Version & Commit info:

alita-moore commented 2 weeks ago

It seems the issue is that I had a non-root default user. I had to install sudo for the init script to run, but then SSH was trying to sign in as root, so it wouldn't connect. Is it possible to change the SSH user? That would be better in terms of security.

Michaelvll commented 1 week ago

Thanks for reporting!

> ports 8266,6380

We don't need to expose those ports. It should be possible to remove them: https://github.com/skypilot-org/skypilot/blob/master/sky/provision/runpod/utils.py#L157-L158

> It seems the issue is that I had a non-root default user. I had to install sudo for the init script to run, but then SSH was trying to sign in as root, so it wouldn't connect. Is it possible to change the SSH user? That would be better in terms of security.

We have to find a way to get the username in the Docker image and add it to the ClusterInfo for the cluster created on RunPod here: https://github.com/skypilot-org/skypilot/blob/master/sky/provision/runpod/instance.py#L180-L186
For reference, this is how we do it for Kubernetes: https://github.com/skypilot-org/skypilot/blob/master/sky/provision/kubernetes/instance.py#L891-L920
I suppose we can do something similar for RunPod by using their cloud API to fetch the username.

We would really appreciate your contribution to these two issues.
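A rough sketch of the username lookup described above, assuming the default user can be found by running `whoami` inside the container. Everything here is hypothetical: `ClusterInfoSketch` is a stand-in and not SkyPilot's real `ClusterInfo`, and `run_in_container` is a placeholder for whatever mechanism (if any) RunPod offers to execute a command in the pod.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ClusterInfoSketch:
    """Hypothetical stand-in for SkyPilot's ClusterInfo; not the real class."""
    head_ip: str
    ssh_user: str


def detect_user(run_in_container: Callable[[List[str]], str]) -> str:
    """Return the container's default user by running `whoami` in it.

    `run_in_container` is an assumed callable that executes a command
    inside the container and returns its stdout as a string.
    """
    user = run_in_container(["whoami"]).strip()
    if not user:
        raise RuntimeError("Could not determine the container's default user.")
    return user
```

With a runner in hand, the result would be passed along as `ClusterInfoSketch(head_ip=ip, ssh_user=detect_user(runner))`, so that the later SSH step logs in as that user instead of assuming root.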

alita-moore commented 1 week ago

I think the easiest way to do this would just be to use a special environment variable that defines what the docker user should be. The benefit of it being automatic seems small compared to the complexity.

I'll take a look the next time I'm working on the skypilot side, but I'm unfortunately quite bandwidth-constrained right now.

Michaelvll commented 1 week ago

> I think the easiest way to do this would just be to use a special environment variable that defines what the docker user should be. The benefit of it being automatic seems small compared to the complexity.

Ahh, this makes sense. It might be worth using a SKYPILOT_DOCKER_SSH_USERNAME, as we did for the login password. cc'ing @cblmemo
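A minimal sketch of the environment-variable approach, assuming the variable is named `SKYPILOT_DOCKER_SSH_USERNAME` as proposed above and that the fallback is `root`, which matches the behavior reported earlier in this thread. The helper `resolve_ssh_user` is hypothetical and does not exist in SkyPilot as far as this thread shows.

```python
import os
from typing import Mapping, Optional

# Name proposed in this thread; not confirmed to be implemented.
SSH_USER_ENV_VAR = "SKYPILOT_DOCKER_SSH_USERNAME"
# Assumed default, based on the report that SSH signs in as root.
DEFAULT_SSH_USER = "root"


def resolve_ssh_user(task_envs: Optional[Mapping[str, str]] = None) -> str:
    """Pick the SSH user from the given envs, falling back to root.

    If `task_envs` is None, the current process environment is used.
    """
    envs = os.environ if task_envs is None else task_envs
    user = envs.get(SSH_USER_ENV_VAR, "").strip()
    # An unset or blank value falls back to the default user.
    return user or DEFAULT_SSH_USER
```

For example, `resolve_ssh_user({"SKYPILOT_DOCKER_SSH_USERNAME": "appuser"})` returns `"appuser"`, while `resolve_ssh_user({})` returns `"root"`. The trade-off matches the discussion: the user has to set the value explicitly, but no call to the cloud API is needed to discover it.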