skypilot-org / skypilot

SkyPilot: Run AI and batch jobs on any infra (Kubernetes or 12+ clouds). Get unified execution, cost savings, and high GPU availability via a simple interface.
https://skypilot.readthedocs.io
Apache License 2.0
6.48k stars, 462 forks

stuck "Waiting for SSH access" state when use k8s #3752

Open shengkaixuan opened 1 month ago

shengkaixuan commented 1 month ago

The first time I used SkyPilot with Kubernetes (k8s) as the cloud, it went well. Now it successfully creates the pod but gets stuck in the "⠦ Launching - Waiting for SSH access" state.

Task from YAML spec: test.yaml
I 07-15 09:10:17 optimizer.py:695] == Optimizer ==
I 07-15 09:10:17 optimizer.py:718] Estimated cost: $0.0 / hour
I 07-15 09:10:17 optimizer.py:718]
I 07-15 09:10:17 optimizer.py:843] Considered resources (1 node):
I 07-15 09:10:17 optimizer.py:913] ---------------------------------------------------------------------------------------------
I 07-15 09:10:17 optimizer.py:913]  CLOUD        INSTANCE    vCPUs   Mem(GB)   ACCELERATORS   REGION/ZONE   COST ($)   CHOSEN
I 07-15 09:10:17 optimizer.py:913] ---------------------------------------------------------------------------------------------
I 07-15 09:10:17 optimizer.py:913]  Kubernetes   2CPU--2GB   2       2         -              kubernetes    0.00       ✔
I 07-15 09:10:17 optimizer.py:913] ---------------------------------------------------------------------------------------------
I 07-15 09:10:17 optimizer.py:913]
Launching a new cluster 'sheng'. Proceed? [Y/n]: Y
I 07-15 09:10:19 cloud_vm_ray_backend.py:4411] Creating a new cluster: 'sheng' [1x Kubernetes(2CPU--2GB)].
I 07-15 09:10:19 cloud_vm_ray_backend.py:4411] Tip: to reuse an existing cluster, specify --cluster (-c). Run `sky status` to see existing clusters.
I 07-15 09:10:19 cloud_vm_ray_backend.py:1406] To view detailed progress: tail -n100 -f /root/sky_logs/sky-2024-07-15-09-10-17-195307/provision.log
I 07-15 09:10:21 provisioner.py:73] Launching on Kubernetes 'sheng'.
⠦ Launching - Waiting for SSH access

Here are the logs:

D 07-15 09:11:28 provisioner.py:409] Retrying in 1 second...
D 07-15 09:11:29 provisioner.py:354] Waiting for SSH using command: ssh -T -i '~/.ssh/sky-key' sky@10.42.0.233 -p 22 -o StrictHostKeyChecking=no -o PasswordAuthentication=no -o ConnectTimeout=10s -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o 'ProxyCommand=ssh -tt -i /root/.ssh/sky-key -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -W %h:%p sky@127.0.0.1 -o ProxyCommand='"'"'/root/.sky/kubernetes-port-forward-proxy-command.sh sheng-4756-head'"'"' ' uptime
D 07-15 09:11:39 provisioner.py:369] Waiting for SSH using command: ssh -T -i '~/.ssh/sky-key' sky@10.42.0.233 -p 22 -o StrictHostKeyChecking=no -o PasswordAuthentication=no -o ConnectTimeout=10s -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o 'ProxyCommand=ssh -tt -i /root/.ssh/sky-key -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -W %h:%p sky@127.0.0.1 -o ProxyCommand='"'"'/root/.sky/kubernetes-port-forward-proxy-command.sh sheng-4756-head'"'"' ' uptime
Error: Command '['ssh', '-T', '-i', '~/.ssh/sky-key', 'sky@10.42.0.233', '-p', '22', '-o', 'StrictHostKeyChecking=no', '-o', 'PasswordAuthentication=no', '-o', 'ConnectTimeout=10s', '-o', 'UserKnownHostsFile=/dev/null', '-o', 'IdentitiesOnly=yes', '-o', 'ExitOnForwardFailure=yes', '-o', 'ServerAliveInterval=5', '-o', 'ServerAliveCountMax=3', '-o', "ProxyCommand=ssh -tt -i /root/.ssh/sky-key -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -W %h:%p sky@127.0.0.1 -o ProxyCommand='/root/.sky/kubernetes-port-forward-proxy-command.sh sheng-4756-head' ", 'uptime']' timed out after 10 seconds
D 07-15 09:11:39 provisioner.py:409] Retrying in 1 second...
D 07-15 09:11:40 provisioner.py:354] Waiting for SSH using command: ssh -T -i '~/.ssh/sky-key' sky@10.42.0.233 -p 22 -o StrictHostKeyChecking=no -o PasswordAuthentication=no -o ConnectTimeout=10s -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o 'ProxyCommand=ssh -tt -i /root/.ssh/sky-key -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -W %h:%p sky@127.0.0.1 -o ProxyCommand='"'"'/root/.sky/kubernetes-port-forward-proxy-command.sh sheng-4756-head'"'"' ' uptime

How can I solve this problem?

romilbhardwaj commented 1 month ago

Hi @shengkaixuan - what commit and sky version are you on? sky -v and sky -c. Can you try on the latest master branch?

Any details on the Kubernetes cluster (e.g., GKE/EKS/K3s/Kind...) to help me reproduce would be helpful.

To debug:

shengkaixuan commented 1 month ago

Thanks for your reply.

I used a k3s cluster, v1.28.11+k3s2.
sky version: skypilot, version 0.6.0

The pod is healthy:
sheng-4756-head 1/1 Running 0 51m

Port forward:
(sky) root@KVM:~/skytest# kubectl port-forward pod/sheng-4756-head :22
Forwarding from 127.0.0.1:43261 -> 22
Forwarding from [::1]:43261 -> 22

This is where it fails:

(sky) root@KVM:~/skytest# ssh -T -i '~/.ssh/sky-key' sky@10.42.0.233 -p 22 -o StrictHostKeyChecking=no -o PasswordAuthentication=no -o ConnectTimeout=10s -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o 'ProxyCommand=ssh -tt -i /root/.ssh/sky-key -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -W %h:%p sky@127.0.0.1 -o ProxyCommand='"'"'/root/.sky/kubernetes-port-forward-proxy-command.sh sheng-4756-head'"'"' ' uptime
Warning: Permanently added '127.0.0.1' (ED25519) to the list of known hosts.
sky@127.0.0.1's password:
Connection timed out during banner exchange
Connection to UNKNOWN port 65535 timed out

It asks for a password, which I do not know. After a short while, it fails with "Connection timed out".

I hope you can help. Thanks.

Note: the first time I used sky in my cluster, it went well.
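The manual `ssh` attempt above ends with "Connection timed out during banner exchange", which suggests a connection opens somewhere along the proxy chain but no SSH identification line arrives in time. One way to narrow this down, offered as a minimal sketch and not as anything SkyPilot itself does, is to keep `kubectl port-forward` running as shown above and read the first line the server sends on the forwarded local port. The function name `read_ssh_banner` is hypothetical, and the sketch assumes the server sends its `SSH-` identification line first (the SSH protocol permits other lines before it, which this check would report as a failure).

```python
import socket
from typing import Optional


def read_ssh_banner(host: str, port: int, timeout: float = 10.0) -> Optional[str]:
    """Return the server's SSH identification line, or None if none arrives.

    Assumption: the first line the server sends is the "SSH-..." banner.
    """
    data = b""
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.settimeout(timeout)
            # Read until the first newline; identification lines are at most 255 bytes.
            while b"\n" not in data and len(data) < 255:
                chunk = sock.recv(255 - len(data))
                if not chunk:  # peer closed the connection
                    break
                data += chunk
    except OSError:
        # Covers refused connections and timeouts on connect or recv.
        return None
    first_line = data.split(b"\n", 1)[0].decode("ascii", errors="replace").strip()
    return first_line if first_line.startswith("SSH-") else None
```

With the port-forward from the earlier comment active, `read_ssh_banner("127.0.0.1", 43261)` would probe the forwarded port. A `None` result points at the path between the local machine and the pod's port 22 and not at key or password handling.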

romilbhardwaj commented 1 month ago

Hi @shengkaixuan - I'm not able to reproduce this on 0.6.0 on a KinD cluster. We recently updated the SSH jump pod mechanics and the way SkyPilot interacts with pods. To get these fixes, I would recommend updating to the latest master branch with:

pip uninstall skypilot
pip install "skypilot-nightly[kubernetes]"

Can you try with the latest nightly and see if the issue still persists?

shengkaixuan commented 1 month ago

No luck, still stuck.

I have no idea how to deal with the following:

`D 07-17 11:10:52 provisioner.py:398] Retrying in 1 second...

D 07-17 11:10:53 provisioner.py:343] Waiting for SSH using command: ssh -T -i '~/.ssh/sky-key' sky@10.42.0.34 -p 22 -o StrictHostKeyChecking=no -o PasswordAuthentication=no -o ConnectTimeout=10s -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o 'ProxyCommand=ssh -tt -i /root/.ssh/sky-key -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -W %h:%p sky@127.0.0.1 -o ProxyCommand='"'"'/root/.sky/kubernetes-port-forward-proxy-command.sh test-yaml-4756-head'"'"' ' uptime`

shengkaixuan commented 1 month ago

Hi @romilbhardwaj, I found a workaround. The VM where I run the sky CLI is one of the k3s worker nodes. If the head pod that sky launches is on the same node where I run the sky CLI, it works fine. If the head pod is launched on another VM node, it gets stuck in the "Waiting for SSH access" state.
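To confirm that observation on a given launch, a small check like the following could compare the node the head pod was scheduled on with the machine running the sky CLI. This is only a sketch: the helper names are hypothetical, the `default` namespace is a guess, and it assumes the k3s node name matches the machine's hostname, which may not hold if the node was registered under a custom name.

```python
import socket
import subprocess
from typing import List


def pod_node_command(pod_name: str, namespace: str = "default") -> List[str]:
    """Build the kubectl command that prints the node a pod is scheduled on."""
    return ["kubectl", "get", "pod", pod_name, "-n", namespace,
            "-o", "jsonpath={.spec.nodeName}"]


def pod_node(pod_name: str, namespace: str = "default") -> str:
    """Run kubectl and return the pod's node name (requires cluster access)."""
    result = subprocess.run(pod_node_command(pod_name, namespace),
                            capture_output=True, text=True, check=True)
    return result.stdout.strip()


def same_node(node_name: str, local_hostname: str) -> bool:
    """Compare short hostnames case-insensitively.

    Assumption: the Kubernetes node name equals the machine's hostname.
    """
    return node_name.split(".")[0].lower() == local_hostname.split(".")[0].lower()


def head_pod_is_local(pod_name: str, namespace: str = "default") -> bool:
    """True if the pod runs on the machine executing this script."""
    return same_node(pod_node(pod_name, namespace), socket.gethostname())
```

For the pod in this thread, `head_pod_is_local("sheng-4756-head")` returning `False` on a stuck launch would match the behavior described above and would point toward cross-node pod networking as the thing to inspect.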