skypilot-org / skypilot

SkyPilot: Run AI and batch jobs on any infra (Kubernetes or 12+ clouds). Get unified execution, cost savings, and high GPU availability via a simple interface.
https://skypilot.readthedocs.io
Apache License 2.0

[k8s] SSH ProxyCommand script is not concurrency-safe #2628

Closed: romilbhardwaj closed this issue 11 months ago

romilbhardwaj commented 11 months ago

Our SSH ProxyCommand script, which uses socat + kubectl port-forward, breaks when many connections are created in parallel.

Repro:

  1. sky local up
  2. sky launch -c myclus --cloud kubernetes -y
  3. Run this python:
    
    def run_command_in_parallel(cmd, num_times, poolsize=8):
        """Run a command multiple times in parallel."""
        import subprocess
        from multiprocessing import pool
        def _exec(i):
            subprocess.run(cmd, check=True)
        with pool.ThreadPool(poolsize) as p:
            list(p.imap(_exec, range(num_times)))

Works fine with 5 parallel connections:

    run_command_in_parallel(['ssh','myclus','echo hi'], 5, 5)

Fails with 10 parallel connections:

    run_command_in_parallel(['ssh','myclus','echo hi'], 10, 10)


Logs:

Connection to 127.0.0.1 port 36130 [tcp/] succeeded!
Connection to 127.0.0.1 port 28899 [tcp/] succeeded!
Connection to 127.0.0.1 port 25393 [tcp/] succeeded!
Connection to 127.0.0.1 port 37811 [tcp/] succeeded!
Connection to 127.0.0.1 port 23623 [tcp/] succeeded!
Connection to 127.0.0.1 port 25562 [tcp/] succeeded!
Connection to 127.0.0.1 port 20357 [tcp/] succeeded!
Connection to 127.0.0.1 port 39773 [tcp/] succeeded!
E0929 09:54:37.461697 58402 portforward.go:394] error copying from local connection to remote stream: read tcp4 127.0.0.1:39773->127.0.0.1:53535: read: connection reset by peer
Connection to 127.0.0.1 port 35542 [tcp/] succeeded!
Connection to 127.0.0.1 port 20450 [tcp/] succeeded!
Warning: Permanently added '[127.0.0.1]:23100' (ED25519) to the list of known hosts.
Warning: Permanently added '[127.0.0.1]:23100' (ED25519) to the list of known hosts.
Warning: Permanently added '[127.0.0.1]:23100' (ED25519) to the list of known hosts.
Warning: Permanently added '[127.0.0.1]:23100' (ED25519) to the list of known hosts.
Warning: Permanently added '[127.0.0.1]:23100' (ED25519) to the list of known hosts.
Warning: Permanently added '10.244.0.14' (ED25519) to the list of known hosts.
Warning: Permanently added '10.244.0.14' (ED25519) to the list of known hosts.
Warning: Permanently added '10.244.0.14' (ED25519) to the list of known hosts.
Warning: Permanently added '10.244.0.14' (ED25519) to the list of known hosts.
Warning: Permanently added '10.244.0.14' (ED25519) to the list of known hosts.
hi
hi
hi
hi
hi
kex_exchange_identification: Connection closed by remote host
kex_exchange_identification: Connection closed by remote host
Connection closed by UNKNOWN port 65535
kex_exchange_identification: Connection closed by remote host
Connection closed by UNKNOWN port 65535
Connection closed by UNKNOWN port 65535
kex_exchange_identification: Connection closed by remote host
kex_exchange_identification: Connection closed by remote host
Connection closed by UNKNOWN port 65535
Connection closed by UNKNOWN port 65535
kex_exchange_identification: Connection closed by remote host
kex_exchange_identification: Connection closed by remote host
Connection closed by UNKNOWN port 65535
Connection closed by UNKNOWN port 65535
kex_exchange_identification: Connection closed by remote host
Connection closed by UNKNOWN port 65535
kex_exchange_identification: Connection closed by remote host
Connection closed by UNKNOWN port 65535
kex_exchange_identification: Connection closed by remote host
Connection closed by UNKNOWN port 65535
Traceback (most recent call last):
  File "", line 1, in
  File "", line 8, in run_command_in_parallel
  File "/Users/romilb/tools/anaconda3/lib/python3.9/multiprocessing/pool.py", line 870, in next
    raise value
  File "/Users/romilb/tools/anaconda3/lib/python3.9/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "", line 6, in _exec
  File "/Users/romilb/tools/anaconda3/lib/python3.9/subprocess.py", line 528, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ssh', 'myclus', 'echo hi']' returned non-zero exit status 255.
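
The repro helper raises on the first failed run, which hides how many of the parallel connections actually broke. A minimal sketch of a variant that tallies failures instead may help when looking for the concurrency level at which things start to fail. `count_failures` is a hypothetical name introduced here for illustration; it is not part of SkyPilot.

```python
import subprocess
from multiprocessing import pool


def count_failures(cmd, num_times, poolsize=8):
    """Run `cmd` `num_times` in parallel and return how many runs failed."""

    def _exit_code(_):
        # check=False: record the exit status instead of raising on failure.
        return subprocess.run(cmd, check=False).returncode

    with pool.ThreadPool(poolsize) as p:
        codes = p.map(_exit_code, range(num_times))
    # A non-zero exit status (ssh returns 255 on connection errors) is a failure.
    return sum(1 for code in codes if code != 0)
```

For example, `count_failures(['ssh', 'myclus', 'echo hi'], 10, 10)` would report how many of the 10 parallel SSH sessions exited with a non-zero status.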



Possibly related to: https://github.com/kubernetes/kubernetes/issues/74551

romilbhardwaj commented 11 months ago

After more investigation:

  1. SSHing into a GCP cluster also fails with 15 parallel SSH connections (run_command_in_parallel(['ssh','gcpclus','echo hi'], 15, 15)), so perhaps having this work for 10 is a safe baseline.
  2. Randomizing the local_port did not help.
  3. Updating the trap command to check that the process exists before killing it did not help either.
  4. Digging deeper, it appears kubectl port-forward sometimes closes the connection and exits when it shouldn't. This looks related to https://github.com/kubernetes/kubectl/issues/1169:
(base) ➜  ~ kubectl port-forward svc/sky-ssh-jump-2ea485ef :22
Forwarding from 127.0.0.1:63789 -> 22
Forwarding from [::1]:63789 -> 22
Handling connection for 63789

# ... sent an SSH banner request with `echo -e "\n" | nc localhost 63789`; got `SSH-2.0-OpenSSH_8.4p1 Debian-5+deb11u1`, then `Invalid SSH identification string.`
# But kubectl port-forward exited, though it shouldn't have:

E0929 15:36:37.002483    5513 portforward.go:409] an error occurred forwarding 63789 -> 22: error forwarding port 22 to pod 972148c003a81da72166efb29c50940fe0f1bd77aaa2a5a7e199fc20a7689899, uid : failed to execute portforward in network namespace "/var/run/netns/cni-8dafb022-a4da-6422-2e16-5ad83082f201": read tcp4 127.0.0.1:49948->127.0.0.1:22: read: connection reset by peer
error: lost connection to pod
  5. Curiously, adding a randomized sleep after running kubectl port-forward works.
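
The issue does not show how the randomized sleep was added to the ProxyCommand script. As a rough illustration of the idea only, here is a Python sketch that starts a background command and then pauses for a random interval before the caller goes on to connect. `start_with_jitter` and the `max_jitter_s` default are assumptions made for this example, not values taken from SkyPilot.

```python
import random
import subprocess
import time


def start_with_jitter(cmd, max_jitter_s=1.0):
    """Start `cmd` in the background, then sleep for a random interval.

    Returns (process, delay). The caller connects to the forwarded port
    after this returns and is responsible for terminating the process.
    """
    proc = subprocess.Popen(cmd)
    # Hypothetical jitter bound: spreads out parallel callers so they do
    # not all hit the freshly started port-forward at the same instant.
    delay = random.uniform(0.0, max_jitter_s)
    time.sleep(delay)
    return proc, delay
```

With the service name from the output above, a call might look like `start_with_jitter(['kubectl', 'port-forward', 'svc/sky-ssh-jump-2ea485ef', ':22'])`. Why the delay helps is not established in this thread, so treat it as a workaround rather than a root-cause fix.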