Closed romilbhardwaj closed 11 months ago
After more investigation:
run_command_in_parallel(['ssh','gcpclus','echo hi'], 15, 15)
), so perhaps having this work for 10 is a safe baseline.
local_port
, didn't help.
(base) ➜ ~ kubectl port-forward svc/sky-ssh-jump-2ea485ef :22
Forwarding from 127.0.0.1:63789 -> 22
Forwarding from [::1]:63789 -> 22
Handling connection for 63789
# ... sent an SSH banner request with `echo -e "\n" | nc localhost 63789`, got SSH-2.0-OpenSSH_8.4p1 Debian-5+deb11u1 Invalid SSH identification string.
# But kubectl port-forward exited, though it shouldn't have:
E0929 15:36:37.002483 5513 portforward.go:409] an error occurred forwarding 63789 -> 22: error forwarding port 22 to pod 972148c003a81da72166efb29c50940fe0f1bd77aaa2a5a7e199fc20a7689899, uid : failed to execute portforward in network namespace "/var/run/netns/cni-8dafb022-a4da-6422-2e16-5ad83082f201": read tcp4 127.0.0.1:49948->127.0.0.1:22: read: connection reset by peer
error: lost connection to pod
kubectl port-forward
works.
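The single-connection banner check above can be repeated concurrently to see how many simultaneous connections one `kubectl port-forward` process tolerates, independent of `ssh` and `socat`. The following is a minimal sketch, not code from this issue: `read_ssh_banner` and `probe_parallel` are hypothetical helper names, and it assumes a forward is already listening on a local port.

```python
import socket
from concurrent.futures import ThreadPoolExecutor


def read_ssh_banner(host, port, timeout=5.0):
    """Connect and return the server's first line (its SSH identification string)."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        data = b""
        # The server may send more bytes right after the banner, so stop at
        # the first newline instead of waiting for the stream to end.
        while b"\n" not in data and len(data) < 255:
            chunk = sock.recv(255 - len(data))
            if not chunk:
                break
            data += chunk
    if not data:
        # Corresponds to "Connection closed by remote host" on the ssh side.
        raise ConnectionError("connection closed before any banner was sent")
    return data.split(b"\n", 1)[0].decode("ascii", errors="replace").strip()


def probe_parallel(host, port, num_connections):
    """Open num_connections concurrently; return (banners, errors)."""

    def _probe(_):
        try:
            return read_ssh_banner(host, port)
        except OSError as exc:  # covers timeouts, resets, and refused connections
            return exc

    with ThreadPoolExecutor(max_workers=num_connections) as pool:
        results = list(pool.map(_probe, range(num_connections)))
    banners = [r for r in results if isinstance(r, str)]
    errors = [r for r in results if isinstance(r, Exception)]
    return banners, errors
```

With the forward from the log above still running, `probe_parallel("127.0.0.1", 63789, 10)` would show how many of the 10 connections received a banner and which exceptions the rest hit.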
Our SSH ProxyCommand script, which uses `socat` + `kubectl port-forward`, breaks when many connections are created in parallel.
Repro:
sky local up
sky launch -c myclus --cloud kubernetes -y
Works fine:
run_command_in_parallel(['ssh','myclus','echo hi'], 5, 5)
Error:
run_command_in_parallel(['ssh','myclus','echo hi'], 10, 10)
Connection to 127.0.0.1 port 36130 [tcp/] succeeded!
Connection to 127.0.0.1 port 28899 [tcp/] succeeded!
Connection to 127.0.0.1 port 25393 [tcp/] succeeded!
Connection to 127.0.0.1 port 37811 [tcp/] succeeded!
Connection to 127.0.0.1 port 23623 [tcp/] succeeded!
Connection to 127.0.0.1 port 25562 [tcp/] succeeded!
Connection to 127.0.0.1 port 20357 [tcp/] succeeded!
Connection to 127.0.0.1 port 39773 [tcp/] succeeded!
E0929 09:54:37.461697 58402 portforward.go:394] error copying from local connection to remote stream: read tcp4 127.0.0.1:39773->127.0.0.1:53535: read: connection reset by peer
Connection to 127.0.0.1 port 35542 [tcp/] succeeded!
Connection to 127.0.0.1 port 20450 [tcp/] succeeded!
Warning: Permanently added '[127.0.0.1]:23100' (ED25519) to the list of known hosts.
Warning: Permanently added '[127.0.0.1]:23100' (ED25519) to the list of known hosts.
Warning: Permanently added '[127.0.0.1]:23100' (ED25519) to the list of known hosts.
Warning: Permanently added '[127.0.0.1]:23100' (ED25519) to the list of known hosts.
Warning: Permanently added '[127.0.0.1]:23100' (ED25519) to the list of known hosts.
Warning: Permanently added '10.244.0.14' (ED25519) to the list of known hosts.
Warning: Permanently added '10.244.0.14' (ED25519) to the list of known hosts.
Warning: Permanently added '10.244.0.14' (ED25519) to the list of known hosts.
Warning: Permanently added '10.244.0.14' (ED25519) to the list of known hosts.
Warning: Permanently added '10.244.0.14' (ED25519) to the list of known hosts.
hi
hi
hi
hi
hi
kex_exchange_identification: Connection closed by remote host
kex_exchange_identification: Connection closed by remote host
Connection closed by UNKNOWN port 65535
kex_exchange_identification: Connection closed by remote host
Connection closed by UNKNOWN port 65535
Connection closed by UNKNOWN port 65535
kex_exchange_identification: Connection closed by remote host
kex_exchange_identification: Connection closed by remote host
Connection closed by UNKNOWN port 65535
Connection closed by UNKNOWN port 65535
kex_exchange_identification: Connection closed by remote host
kex_exchange_identification: Connection closed by remote host
Connection closed by UNKNOWN port 65535
Connection closed by UNKNOWN port 65535
kex_exchange_identification: Connection closed by remote host
Connection closed by UNKNOWN port 65535
kex_exchange_identification: Connection closed by remote host
Connection closed by UNKNOWN port 65535
kex_exchange_identification: Connection closed by remote host
Connection closed by UNKNOWN port 65535
Traceback (most recent call last):
  File "", line 1, in
  File "", line 8, in run_command_in_parallel
  File "/Users/romilb/tools/anaconda3/lib/python3.9/multiprocessing/pool.py", line 870, in next
    raise value
  File "/Users/romilb/tools/anaconda3/lib/python3.9/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "", line 6, in _exec
  File "/Users/romilb/tools/anaconda3/lib/python3.9/subprocess.py", line 528, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ssh', 'myclus', 'echo hi']' returned non-zero exit status 255.
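The `run_command_in_parallel` helper used in the repro is not shown in this issue. The traceback suggests its rough shape: a nested `_exec` function, a pool from `multiprocessing/pool.py`, and `subprocess.run` raising `CalledProcessError`. Below is a minimal reconstruction under those assumptions. The use of `ThreadPool`, the parameter names, and the reading of the two integers as "number of runs" and "number of workers" are guesses, not the author's code.

```python
import subprocess
from multiprocessing.pool import ThreadPool


def run_command_in_parallel(cmd, num_runs, num_workers):
    """Run `cmd` num_runs times with at most num_workers running at once.

    Hypothetical reconstruction: argument meanings are assumed.
    """

    def _exec(_):
        # check=True raises CalledProcessError on a non-zero exit status;
        # ssh exits with 255 when the connection itself fails.
        return subprocess.run(cmd, check=True)

    # A thread pool is used because a nested function cannot be pickled
    # for a process pool.
    with ThreadPool(num_workers) as pool:
        # Iterating the result re-raises the first worker exception here.
        return [r.returncode for r in pool.imap_unordered(_exec, range(num_runs))]
```

Under this reading, `run_command_in_parallel(['ssh', 'myclus', 'echo hi'], 10, 10)` starts 10 `ssh` processes at the same time, which is the case that fails above.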