run-house / runhouse

Dispatch and distribute your ML training to "serverless" clusters in Python, like PyTorch for ML infra. Iterable, debuggable, multi-cloud/on-prem, identical across research and production.
https://run.house
Apache License 2.0

Consistently hit "BaseSSHTunnelForwarderError" #90

Closed htang2012 closed 1 year ago

htang2012 commented 1 year ago

Describe the bug

Hi, with runhouse version 0.0.9 I consistently hit an error when running the following script (it worked with the previous version):

import runhouse as rh
gpu = rh.cluster(ips=['127.0.0.1'],
                 ssh_creds={'ssh_user': 'rhclient', 'ssh_private_key': '/home/rhclient/.ssh/id_rsa'},
                 name='rh-cls')
print("#################Restart server")
print("Exit now")
....

INFO | 2023-07-31 18:30:20,983 | No auth token provided, so not using RNS API to save and load configs
INFO | 2023-07-31 18:30:21,832 | Connected (version 2.0, client OpenSSH_8.9p1)
INFO | 2023-07-31 18:30:21,944 | Authentication (publickey) failed.
INFO | 2023-07-31 18:30:21,951 | Connected (version 2.0, client OpenSSH_8.9p1)
INFO | 2023-07-31 18:30:22,010 | Authentication (publickey) failed.
2023-07-31 18:30:22,010| ERROR | Could not open connection to gateway
ERROR | 2023-07-31 18:30:22,010 | Could not open connection to gateway
2023-07-31 18:30:22,011| ERROR | Problem setting SSH Forwarder up: Couldn't open tunnel :50052 <> 127.0.0.1:50052 might be in use or destination not reachable
ERROR | 2023-07-31 18:30:22,011 | Problem setting SSH Forwarder up: Couldn't open tunnel :50052 <> 127.0.0.1:50052 might be in use or destination not reachable
INFO | 2023-07-31 18:30:22,011 | Server rh-cls is up, but the HTTP server may not be up.
INFO | 2023-07-31 18:30:22,011 | Restarting HTTP server on rh-cls.
INFO | 2023-07-31 18:30:22,011 | Running command on rh-cls: pkill -f "python -m runhouse.servers.http.http_server"
Warning: Permanently added '127.0.0.1' (ED25519) to the list of known hosts.
Permission denied, please try again.
Permission denied, please try again.
rhclient@127.0.0.1: Permission denied (publickey,password).
INFO | 2023-07-31 18:30:22,123 | Running command on rh-cls: screen -dm bash -c 'python -m runhouse.servers.http.http_server |& tee -a ~/.rh/cluster_server_rh-cls.log 2>&1'
Warning: Permanently added '127.0.0.1' (ED25519) to the list of known hosts.
Permission denied, please try again.
Permission denied, please try again.
rhclient@127.0.0.1: Permission denied (publickey,password).
INFO | 2023-07-31 18:30:27,237 | Checking server rh-cls again.
Traceback (most recent call last):
  File "/home/rhclient/.local/lib/python3.10/site-packages/runhouse/rns/hardware/cluster.py", line 357, in check_server
    self.connect_server_client()
  File "/home/rhclient/.local/lib/python3.10/site-packages/runhouse/rns/hardware/cluster.py", line 324, in connect_server_client
    self._rpc_tunnel, connected_port = self.ssh_tunnel(
  File "/home/rhclient/.local/lib/python3.10/site-packages/runhouse/rns/hardware/cluster.py", line 411, in ssh_tunnel
    ssh_tunnel.start()
  File "/home/rhclient/.local/lib/python3.10/site-packages/sshtunnel.py", line 1331, in start
    self._raise(BaseSSHTunnelForwarderError,
  File "/home/rhclient/.local/lib/python3.10/site-packages/sshtunnel.py", line 1174, in _raise
    raise exception(reason)
sshtunnel.BaseSSHTunnelForwarderError: Could not establish session to SSH gateway

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/devspace/test_self_hosted_llm.py", line 14, in <module>
    gpu = rh.cluster(ips=['127.0.0.1'],
  File "/home/rhclient/.local/lib/python3.10/site-packages/runhouse/rns/hardware/cluster_factory.py", line 59, in cluster
    return Cluster(ips=ips, ssh_creds=ssh_creds, name=name, dryrun=dryrun)
  File "/home/rhclient/.local/lib/python3.10/site-packages/runhouse/rns/hardware/cluster.py", line 58, in __init__
    self.check_server()
  File "/home/rhclient/.local/lib/python3.10/site-packages/runhouse/rns/hardware/cluster.py", line 379, in check_server
    self.client.check_server(cluster_config=cluster_config)
AttributeError: 'NoneType' object has no attribute 'check_server'
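
The log above shows "Authentication (publickey) failed" before the tunnel error, so one thing worth ruling out is the local private key file itself. The following is a minimal sketch of such a pre-flight check using only the Python standard library; `check_private_key` is a hypothetical helper, not part of runhouse, and it only inspects the client side. A failure could equally come from the server's `authorized_keys`, which this does not examine.

```python
import os
import stat
from pathlib import Path


def check_private_key(path: str) -> list[str]:
    """Return human-readable problems found with a local SSH private key file.

    Hypothetical helper for illustration; an empty list means no problem
    was detected by these (limited) checks.
    """
    key = Path(path).expanduser()
    if not key.is_file():
        return [f"{key} does not exist or is not a regular file"]

    problems = []
    # The OpenSSH client rejects private keys that group/others can access.
    mode = stat.S_IMODE(key.stat().st_mode)
    if mode & (stat.S_IRWXG | stat.S_IRWXO):
        problems.append(
            f"{key} has mode {oct(mode)}; expected no group/other access (e.g. 0o600)"
        )
    if not os.access(key, os.R_OK):
        problems.append(f"{key} is not readable by the current user")
    return problems
```

Calling `check_private_key('/home/rhclient/.ssh/id_rsa')` with the path from the script would return an empty list if the file exists, is readable, and is not accessible to group or others.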

Versions

Please run the following and paste the output below.


# For security purposes, please check the contents of collect_env.py before running it.
python collect_env.py

Python Platform: Linux-5.15.0-60-lowlatency-x86_64-with-glibc2.35
Python Version: 3.10.6 (main, Mar 10 2023, 10:55:28) [GCC 11.3.0]
Relevant packages:
boto3==1.28.15
fastapi==0.99.0
fsspec==2023.5.0
pyarrow==12.0.1
pycryptodome==3.12.0
rich==13.5.1
runhouse==0.0.9
skypilot==0.3.3
sshfs==2023.7.0
sshtunnel==0.4.0
typer==0.9.0
uvicorn==0.23.2
wheel==0.38.4

Additional context

I started:
1) ray start --head
2) runhouse login ...
3) python -m runhouse.servers.http.http_server
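
Since the error message says the tunnel target "might be in use or destination not reachable", a quick way to separate a network problem from an authentication problem is to test whether the relevant ports accept TCP connections at all. This is a minimal sketch using the standard library; `port_is_open` is a hypothetical helper, port 50052 is taken from the log above, and port 22 is assumed to be where sshd listens.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        # create_connection raises OSError on refusal, timeout, or unreachable host.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

For example, `port_is_open('127.0.0.1', 22)` checks that an SSH daemon is listening, and `port_is_open('127.0.0.1', 50052)` checks the server port named in the log. If both are reachable, the "Permission denied (publickey,password)" lines point toward credentials rather than connectivity.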

carolineechen commented 1 year ago

Hi @htang2012, thanks for raising this. I will take a look into this, and I had a couple of questions first:

htang2012 commented 1 year ago

Hi @carolineechen, it works with the recent changes. Right now I can only create two Docker containers, both of them running under the root account.