run-house / runhouse

Dispatch and distribute your ML training to "serverless" clusters in Python, like PyTorch for ML infra. Iterable, debuggable, multi-cloud/on-prem, identical across research and production.
https://run.house
Apache License 2.0

Need help with local gpu system #88

Closed rdabane closed 5 months ago

rdabane commented 1 year ago

Describe the bug

Hi, I'm trying to use a GPU system on our local network, but I'm running into issues. Basic question: does the runhouse package need to be installed on the remote GPU system? I couldn't figure this out from the documentation.

Here is the snippet of code I'm trying to run:

import runhouse as rh

import pdb;pdb.set_trace()
cluster = rh.cluster(
              name="mlw-cluster",
              ips=['xx.xx.xx.xx'],
              ssh_creds={'ssh_user': 'lab', 'ssh_private_key':'/export/lab/.ssh/mlw01.key'},
          )

def num_cpus():
    import multiprocessing
    return f"Num cpus: {multiprocessing.cpu_count()}"

num_cpus()
num_cpus_cluster = rh.function(name="num_cpus_cluster", fn=num_cpus).to(system=cluster, reqs=["./"])

I get the following error when creating the cluster:


(Pdb) c
2023-07-20 10:17:54,985| WAR | MainThrea/1032@sshtunnel | Could not read SSH configuration file: ~/.ssh/config
WARNING | 2023-07-20 10:17:54,985 | Could not read SSH configuration file: ~/.ssh/config
2023-07-20 10:17:54,987| INF | MainThrea/1060@sshtunnel | 1 keys loaded from agent
INFO | 2023-07-20 10:17:54,987 | 1 keys loaded from agent
2023-07-20 10:17:54,988| INF | MainThrea/1117@sshtunnel | 1 key(s) loaded
INFO | 2023-07-20 10:17:54,988 | 1 key(s) loaded
2023-07-20 10:17:54,988| ERR | MainThrea/1314@sshtunnel | Password is required for key /export/lab/.ssh/mlw01.key
ERROR | 2023-07-20 10:17:54,988 | Password is required for key /export/lab/.ssh/mlw01.key
2023-07-20 10:17:54,988| INF | MainThrea/0978@sshtunnel | Connecting to gateway: xx.x.xxx.x:22 as user 'lab'
INFO | 2023-07-20 10:17:54,988 | Connecting to gateway: 172.17.10.110:22 as user 'lab'
2023-07-20 10:17:54,988| DEB | MainThrea/0983@sshtunnel | Concurrent connections allowed: True
2023-07-20 10:17:54,989| DEB | MainThrea/1400@sshtunnel | Trying to log in with key: b'asdWEQWEQWe'
2023-07-20 10:17:55,012| DEB | MainThrea/1204@sshtunnel | Transport socket info: (<AddressFamily.AF_INET: 2>, <SocketKind.SOCK_STREAM: 1>, 0), timeout=0.1
2023-07-20 10:17:55,043| INF |  Thread-1/1893@transport | Connected (version 2.0, client OpenSSH_7.6p1)
INFO | 2023-07-20 10:17:55,043 | Connected (version 2.0, client OpenSSH_7.6p1)
2023-07-20 10:17:55,278| INF |  Thread-1/1893@transport | Authentication (publickey) successful!
INFO | 2023-07-20 10:17:55,278 | Authentication (publickey) successful!
2023-07-20 10:17:55,279| ERR | MainThrea/1230@sshtunnel | Problem setting SSH Forwarder up: Couldn't open tunnel :50052 <> 127.0.0.1:50052 might be in use or destination not reachable
ERROR | 2023-07-20 10:17:55,279 | Problem setting SSH Forwarder up: Couldn't open tunnel :50052 <> 127.0.0.1:50052 might be in use or destination not reachable
2023-07-20 10:17:55,280| WAR | MainThrea/1032@sshtunnel | Could not read SSH configuration file: ~/.ssh/config
WARNING | 2023-07-20 10:17:55,280 | Could not read SSH configuration file: ~/.ssh/config
2023-07-20 10:17:55,282| INF | MainThrea/1060@sshtunnel | 1 keys loaded from agent
INFO | 2023-07-20 10:17:55,282 | 1 keys loaded from agent
2023-07-20 10:17:55,282| INF | MainThrea/1117@sshtunnel | 1 key(s) loaded
INFO | 2023-07-20 10:17:55,282 | 1 key(s) loaded
2023-07-20 10:17:55,283| ERR | MainThrea/1314@sshtunnel | Password is required for key /export/lab/.ssh/mlw01.key
ERROR | 2023-07-20 10:17:55,283 | Password is required for key /export/lab/.ssh/mlw01.key
2023-07-20 10:17:55,283| INF | MainThrea/0978@sshtunnel | Connecting to gateway: 172.17.10.110:22 as user 'lab'
INFO | 2023-07-20 10:17:55,283 | Connecting to gateway: 172.17.10.110:22 as user 'lab'
2023-07-20 10:17:55,283| DEB | MainThrea/0983@sshtunnel | Concurrent connections allowed: True
2023-07-20 10:17:55,283| WAR | MainThrea/1618@sshtunnel | It looks like you didn't call the .stop() before the SSHTunnelForwarder obj was collected by the garbage collector! Running .stop(force=True)
WARNING | 2023-07-20 10:17:55,283 | It looks like you didn't call the .stop() before the SSHTunnelForwarder obj was collected by the garbage collector! Running .stop(force=True)
2023-07-20 10:17:55,284| INF | MainThrea/1374@sshtunnel | Closing all open connections...
INFO | 2023-07-20 10:17:55,284 | Closing all open connections...
2023-07-20 10:17:55,284| DEB | MainThrea/1378@sshtunnel | Listening tunnels: None
2023-07-20 10:17:55,284| WAR | MainThrea/1450@sshtunnel | Tunnels are not started. Please .start() first!
WARNING | 2023-07-20 10:17:55,284 | Tunnels are not started. Please .start() first!
2023-07-20 10:17:55,284| INF | MainThrea/1453@sshtunnel | Closing ssh transport
INFO | 2023-07-20 10:17:55,284 | Closing ssh transport
2023-07-20 10:17:55,284| DEB | MainThrea/1477@sshtunnel | Transport is closed
2023-07-20 10:17:55,285| DEB | MainThrea/1400@sshtunnel | Trying to log in with key: b'463095aa1803da78647cd548f37173ef'
2023-07-20 10:17:55,305| DEB | MainThrea/1204@sshtunnel | Transport socket info: (<AddressFamily.AF_INET: 2>, <SocketKind.SOCK_STREAM: 1>, 0), timeout=0.1
2023-07-20 10:17:55,334| INF |  Thread-3/1893@transport | Connected (version 2.0, client OpenSSH_7.6p1)
INFO | 2023-07-20 10:17:55,334 | Connected (version 2.0, client OpenSSH_7.6p1)
2023-07-20 10:17:55,578| INF |  Thread-3/1893@transport | Authentication (publickey) successful!
INFO | 2023-07-20 10:17:55,578 | Authentication (publickey) successful!
2023-07-20 10:17:55,579| INF | Srv-50053/1433@sshtunnel | Opening tunnel: 0.0.0.0:50053 <> 127.0.0.1:50052
INFO | 2023-07-20 10:17:55,579 | Opening tunnel: 0.0.0.0:50053 <> 127.0.0.1:50052
INFO | 2023-07-20 10:17:55,580 | Checking server mlw-cluster
2023-07-20 10:17:55,814| TRA | Thread-5 /0360@sshtunnel | #1 <-- ('127.0.0.1', 44364) connected
2023-07-20 10:17:55,815| TRA | Thread-5 /0316@sshtunnel | >>> OUT #1 <-- ('127.0.0.1', 44364) send to ('127.0.0.1', 50052): b'504f5354202f636865636b2f20485454502f312e310d0a486f73743a203132372e302e302e313a35303035330d0a557365722d4167656e743a20707974686f6e2d72657175657374732f322e33312e300d0a4163636570742d456e636f64696e673a20677a69702c206465666c6174650d0a4163636570743a202a2f2a0d0a436f6e6e656374696f6e3a206b6565702d616c6976650d0a436f6e74656e742d4c656e6774683a203330300d0a436f6e74656e742d547970653a206170706c69636174696f6e2f6a736f6e0d0a0d0a7b2264617461223a20227b5c6e202020205c226e616d655c223a205c227e2f6d6c772d636c75737465725c222c5c6e202020205c227265736f757263655f747970655c223a205c22636c75737465725c222c5c6e202020205c227265736f757263655f737562747970655c223a205c22436c75737465725c222c5c6e202020205c226970735c223a205b5c6e20202020202020205c223137322e31372e31302e3131305c225c6e202020205d2c5c6e202020205c227373685f63726564735c223a207b5c6e20202020202020205c227373685f757365725c223a205c226c61625c222c5c6e20202020202020205c227373685f707269766174655f6b65795c223a205c222f6578706f72742f6c61622f2e7373682f6d6c7730312e6b65795c225c6e202020207d5c6e7d227d' >>>
2023-07-20 10:17:55,816| TRA | Thread-5 /0333@sshtunnel | <<< IN #1 <-- ('127.0.0.1', 44364) recv: b'5353482d322e302d4f70656e5353485f372e367031205562756e74752d347562756e7475302e350d0a' <<<
INFO | 2023-07-20 10:17:55,816 | Server mlw-cluster is up, but the HTTP server may not be up.
INFO | 2023-07-20 10:17:55,817 | Restarting HTTP server on mlw-cluster.
INFO | 2023-07-20 10:17:55,817 | Running command on mlw-cluster: pkill -f "python -m runhouse.servers.http.http_server"
2023-07-20 10:17:55,817| TRA | Thread-5 /0311@sshtunnel | >>> OUT #1 <-- ('127.0.0.1', 44364) recv empty data >>>
2023-07-20 10:17:55,820| TRA | Thread-5 /0375@sshtunnel | #1 <-- ('127.0.0.1', 44364) connection closed.
INFO | 2023-07-20 10:17:56,571 | Running command on mlw-cluster: screen -dm bash -c 'python -m runhouse.servers.http.http_server |& tee -a ~/.rh/cluster_server_mlw-cluster.log 2>&1'
INFO | 2023-07-20 10:18:02,291 | Checking server mlw-cluster again.
2023-07-20 10:18:02,318| ERR |  Thread-3/1893@transport | Secsh channel 1 open FAILED: Connection refused: Connect failed
ERROR | 2023-07-20 10:18:02,318 | Secsh channel 1 open FAILED: Connection refused: Connect failed
2023-07-20 10:18:02,318| TRA | Thread-14/0357@sshtunnel | #2 <-- ('127.0.0.1', 47456) open new channel ssh error: ChannelException(2, 'Connect failed')
2023-07-20 10:18:02,318| ERR | Thread-14/0394@sshtunnel | Could not establish connection from local ('127.0.0.1', 50053) to remote ('127.0.0.1', 50052) side of the tunnel: open new channel ssh error: ChannelException(2, 'Connect failed')
ERROR | 2023-07-20 10:18:02,318 | Could not establish connection from local ('127.0.0.1', 50053) to remote ('127.0.0.1', 50052) side of the tunnel: open new channel ssh error: ChannelException(2, 'Connect failed')
Traceback (most recent call last):
  File "/export/lab/anaconda3/envs/runhouse/lib/python3.10/site-packages/urllib3/connectionpool.py", line 714, in urlopen
    httplib_response = self._make_request(
  File "/export/lab/anaconda3/envs/runhouse/lib/python3.10/site-packages/urllib3/connectionpool.py", line 466, in _make_request
    six.raise_from(e, None)
  File "<string>", line 3, in raise_from
  File "/export/lab/anaconda3/envs/runhouse/lib/python3.10/site-packages/urllib3/connectionpool.py", line 461, in _make_request
    httplib_response = conn.getresponse()
  File "/export/lab/anaconda3/envs/runhouse/lib/python3.10/http/client.py", line 1375, in getresponse
    response.begin()
  File "/export/lab/anaconda3/envs/runhouse/lib/python3.10/http/client.py", line 318, in begin
    version, status, reason = self._read_status()
  File "/export/lab/anaconda3/envs/runhouse/lib/python3.10/http/client.py", line 287, in _read_status
    raise RemoteDisconnected("Remote end closed connection without"
http.client.RemoteDisconnected: Remote end closed connection without response

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/export/lab/anaconda3/envs/runhouse/lib/python3.10/site-packages/requests/adapters.py", line 486, in send
    resp = conn.urlopen(
  File "/export/lab/anaconda3/envs/runhouse/lib/python3.10/site-packages/urllib3/connectionpool.py", line 798, in urlopen
    retries = retries.increment(
  File "/export/lab/anaconda3/envs/runhouse/lib/python3.10/site-packages/urllib3/util/retry.py", line 550, in increment
    raise six.reraise(type(error), error, _stacktrace)
  File "/export/lab/anaconda3/envs/runhouse/lib/python3.10/site-packages/urllib3/packages/six.py", line 769, in reraise
    raise value.with_traceback(tb)
  File "/export/lab/anaconda3/envs/runhouse/lib/python3.10/site-packages/urllib3/connectionpool.py", line 714, in urlopen
    httplib_response = self._make_request(
  File "/export/lab/anaconda3/envs/runhouse/lib/python3.10/site-packages/urllib3/connectionpool.py", line 466, in _make_request
    six.raise_from(e, None)
  File "<string>", line 3, in raise_from
  File "/export/lab/anaconda3/envs/runhouse/lib/python3.10/site-packages/urllib3/connectionpool.py", line 461, in _make_request
    httplib_response = conn.getresponse()
  File "/export/lab/anaconda3/envs/runhouse/lib/python3.10/http/client.py", line 1375, in getresponse
    response.begin()
  File "/export/lab/anaconda3/envs/runhouse/lib/python3.10/http/client.py", line 318, in begin
    version, status, reason = self._read_status()
  File "/export/lab/anaconda3/envs/runhouse/lib/python3.10/http/client.py", line 287, in _read_status
    raise RemoteDisconnected("Remote end closed connection without"
urllib3.exceptions.ProtocolError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/export/lab/work/learn_runhouse/testmlw01.py", line 4, in <module>
    cluster = rh.cluster(
  File "/export/lab/anaconda3/envs/runhouse/lib/python3.10/site-packages/runhouse/rns/hardware/cluster_factory.py", line 59, in cluster
    return Cluster(ips=ips, ssh_creds=ssh_creds, name=name, dryrun=dryrun)
  File "/export/lab/anaconda3/envs/runhouse/lib/python3.10/site-packages/runhouse/rns/hardware/cluster.py", line 60, in __init__
    self.check_server()
  File "/export/lab/anaconda3/envs/runhouse/lib/python3.10/site-packages/runhouse/rns/hardware/cluster.py", line 381, in check_server
    self.client.check_server(cluster_config=cluster_config)
  File "/export/lab/anaconda3/envs/runhouse/lib/python3.10/site-packages/runhouse/servers/http/http_client.py", line 48, in check_server
    self.request(
  File "/export/lab/anaconda3/envs/runhouse/lib/python3.10/site-packages/runhouse/servers/http/http_client.py", line 35, in request
    response = req_fn(
  File "/export/lab/anaconda3/envs/runhouse/lib/python3.10/site-packages/requests/api.py", line 115, in post
    return request("post", url, data=data, json=json, **kwargs)
  File "/export/lab/anaconda3/envs/runhouse/lib/python3.10/site-packages/requests/api.py", line 59, in request
    return session.request(method=method, url=url, **kwargs)
  File "/export/lab/anaconda3/envs/runhouse/lib/python3.10/site-packages/requests/sessions.py", line 589, in request
    resp = self.send(prep, **send_kwargs)
  File "/export/lab/anaconda3/envs/runhouse/lib/python3.10/site-packages/requests/sessions.py", line 703, in send
    r = adapter.send(request, **kwargs)
  File "/export/lab/anaconda3/envs/runhouse/lib/python3.10/site-packages/requests/adapters.py", line 501, in send
    raise ConnectionError(err, request=request)
requests.exceptions.ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
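
The hex payloads in the sshtunnel TRA lines of the log above can be decoded to see what actually crossed the tunnel. The snippet below is a minimal sketch using only the standard library; the two hex strings are copied from the log (the outgoing one is truncated to its first line), and the variable and function names are illustrative, not part of runhouse or sshtunnel.

```python
# Hex payloads copied from the sshtunnel TRA lines in the log above.
# sent_prefix_hex is only the first line of the much longer ">>> OUT" payload.
sent_prefix_hex = "504f5354202f636865636b2f20485454502f312e310d0a"
received_hex = (
    "5353482d322e302d4f70656e5353485f372e367031"
    "205562756e74752d347562756e7475302e350d0a"
)


def decode_hex(payload: str) -> str:
    """Turn a hex dump from the trace log back into ASCII text."""
    return bytes.fromhex(payload).decode("ascii")


sent_line = decode_hex(sent_prefix_hex)
received_line = decode_hex(received_hex)

print(repr(sent_line))      # what the client wrote into the tunnel
print(repr(received_line))  # what came back through the tunnel
```

Decoded, the client sent an HTTP `POST /check/` request, while the reply begins with an OpenSSH banner rather than an HTTP status line. The log alone does not show why, but it is consistent with the HTTP server not answering on port 50052 at that point.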

Versions Please run the following and paste the output below.

wget https://raw.githubusercontent.com/run-house/runhouse/main/collect_env.py
# For security purposes, please check the contents of collect_env.py before running it.
python collect_env.py

Python Platform: Linux-5.19.0-46-generic-x86_64-with-glibc2.35
Python Version: 3.10.12 (main, Jul  5 2023, 18:54:27) [GCC 11.2.0]

Relevant packages: 
boto3==1.28.6
fastapi==0.99.0
fsspec==2023.6.0
pyarrow==12.0.1
pycryptodome==3.12.0
rich==13.4.2
runhouse==0.0.9
skypilot==0.3.3
sshfs==2023.7.0
sshtunnel==0.4.0
typer==0.9.0
uvicorn==0.23.1
wheel==0.38.4

SkyPilot collects usage data to improve its services. `setup` and `run` commands are not collected to ensure privacy.
Usage logging can be disabled by setting the environment variable SKYPILOT_DISABLE_USAGE_COLLECTION=1.
Checking credentials to enable clouds for SkyPilot.
  AWS: disabled          
    Reason: AWS credentials are not set. Run the following commands:
      $ pip install boto3
      $ aws configure
      $ aws configure list  # Ensure that this shows identity is set.
    For more info: https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-quickstart.html
    Details: `aws sts get-caller-identity` failed with error: [botocore.exceptions.NoCredentialsError] Unable to locate credentials.
  Azure: disabled          
    Reason: ~/.azure/msal_token_cache.json does not exist. Run the following commands:
      $ az login
      $ az account set -s <subscription_id>
    For more info: https://docs.microsoft.com/en-us/cli/azure/get-started-with-azure-cli
  GCP: disabled          
    Reason: GCP tools are not installed. Run the following commands:
      $ pip install google-api-python-client
      $ conda install -c conda-forge google-cloud-sdk -y
    Credentials may also need to be set. Run the following commands:
      $ gcloud init
      $ gcloud auth application-default login
    For more info: https://skypilot.readthedocs.io/en/latest/getting-started/installation.html#google-cloud-platform-gcp
    Details: [builtins.ModuleNotFoundError] No module named 'googleapiclient'
  Lambda: disabled          
    Reason: Failed to access Lambda Cloud with credentials. To configure credentials, go to:
      https://cloud.lambdalabs.com/api-keys
    to generate API key and add the line
      api_key = [YOUR API KEY]
    to ~/.lambda_cloud/lambda_keys
  IBM: disabled          
    Reason: Missing credential file at /export/lab/.ibm/credentials.yaml.
    Store your API key and Resource Group id in ~/.ibm/credentials.yaml in the following format:
      iam_api_key: <IAM_API_KEY>
      resource_group_id: <RESOURCE_GROUP_ID>
  SCP: disabled          
    Reason: Failed to access SCP with credentials. To configure credentials, see: https://cloud.samsungsds.com/openapiguide
    Generate API key and add the following line to ~/.scp/scp_credential:
      access_key = [YOUR API ACCESS KEY]
      secret_key = [YOUR API SECRET KEY]
      project_id = [YOUR PROJECT ID]
  OCI: disabled          
    Reason: `oci` is not installed. Install it with: pip install oci
    For more details, refer to: https://skypilot.readthedocs.io/en/latest/getting-started/installation.html#oracle-cloud-infrastructure-oci
  Cloudflare (for R2 object store): disabled          
    Reason: [r2] profile is not set in ~/.cloudflare/r2.credentials. Additionally, Account ID from R2 dashboard is not set. Run the following commands:
      $ pip install boto3
      $ AWS_SHARED_CREDENTIALS_FILE=~/.cloudflare/r2.credentials aws configure --profile r2
      $ mkdir -p ~/.cloudflare
      $ echo <YOUR_ACCOUNT_ID_HERE> > ~/.cloudflare/accountid
    For more info: https://skypilot.readthedocs.io/en/latest/getting-started/installation.html#cloudflare-r2

SkyPilot will use only the enabled clouds to run tasks. To change this, configure cloud credentials, and run sky check.
If any problems remain, please file an issue at https://github.com/skypilot-org/skypilot/issues/new
Clusters
No existing clusters.

Managed spot jobs
No in progress jobs. (See: sky spot -h)


dongreenberg commented 1 year ago

Hi there, I'm a bit confused by your output, because it looks like you need a password for that ssh key, but then it looks like it connects anyway? Are you using a .ssh/config file?

Also, as an aside, you likely need to wrap the end of your script in an if __name__ == "__main__": block so that the code doesn't run when it's imported on the cluster, like so:

if __name__ == '__main__':
    num_cpus()
    num_cpus_cluster = rh.function(name="num_cpus_cluster", fn=num_cpus).to(system=cluster, reqs=["./"])

rdabane commented 1 year ago

Hi, thanks for the quick reply. I've modified the script by wrapping the end in an if __name__ == '__main__': block.

Yes, it looks like it asks for a password but connects anyway. It then seems to open a tunnel, after which it tries to bring up the HTTP server, fails to do so, and errors out.

Q. Does the runhouse package need to be installed on the remote system?

I don't know how to debug this. Is it possible to reduce the issue to a smaller SSH tunnel command?
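
One way to narrow this down without runhouse or the tunnel is to check, directly on the remote machine, whether anything is listening on the port the tunnel targets (127.0.0.1:50052 in the logs above). The following is a minimal sketch using only the standard library; `port_is_open` is a hypothetical helper name, not a runhouse function.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to (host, port) can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers "connection refused", timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    # Run this on the remote GPU machine itself, after the client has tried
    # to start the HTTP server. Port 50052 is taken from the logs above.
    print("50052 listening:", port_is_open("127.0.0.1", 50052))
```

If this prints False on the remote machine, the "Connection refused" errors in the tunnel log most likely mean the server process never started, rather than that the SSH tunnel itself is broken.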

Ankit-Dhankhar commented 1 year ago

Hi @rdabane, I faced a similar error while setting up runhouse with my local GPU cluster. In my case it was because python was not a recognized command, so python -m runhouse.servers.http.http_server did not execute successfully.

It worked for me after running the following command: sudo apt install python-is-python3

The cause is not obvious from my error logs, though:

(ankit)➜  ankit python temp.py                                                                    
INFO | 2023-08-06 22:51:42,948 | Connected (version 2.0, client OpenSSH_8.2p1)
INFO | 2023-08-06 22:51:44,075 | Authentication (publickey) successful!
INFO | 2023-08-06 22:51:44,075 | Running command on antbit-ray-cluster: ray start --head
INFO | 2023-08-06 22:51:47,044 | Connected (version 2.0, client OpenSSH_8.2p1)
INFO | 2023-08-06 22:51:48,171 | Authentication (publickey) successful!
INFO | 2023-08-06 22:51:48,171 | Running command on antbit-ray-cluster: pip freeze
INFO | 2023-08-06 22:51:50,833 | Connected (version 2.0, client OpenSSH_8.2p1)
INFO | 2023-08-06 22:51:51,857 | Authentication (publickey) successful!
INFO | 2023-08-06 22:51:51,858 | Running command on antbit-ray-cluster: pip install ray==2.4.0
INFO | 2023-08-06 22:51:56,466 | Connected (version 2.0, client OpenSSH_8.2p1)
INFO | 2023-08-06 22:51:57,489 | Authentication (publickey) successful!
INFO | 2023-08-06 22:51:57,490 | Running command on antbit-ray-cluster: ray start --head
INFO | 2023-08-06 22:52:00,219 | Connected (version 2.0, client OpenSSH_8.2p1)
INFO | 2023-08-06 22:52:01,483 | Authentication (publickey) successful!
2023-08-06 22:52:01,483| ERROR   | Problem setting SSH Forwarder up: Couldn't open tunnel :50052 <> 127.0.0.1:50052 might be in use or destination not reachable
ERROR | 2023-08-06 22:52:01,483 | Problem setting SSH Forwarder up: Couldn't open tunnel :50052 <> 127.0.0.1:50052 might be in use or destination not reachable
INFO | 2023-08-06 22:52:02,098 | Connected (version 2.0, client OpenSSH_8.2p1)
INFO | 2023-08-06 22:52:03,227 | Authentication (publickey) successful!
INFO | 2023-08-06 22:52:03,228 | Checking server antbit-ray-cluster
2023-08-06 22:52:03,889| ERROR   | Secsh channel 0 open FAILED: Connection refused: Connect failed
ERROR | 2023-08-06 22:52:03,889 | Secsh channel 0 open FAILED: Connection refused: Connect failed
2023-08-06 22:52:03,893| ERROR   | Could not establish connection from local ('127.0.0.1', 50053) to remote ('127.0.0.1', 50052) side of the tunnel: open new channel ssh error: ChannelException(2, 'Connect failed')
ERROR | 2023-08-06 22:52:03,893 | Could not establish connection from local ('127.0.0.1', 50053) to remote ('127.0.0.1', 50052) side of the tunnel: open new channel ssh error: ChannelException(2, 'Connect failed')
INFO | 2023-08-06 22:52:03,899 | Server antbit-ray-cluster is up, but the HTTP server may not be up.
INFO | 2023-08-06 22:52:03,899 | Restarting HTTP server on antbit-ray-cluster.
INFO | 2023-08-06 22:52:04,228 | Connected (version 2.0, client OpenSSH_8.2p1)
INFO | 2023-08-06 22:52:05,477 | Authentication (publickey) successful!
INFO | 2023-08-06 22:52:05,477 | Running command on antbit-ray-cluster: pip install runhouse==0.0.10
INFO | 2023-08-06 22:52:09,238 | Connected (version 2.0, client OpenSSH_8.2p1)
INFO | 2023-08-06 22:52:10,393 | Authentication (publickey) successful!
INFO | 2023-08-06 22:52:10,393 | Running command on antbit-ray-cluster: pkill -f "python -m runhouse.servers.http.http_server"
INFO | 2023-08-06 22:52:11,417 | Running command on antbit-ray-cluster: screen -dm bash -c "python -m runhouse.servers.http.http_server |& tee -a '~/.rh/cluster_server_antbit-ray-cluster.log' 2>&1"
INFO | 2023-08-06 22:52:17,241 | Checking server antbit-ray-cluster again [1/5].
2023-08-06 22:52:17,426| ERROR   | Secsh channel 1 open FAILED: Connection refused: Connect failed
ERROR | 2023-08-06 22:52:17,426 | Secsh channel 1 open FAILED: Connection refused: Connect failed
2023-08-06 22:52:17,429| ERROR   | Could not establish connection from local ('127.0.0.1', 50053) to remote ('127.0.0.1', 50052) side of the tunnel: open new channel ssh error: ChannelException(2, 'Connect failed')
ERROR | 2023-08-06 22:52:17,429 | Could not establish connection from local ('127.0.0.1', 50053) to remote ('127.0.0.1', 50052) side of the tunnel: open new channel ssh error: ChannelException(2, 'Connect failed')
INFO | 2023-08-06 22:52:22,437 | Checking server antbit-ray-cluster again [2/5].
2023-08-06 22:52:22,665| ERROR   | Secsh channel 2 open FAILED: Connection refused: Connect failed
ERROR | 2023-08-06 22:52:22,665 | Secsh channel 2 open FAILED: Connection refused: Connect failed
2023-08-06 22:52:22,669| ERROR   | Could not establish connection from local ('127.0.0.1', 50053) to remote ('127.0.0.1', 50052) side of the tunnel: open new channel ssh error: ChannelException(2, 'Connect failed')
ERROR | 2023-08-06 22:52:22,669 | Could not establish connection from local ('127.0.0.1', 50053) to remote ('127.0.0.1', 50052) side of the tunnel: open new channel ssh error: ChannelException(2, 'Connect failed')
INFO | 2023-08-06 22:52:27,679 | Checking server antbit-ray-cluster again [3/5].
2023-08-06 22:52:27,903| ERROR   | Secsh channel 3 open FAILED: Connection refused: Connect failed
ERROR | 2023-08-06 22:52:27,903 | Secsh channel 3 open FAILED: Connection refused: Connect failed
2023-08-06 22:52:27,906| ERROR   | Could not establish connection from local ('127.0.0.1', 50053) to remote ('127.0.0.1', 50052) side of the tunnel: open new channel ssh error: ChannelException(2, 'Connect failed')
ERROR | 2023-08-06 22:52:27,906 | Could not establish connection from local ('127.0.0.1', 50053) to remote ('127.0.0.1', 50052) side of the tunnel: open new channel ssh error: ChannelException(2, 'Connect failed')
INFO | 2023-08-06 22:52:32,915 | Checking server antbit-ray-cluster again [4/5].
2023-08-06 22:52:33,080| ERROR   | Secsh channel 4 open FAILED: Connection refused: Connect failed
ERROR | 2023-08-06 22:52:33,080 | Secsh channel 4 open FAILED: Connection refused: Connect failed
2023-08-06 22:52:33,083| ERROR   | Could not establish connection from local ('127.0.0.1', 50053) to remote ('127.0.0.1', 50052) side of the tunnel: open new channel ssh error: ChannelException(2, 'Connect failed')
ERROR | 2023-08-06 22:52:33,083 | Could not establish connection from local ('127.0.0.1', 50053) to remote ('127.0.0.1', 50052) side of the tunnel: open new channel ssh error: ChannelException(2, 'Connect failed')
INFO | 2023-08-06 22:52:38,093 | Checking server antbit-ray-cluster again [5/5].
2023-08-06 22:52:38,404| ERROR   | Secsh channel 5 open FAILED: Connection refused: Connect failed
ERROR | 2023-08-06 22:52:38,404 | Secsh channel 5 open FAILED: Connection refused: Connect failed
2023-08-06 22:52:38,408| ERROR   | Could not establish connection from local ('127.0.0.1', 50053) to remote ('127.0.0.1', 50052) side of the tunnel: open new channel ssh error: ChannelException(2, 'Connect failed')
ERROR | 2023-08-06 22:52:38,408 | Could not establish connection from local ('127.0.0.1', 50053) to remote ('127.0.0.1', 50052) side of the tunnel: open new channel ssh error: ChannelException(2, 'Connect failed')
(antbit)➜  antbit python temp.py
INFO | 2023-08-06 22:54:51,060 | Connected (version 2.0, client OpenSSH_8.2p1)
INFO | 2023-08-06 22:54:52,033 | Authentication (publickey) successful!
INFO | 2023-08-06 22:54:52,033 | Running command on antbit-ray-cluster: ray start --head
INFO | 2023-08-06 22:54:55,258 | Connected (version 2.0, client OpenSSH_8.2p1)
INFO | 2023-08-06 22:54:56,282 | Authentication (publickey) successful!
INFO | 2023-08-06 22:54:56,283 | Running command on antbit-ray-cluster: pip freeze
INFO | 2023-08-06 22:54:59,932 | Connected (version 2.0, client OpenSSH_8.2p1)
INFO | 2023-08-06 22:55:00,993 | Authentication (publickey) successful!
INFO | 2023-08-06 22:55:00,993 | Running command on antbit-ray-cluster: pip install ray==2.4.0
INFO | 2023-08-06 22:55:04,474 | Connected (version 2.0, client OpenSSH_8.2p1)
INFO | 2023-08-06 22:55:05,601 | Authentication (publickey) successful!
INFO | 2023-08-06 22:55:05,602 | Running command on antbit-ray-cluster: ray start --head
INFO | 2023-08-06 22:55:08,638 | Connected (version 2.0, client OpenSSH_8.2p1)
INFO | 2023-08-06 22:55:09,803 | Authentication (publickey) successful!
2023-08-06 22:55:09,803| ERROR   | Problem setting SSH Forwarder up: Couldn't open tunnel :50052 <> 127.0.0.1:50052 might be in use or destination not reachable
ERROR | 2023-08-06 22:55:09,803 | Problem setting SSH Forwarder up: Couldn't open tunnel :50052 <> 127.0.0.1:50052 might be in use or destination not reachable
INFO | 2023-08-06 22:55:10,517 | Connected (version 2.0, client OpenSSH_8.2p1)
INFO | 2023-08-06 22:55:11,633 | Authentication (publickey) successful!
INFO | 2023-08-06 22:55:11,634 | Checking server antbit-ray-cluster
2023-08-06 22:55:12,462| ERROR   | Secsh channel 0 open FAILED: Connection refused: Connect failed
ERROR | 2023-08-06 22:55:12,462 | Secsh channel 0 open FAILED: Connection refused: Connect failed
2023-08-06 22:55:12,466| ERROR   | Could not establish connection from local ('127.0.0.1', 50053) to remote ('127.0.0.1', 50052) side of the tunnel: open new channel ssh error: ChannelException(2, 'Connect failed')
ERROR | 2023-08-06 22:55:12,466 | Could not establish connection from local ('127.0.0.1', 50053) to remote ('127.0.0.1', 50052) side of the tunnel: open new channel ssh error: ChannelException(2, 'Connect failed')
INFO | 2023-08-06 22:55:12,472 | Server antbit-ray-cluster is up, but the HTTP server may not be up.
INFO | 2023-08-06 22:55:12,472 | Restarting HTTP server on antbit-ray-cluster.
INFO | 2023-08-06 22:55:12,909 | Connected (version 2.0, client OpenSSH_8.2p1)
INFO | 2023-08-06 22:55:13,998 | Authentication (publickey) successful!
INFO | 2023-08-06 22:55:13,999 | Running command on antbit-ray-cluster: pip install runhouse==0.0.10
INFO | 2023-08-06 22:55:17,734 | Connected (version 2.0, client OpenSSH_8.2p1)
INFO | 2023-08-06 22:55:20,347 | Authentication (publickey) successful!
INFO | 2023-08-06 22:55:20,347 | Running command on antbit-ray-cluster: pkill -f "python -m runhouse.servers.http.http_server"
INFO | 2023-08-06 22:55:21,474 | Running command on antbit-ray-cluster: screen -dm bash -c "python -m runhouse.servers.http.http_server |& tee -a '~/.rh/cluster_server_antbit-ray-cluster.log' 2>&1"
INFO | 2023-08-06 22:55:27,196 | Checking server antbit-ray-cluster again [1/5].
2023-08-06 22:55:27,372| ERROR   | Secsh channel 1 open FAILED: Connection refused: Connect failed
ERROR | 2023-08-06 22:55:27,372 | Secsh channel 1 open FAILED: Connection refused: Connect failed
2023-08-06 22:55:27,376| ERROR   | Could not establish connection from local ('127.0.0.1', 50053) to remote ('127.0.0.1', 50052) side of the tunnel: open new channel ssh error: ChannelException(2, 'Connect failed')
ERROR | 2023-08-06 22:55:27,376 | Could not establish connection from local ('127.0.0.1', 50053) to remote ('127.0.0.1', 50052) side of the tunnel: open new channel ssh error: ChannelException(2, 'Connect failed')
INFO | 2023-08-06 22:55:32,385 | Checking server antbit-ray-cluster again [2/5].
2023-08-06 22:55:32,620| ERROR   | Secsh channel 2 open FAILED: Connection refused: Connect failed
ERROR | 2023-08-06 22:55:32,620 | Secsh channel 2 open FAILED: Connection refused: Connect failed
2023-08-06 22:55:32,623| ERROR   | Could not establish connection from local ('127.0.0.1', 50053) to remote ('127.0.0.1', 50052) side of the tunnel: open new channel ssh error: ChannelException(2, 'Connect failed')
ERROR | 2023-08-06 22:55:32,623 | Could not establish connection from local ('127.0.0.1', 50053) to remote ('127.0.0.1', 50052) side of the tunnel: open new channel ssh error: ChannelException(2, 'Connect failed')
INFO | 2023-08-06 22:55:37,633 | Checking server antbit-ray-cluster again [3/5].
2023-08-06 22:55:37,805| ERROR   | Secsh channel 3 open FAILED: Connection refused: Connect failed
ERROR | 2023-08-06 22:55:37,805 | Secsh channel 3 open FAILED: Connection refused: Connect failed
2023-08-06 22:55:37,809| ERROR   | Could not establish connection from local ('127.0.0.1', 50053) to remote ('127.0.0.1', 50052) side of the tunnel: open new channel ssh error: ChannelException(2, 'Connect failed')
ERROR | 2023-08-06 22:55:37,809 | Could not establish connection from local ('127.0.0.1', 50053) to remote ('127.0.0.1', 50052) side of the tunnel: open new channel ssh error: ChannelException(2, 'Connect failed')
INFO | 2023-08-06 22:55:42,820 | Checking server antbit-ray-cluster again [4/5].
2023-08-06 22:55:43,052| ERROR   | Secsh channel 4 open FAILED: Connection refused: Connect failed
ERROR | 2023-08-06 22:55:43,052 | Secsh channel 4 open FAILED: Connection refused: Connect failed
2023-08-06 22:55:43,056| ERROR   | Could not establish connection from local ('127.0.0.1', 50053) to remote ('127.0.0.1', 50052) side of the tunnel: open new channel ssh error: ChannelException(2, 'Connect failed')
ERROR | 2023-08-06 22:55:43,056 | Could not establish connection from local ('127.0.0.1', 50053) to remote ('127.0.0.1', 50052) side of the tunnel: open new channel ssh error: ChannelException(2, 'Connect failed')
INFO | 2023-08-06 22:55:48,062 | Checking server antbit-ray-cluster again [5/5].
2023-08-06 22:55:48,303| ERROR   | Secsh channel 5 open FAILED: Connection refused: Connect failed
ERROR | 2023-08-06 22:55:48,303 | Secsh channel 5 open FAILED: Connection refused: Connect failed
2023-08-06 22:55:48,307| ERROR   | Could not establish connection from local ('127.0.0.1', 50053) to remote ('127.0.0.1', 50052) side of the tunnel: open new channel ssh error: ChannelException(2, 'Connect failed')
ERROR | 2023-08-06 22:55:48,307 | Could not establish connection from local ('127.0.0.1', 50053) to remote ('127.0.0.1', 50052) side of the tunnel: open new channel ssh error: ChannelException(2, 'Connect failed')
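
"Connection refused" on the remote side of the tunnel means nothing accepted a TCP connection on port 50052 on the cluster, which fits the server process never having started. A minimal sketch for checking this directly on the remote machine, using only the standard library (`port_is_open` is a hypothetical helper name, not part of Runhouse):

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if something accepts TCP connections on host:port."""
    try:
        # create_connection raises OSError (e.g. ConnectionRefusedError)
        # when nothing is listening or the attempt times out.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # 50052 is the remote port that appears in the tunnel errors above.
    print(port_is_open("127.0.0.1", 50052))
```

If this prints False when run on the cluster itself, the problem is that the HTTP server is not running there, not the SSH tunnel.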

@dongreenberg please let me know if you agree with my hypothesis. I would be happy to raise a PR to improve the error message for this. Thanks for building this awesome piece of software :smile:

dongreenberg commented 1 year ago

Hey @Ankit-Dhankhar , that's a good catch and I agree with your hypothesis. We should definitely amend it to python3 or even find the interpreter path. A PR would be super helpful. Are you thinking of switching it to python3?

Ankit-Dhankhar commented 1 year ago

Hi @dongreenberg , I'm considering implementing a hot fix in the following manner:

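import shutil

# Candidate interpreter commands, tried in order
possible_interpreters = ['python', 'python3']

for interpreter in possible_interpreters:
    executable_path = shutil.which(interpreter)
    if executable_path:
        # Execute runhouse.servers.http.http_server using the selected Python interpreter
        break
else:
    # No candidate was found on the PATH
    raise RuntimeError("Deployment failed: neither python nor python3 is accessible")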

If neither interpreter is found, an exception is raised stating that deployment failed because no Python interpreter is reachable via the python or python3 command. This should give users a clearer picture of why the failure occurred.

rdabane commented 1 year ago

@Ankit-Dhankhar , Thank you for the tip but it did not work in my case.

Here is what I get:

INFO | 2023-08-07 14:31:26,788 | No auth token provided, so not using RNS API to save and load configs
2023-08-07 14:31:27,469| INF | MainThrea/1060@sshtunnel | 2 keys loaded from agent
INFO | 2023-08-07 14:31:27,469 | 2 keys loaded from agent
2023-08-07 14:31:27,469| INF | MainThrea/1117@sshtunnel | 2 key(s) loaded
INFO | 2023-08-07 14:31:27,469 | 2 key(s) loaded
2023-08-07 14:31:27,470| ERR | MainThrea/1314@sshtunnel | Password is required for key /export/lab/.ssh/mlw01.key
ERROR | 2023-08-07 14:31:27,470 | Password is required for key /export/lab/.ssh/mlw01.key
2023-08-07 14:31:27,470| INF | MainThrea/0978@sshtunnel | Connecting to gateway: 172.17.10.110:22 as user 'lab'
INFO | 2023-08-07 14:31:27,470 | Connecting to gateway: 172.17.10.110:22 as user 'lab'
2023-08-07 14:31:27,470| DEB | MainThrea/0983@sshtunnel | Concurrent connections allowed: True
2023-08-07 14:31:27,470| DEB | MainThrea/1400@sshtunnel | Trying to log in with key: b'a79afb48fad738bfb80ee026219dcdea'
2023-08-07 14:31:27,606| DEB | MainThrea/1204@sshtunnel | Transport socket info: (<AddressFamily.AF_INET: 2>, <SocketKind.SOCK_STREAM: 1>, 0), timeout=0.1
2023-08-07 14:31:27,634| INF | Thread-1/1893@transport | Connected (version 2.0, client OpenSSH_7.6p1)
INFO | 2023-08-07 14:31:27,634 | Connected (version 2.0, client OpenSSH_7.6p1)
2023-08-07 14:31:28,070| INF | Thread-1/1893@transport | Authentication (publickey) failed.
INFO | 2023-08-07 14:31:28,070 | Authentication (publickey) failed.
2023-08-07 14:31:28,071| DEB | MainThrea/1410@sshtunnel | Authentication error
2023-08-07 14:31:28,071| WAR | MainThrea/1450@sshtunnel | Tunnels are not started. Please .start() first!
WARNING | 2023-08-07 14:31:28,071 | Tunnels are not started. Please .start() first!
2023-08-07 14:31:28,071| INF | MainThrea/1474@sshtunnel | Closing ssh transport
INFO | 2023-08-07 14:31:28,071 | Closing ssh transport
2023-08-07 14:31:28,071| DEB | MainThrea/1477@sshtunnel | Transport is closed
2023-08-07 14:31:28,072| DEB | MainThrea/1400@sshtunnel | Trying to log in with key: b'463095aa1803da78647cd548f37173ef'
2023-08-07 14:31:28,209| DEB | MainThrea/1204@sshtunnel | Transport socket info: (<AddressFamily.AF_INET: 2>, <SocketKind.SOCK_STREAM: 1>, 0), timeout=0.1
2023-08-07 14:31:28,240| INF | Thread-3/1893@transport | Connected (version 2.0, client OpenSSH_7.6p1)
INFO | 2023-08-07 14:31:28,240 | Connected (version 2.0, client OpenSSH_7.6p1)
2023-08-07 14:31:32,198| INF | Thread-3/1893@transport | Authentication (publickey) successful!
INFO | 2023-08-07 14:31:32,198 | Authentication (publickey) successful!
2023-08-07 14:31:32,200| INF | Srv-50052/1433@sshtunnel | Opening tunnel: 0.0.0.0:50052 <> 127.0.0.1:50052
INFO | 2023-08-07 14:31:32,200 | Opening tunnel: 0.0.0.0:50052 <> 127.0.0.1:50052
INFO | 2023-08-07 14:31:32,200 | Checking server mlw-cluster
2023-08-07 14:31:32,713| ERR | Thread-3/1893@transport | Secsh channel 0 open FAILED: Connection refused: Connect failed
ERROR | 2023-08-07 14:31:32,713 | Secsh channel 0 open FAILED: Connection refused: Connect failed
2023-08-07 14:31:32,713| TRA | Thread-5 /0357@sshtunnel | #1 <-- ('127.0.0.1', 36196) open new channel ssh error: ChannelException(2, 'Connect failed')
2023-08-07 14:31:32,714| ERR | Thread-5 /0394@sshtunnel | Could not establish connection from local ('127.0.0.1', 50052) to remote ('127.0.0.1', 50052) side of the tunnel: open new channel ssh error: ChannelException(2, 'Connect failed')
ERROR | 2023-08-07 14:31:32,714 | Could not establish connection from local ('127.0.0.1', 50052) to remote ('127.0.0.1', 50052) side of the tunnel: open new channel ssh error: ChannelException(2, 'Connect failed')
INFO | 2023-08-07 14:31:32,714 | Server mlw-cluster is up, but the HTTP server may not be up.
INFO | 2023-08-07 14:31:32,715 | Restarting HTTP server on mlw-cluster.
INFO | 2023-08-07 14:31:32,715 | Running command on mlw-cluster: pkill -f "python -m runhouse.servers.http.http_server"
Warning: Permanently added '172.17.10.110' (ED25519) to the list of known hosts.
INFO | 2023-08-07 14:31:33,912 | Running command on mlw-cluster: screen -dm bash -c 'python -m runhouse.servers.http.http_server |& tee -a ~/.rh/cluster_server_mlw-cluster.log 2>&1'
INFO | 2023-08-07 14:31:39,627 | Checking server mlw-cluster again.
2023-08-07 14:31:39,706| ERR | Thread-3/1893@transport | Secsh channel 1 open FAILED: Connection refused: Connect failed
ERROR | 2023-08-07 14:31:39,706 | Secsh channel 1 open FAILED: Connection refused: Connect failed
2023-08-07 14:31:39,706| TRA | Thread-14/0357@sshtunnel | #2 <-- ('127.0.0.1', 35038) open new channel ssh error: ChannelException(2, 'Connect failed')
2023-08-07 14:31:39,707| ERR | Thread-14/0394@sshtunnel | Could not establish connection from local ('127.0.0.1', 50052) to remote ('127.0.0.1', 50052) side of the tunnel: open new channel ssh error: ChannelException(2, 'Connect failed')
ERROR | 2023-08-07 14:31:39,707 | Could not establish connection from local ('127.0.0.1', 50052) to remote ('127.0.0.1', 50052) side of the tunnel: open new channel ssh error: ChannelException(2, 'Connect failed')
Traceback (most recent call last):
  File "/export/lab/anaconda3/envs/runhouse/lib/python3.10/site-packages/urllib3/connectionpool.py", line 714, in urlopen
    httplib_response = self._make_request(
  File "/export/lab/anaconda3/envs/runhouse/lib/python3.10/site-packages/urllib3/connectionpool.py", line 466, in _make_request
    six.raise_from(e, None)
  File "", line 3, in raise_from
  File "/export/lab/anaconda3/envs/runhouse/lib/python3.10/site-packages/urllib3/connectionpool.py", line 461, in _make_request
    httplib_response = conn.getresponse()
  File "/export/lab/anaconda3/envs/runhouse/lib/python3.10/http/client.py", line 1375, in getresponse
    response.begin()
  File "/export/lab/anaconda3/envs/runhouse/lib/python3.10/http/client.py", line 318, in begin
    version, status, reason = self._read_status()
  File "/export/lab/anaconda3/envs/runhouse/lib/python3.10/http/client.py", line 287, in _read_status
    raise RemoteDisconnected("Remote end closed connection without"
http.client.RemoteDisconnected: Remote end closed connection without response

During handling of the above exception, another exception occurred:

dongreenberg commented 1 year ago

Hey @Ankit-Dhankhar , that sounds like a solid approach, but I'll note that the python -m line you're referring to is generated on the user's local box but run remotely, so calling shutil.which locally wouldn't tell us anything about the remote interpreter. Maybe you can add it inside the runhouse start command in main.py, and then change the usage of "python -m runhouse.servers.http.http_server" in cluster.py to run "runhouse start" instead?
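
One way to read this suggestion: once the launch happens inside a command that is already executing on the cluster, the interpreter does not need to be looked up by name at all, because the running process knows its own path via sys.executable. A rough sketch under that assumption (`build_server_command` is a hypothetical helper, not an existing Runhouse function; only the module path comes from the logs in this thread):

```python
import sys
from typing import List

# Module path as it appears in the logged command in this thread.
SERVER_MODULE = "runhouse.servers.http.http_server"


def build_server_command(module: str = SERVER_MODULE) -> List[str]:
    """Hypothetical helper: argv that runs `module` with the current interpreter."""
    if not sys.executable:
        # sys.executable can be empty or None if Python cannot determine its path.
        raise RuntimeError("Cannot determine the path of the running Python interpreter")
    # Using the absolute interpreter path avoids relying on a `python`
    # or `python3` alias being present on the remote PATH.
    return [sys.executable, "-m", module]
```

The returned list could then be passed to subprocess.Popen from within the start command, so the server runs under the same environment that has runhouse installed.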