run-house / runhouse

Like PyTorch for building ML systems. Iterable, debuggable, multi-cloud, 100% reproducible across research and production.
https://run.house
Apache License 2.0

SSH ProxyCommand support #84

Open gopitk opened 11 months ago

gopitk commented 11 months ago

I am trying Runhouse with a local pre-configured server, but that server requires the "ProxyCommand" option to SSH into it. Is there a way to specify the ProxyCommand in the Cluster API (for example, in the ssh_creds dict)?

Typical way to SSH into the server is something like this:

ssh -i <keyfile> -o ProxyCommand="ssh -W %h:%p <user>@<frontendproxyhost>" <user>@<targethost>

I do have a workaround, which is to add the ProxyCommand in ~/.ssh/config, but it would be nice to specify it as a parameter in the rh.cluster API for cases where the SSH command is a bit dynamic (as in my case).

jlewitt1 commented 11 months ago

Thanks for bringing this up! This is something we should definitely support. Would adding a separate parameter for the proxy command work for your use case?

https://github.com/run-house/runhouse/commit/0dc953e6e89e5856e2c73b67e80605e418113e8f

dongreenberg commented 11 months ago

We actually already support this, but haven't explicitly documented it. Can you try adding "ssh_proxy_command": "{proxy string}" to the ssh_creds dictionary?
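A minimal sketch of what such a dict might look like. Only the "ssh_proxy_command" key comes from this thread; the "ssh_user" and "ssh_private_key" keys, the helper name build_ssh_creds, and the host names are assumptions for illustration. The rh.cluster call is left as a comment because it needs Runhouse installed and a reachable server.

```python
def build_ssh_creds(user, key_path, proxy_user, proxy_host):
    """Return an ssh_creds-style dict that includes a ProxyCommand string.

    "ssh_proxy_command" is the key suggested in this thread; "ssh_user" and
    "ssh_private_key" are assumptions about the rest of the dict layout.
    """
    # %h and %p are left unexpanded on purpose: the ssh client normally
    # substitutes the target host and port when it runs the proxy command.
    proxy_command = f"ssh -W %h:%p {proxy_user}@{proxy_host}"
    return {
        "ssh_user": user,
        "ssh_private_key": key_path,
        "ssh_proxy_command": proxy_command,
    }


# Hypothetical user, key path, and proxy host.
creds = build_ssh_creds("ubuntu", "~/.ssh/id_rsa", "jump", "proxy.example.com")
print(creds["ssh_proxy_command"])

# Hypothetical usage (not executed here):
# import runhouse as rh
# gpu = rh.cluster(name="myvm", ips=["target-host"], ssh_creds=creds)
```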

One caveat - if you're using folder objects (or blob or table, which depend on folder), we haven't yet added this, but are actually in the process of significantly expanding our SSH flexibility, and likely will release it within the week.

gopitk commented 11 months ago

Thanks for the tips @jlewitt1 and @dongreenberg. I tried adding "ssh_proxy_command" to the ssh_creds dict. It seems to work (in terms of seeing an SSH connection), but then it threw an exception when RH tried to check connectivity again. This is the exception. I don't see this when I have the ProxyCommand in my ~/.ssh/config. BTW, I have a couple of other SSH options (-o) in the config file that I had no way to pass in through the dict.

INFO | 2023-07-12 13:23:29,522 | Checking server myvm again.
---------------------------------------------------------------------------
BaseSSHTunnelForwarderError               Traceback (most recent call last)
File ~/miniconda3/envs/rh/lib/python3.9/site-packages/runhouse/rns/hardware/cluster.py:357, in Cluster.check_server(self, restart_server)
    356 try:
--> 357     self.connect_server_client()
    358     cluster_config = self.config_for_rns

File ~/miniconda3/envs/rh/lib/python3.9/site-packages/runhouse/rns/hardware/cluster.py:324, in Cluster.connect_server_client(self, tunnel, force_reconnect)
    323 else:
--> 324     self._rpc_tunnel, connected_port = self.ssh_tunnel(
    325         HTTPClient.DEFAULT_PORT,
    326         remote_port=DEFAULT_SERVER_PORT,
    327         num_ports_to_try=5,
    328     )
    329 open_cluster_tunnels[self.address] = (
    330     self._rpc_tunnel,
    331     connected_port,
    332     tunnel_refcount + 1,
    333 )

AttributeError                            Traceback (most recent call last)

gpu = rh.cluster(....)
File ~/miniconda3/envs/rh/lib/python3.9/site-packages/runhouse/rns/hardware/cluster_factory.py:59, in cluster(name, ips, ssh_creds, dryrun, **kwargs)
     50 if {"instance_type", "num_instances", "provider"} <= kwargs.keys():
     51     # Commenting out for now. If two creation paths creates confusion let's push people to use
     52     # ondemand_cluster() instead.
   (...)
     55     #     "If you would like to create an on-demand cluster, please use `rh.ondemand_cluster()` instead."
     56     # )
     57     return ondemand_cluster(name=name, **kwargs)
---> 59 return Cluster(ips=ips, ssh_creds=ssh_creds, name=name, dryrun=dryrun)

File ~/miniconda3/envs/rh/lib/python3.9/site-packages/runhouse/rns/hardware/cluster.py:58, in Cluster.__init__(self, name, ips, ssh_creds, dryrun, **kwargs)
     55 self.client = None
     57 if not dryrun and self.address:
---> 58     self.check_server()
     59     # OnDemandCluster will start ray itself, but will also set address later, so won't reach here.
     60     self.start_ray()

File ~/miniconda3/envs/rh/lib/python3.9/site-packages/runhouse/rns/hardware/cluster.py:379, in Cluster.check_server(self, restart_server)
    377     self.restart_server(resync_rh=False)
    378     logger.info(f"Checking server {self.name} again.")
--> 379     self.client.check_server(cluster_config=cluster_config)
    380 else:
    381     raise ValueError(f"Could not connect to cluster <{self.name}>")

AttributeError: 'NoneType' object has no attribute 'check_server'

dongreenberg commented 11 months ago

Oh great point, we aren't passing the proxy into the tunnel. I can patch that and the options up shortly. Out of curiosity, you said this all works (through to running the remote function) when you've provided the options and proxy info in your SSH config?

gopitk commented 11 months ago

Actually, the remote function did not work with my ~/.ssh/config either. It hung for a long time. What did work was the setup of the functions (such as the pip installs) and cluster.run_python.

dongreenberg commented 11 months ago

Got it. I've been banging on this and the tunneling library we use (ironically to handle different credentials scenarios....) doesn't support proxies nicely (as in, it looks like it does, but I spent hours debugging and it still wouldn't proxy correctly despite working from the command line). I've implemented a fix (https://github.com/run-house/runhouse/pull/85) that goes directly through the command line to remove that discrepancy, but I'll want to test it a bit further before releasing because it's a core execution path. Adding more ssh options is straightforward and I'll push that shortly too. If you're blocked and would like to give it a try in the meantime, please feel free: pip install git+https://github.com/run-house/runhouse.git@proxy_tunneling
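As a rough illustration of the "go through the command line" approach, the sketch below builds an `ssh -L` port-forwarding command as an argument list. It is not the code from the PR; the function name build_tunnel_command, the host names, the key path, and the port numbers are all hypothetical. Passing the ProxyCommand as a single argv element avoids any shell quoting of the proxy string, and shlex.join produces a copy-pasteable line that can be logged or run by hand for comparison.

```python
import shlex


def build_tunnel_command(host, user, key_path, local_port, remote_port,
                         proxy_command=None, extra_options=None):
    """Build an `ssh -L` port-forwarding command as an argument list."""
    # -N: do not run a remote command, only forward the port.
    cmd = ["ssh", "-i", key_path, "-N",
           "-L", f"{local_port}:localhost:{remote_port}"]
    if proxy_command:
        # Kept as one argv element so no shell ever splits the proxy string.
        cmd += ["-o", f"ProxyCommand={proxy_command}"]
    # Any additional `-o Key=Value` options, e.g. StrictHostKeyChecking.
    for key, value in (extra_options or {}).items():
        cmd += ["-o", f"{key}={value}"]
    cmd.append(f"{user}@{host}")
    return cmd


cmd = build_tunnel_command(
    "target-host", "ubuntu", "/home/user/.ssh/id_rsa", 8000, 8000,
    proxy_command="ssh -W %h:%p jump@proxy.example.com",
    extra_options={"StrictHostKeyChecking": "no"},
)
# Quoted form, suitable for logging or pasting into a terminal.
print(shlex.join(cmd))
# To open the tunnel for real: subprocess.Popen(cmd), without shell=True.
```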

gopitk commented 11 months ago

Thanks @dongreenberg for the quick fix. I think there are still some issues after switching to the proxy_tunneling branch. The good news is that the e2e run (the Stable Diffusion tutorial) works fine with remote functions etc. when I use ~/.ssh/config to specify my ProxyCommand.

However, when I use the ssh_proxy_command dict item to pass that info, I get an error in how the ssh/bash command is constructed. It seems to be looking for ssh in my current directory.

/bin/bash: /home/user/runhouse/tutorials/t01_Stable_Diffusion/ssh -i ~/.ssh/id_rsa -W \:\ -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null \: No such file or directory

Then I see the remote pkill and start of the runhouse http server go thru fine. This is followed by a connection refused error.

INFO | 2023-07-13 13:11:47,807 | Checking server myvm again.

ConnectionRefusedError                    Traceback (most recent call last)
File ~/miniconda3/envs/rh/lib/python3.9/site-packages/urllib3/connection.py:174, in HTTPConnection._new_conn(self)
    173 try:
--> 174     conn = connection.create_connection(
    175         (self._dns_host, self.port), self.timeout, **extra_kw
    176     )
    178 except SocketTimeout:

dongreenberg commented 11 months ago

Oh, that is interesting. Glad to hear it works with the .ssh/config, and I appreciate you helping us through the dict case too. I think I spotted the error and just pushed a fix to the branch. It runs through on my side, but I've only set up a phony jumpbox to test it, so it's really helpful that you've tried it on yours.

gopitk commented 11 months ago

My environment is a bit custom (not sure how common it is). The target hostname (which I pass in the ips field) is somewhat dynamic: it follows a pattern that I specify in the .ssh/config, but it won't resolve to any known IP address locally on my client machine. It is only meaningful to the proxy host, which has a way to resolve these dynamic target hostnames and route them to my target server. As a result, if I don't use the ~/.ssh/config and let Runhouse use the ssh_proxy_command, my proxy host somehow fails to resolve the target host passed in the -W option of the ProxyCommand and returns a "Could not resolve IP address for : Name or service not known".

The SSH client I run from the command line passes the dynamic hostname to the ProxyCommand (as I have -W %h:%p in the proxy command), and in that case I don't see my proxy failing to resolve the target.
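To make the `-W %h:%p` point concrete, here is a small sketch of the token substitution an ssh client performs before running a ProxyCommand. It is only an illustration, not how OpenSSH or Runhouse implements it: the function name expand_proxy_command and the host names are hypothetical, and only the %h (host), %p (port), %r (remote user), and %% (literal percent) tokens are handled. If a caller runs the proxy string without this step, or expands it with an empty host, the proxy would receive an unusable target.

```python
def expand_proxy_command(template, host, port, user):
    """Expand the %h, %p, %r and %% tokens in a ProxyCommand template."""
    tokens = {"h": host, "p": str(port), "r": user, "%": "%"}
    out = []
    i = 0
    while i < len(template):
        ch = template[i]
        # A '%' followed by a known token letter is replaced; anything
        # else is copied through unchanged.
        if ch == "%" and i + 1 < len(template) and template[i + 1] in tokens:
            out.append(tokens[template[i + 1]])
            i += 2
        else:
            out.append(ch)
            i += 1
    return "".join(out)


# Hypothetical dynamic hostname that only the proxy host can resolve.
expanded = expand_proxy_command(
    "ssh -W %h:%p jump@proxy.example.com", "dyn-node-17", 22, "ubuntu"
)
print(expanded)  # ssh -W dyn-node-17:22 jump@proxy.example.com
```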

For now, I can use the ProxyCommand in ~/.ssh/config, which seems to be working great for me with Runhouse, as I was able to run several of the tutorials remotely. I am happy to add some logging to Runhouse locally to see how it constructs the full SSH commands, so I can diff my .ssh/config setup against passing ssh_proxy_command in rh.cluster. Where can I find that code so I can debug it in my environment and isolate the problem further?

dongreenberg commented 11 months ago

Good point, we should log the SSH commands, and I'm curious why they wouldn't be resolving the same way as through the command line. I have a gnarly commit in the works on a separate branch that I'll land shortly, and then add that logging and push to this branch. In general, would you say it's preferable for our ssh activity to run through the command line so it's consistent with whatever you know you can do directly, rather than use tools which depend on Python SSH tools (e.g. Paramiko, asyncssh)?