Hi, @jieru-hu ! Thanks for the feature request! Can I ask you a couple questions?
- What's the use case of this feature?
- Is there any reason why you prefer this way to ray submit?
- This can actually impact performance of Ray because depending on contexts, there are several IPCs to execute tasks remotely. Would your use case be able to tolerate the performance degradation?
This is another ask for Ray client @anabranch
Hi, @jieru-hu ! Thanks for the feature request! Can I ask you a couple questions?
Hi @rkooo567 so sorry for the late response!
- What's the use case of this feature?
Ray autoscaler is great, but it requires us to sync our code manually and then call ray.init on the cluster. Being able to call ray.init from a local machine seems like a much better experience.
- Is there any reason why you prefer this way to ray submit?
This kind of aligns with the answer above. Right now I'm using the autoscaler CLI in my code, but I'm hoping Ray could provide a more code-friendly interface for interacting with a remote cluster.
- This can actually impact performance of Ray because depending on contexts, there are several IPCs to execute tasks remotely. Would your use case be able to tolerate the performance degradation?
In my use case, the machine that submits the job to Ray (by calling ray.init()) does not do the compute; all the computation is done remotely on the cluster. In this scenario, will the performance still be impacted?
Let me know if I understood you correctly! So you are saying you want to call ray.init() on your local machine, and it will only run function.remote() (which is compute heavy) on the Ray cluster. Am I correct? So you want to use it like a task queue?
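For concreteness, the pattern being described looks roughly like this (the address and function are placeholders; it assumes the laptop can actually reach the cluster, which is the hard part discussed below):

import ray

# Connect the local driver to the remote cluster (placeholder address).
ray.init(address="<head-node-ip>:6379")

@ray.remote
def heavy_task(x):
    # Compute-heavy work; runs on the cluster's workers, not on the laptop.
    return x * x

futures = [heavy_task.remote(i) for i in range(100)]
results = ray.get(futures)  # Only the results come back to the laptop.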
Please have a look at this. It would be really useful and streamline many use cases.
@Vysybyl it might take some time to have a fully working version of this feature, but we are gradually moving towards supporting it!
Thanks for the quick reply, I'm looking forward to it! (In my case I'd like to scale up some interactive computation on Jupyter while keeping the code and data files on my local machine. I'm more comfortable NOT copying these files to the remote server's hard drive, but rather sending only data that will be kept in memory.) I have poked around GH issues trying several workarounds, but none worked. I think I'll set up a VPN for the time being and use my local machine as a 0-resource node.
Maybe I can help with this task? If I understand correctly, the main issue is that Ray uses many services on different ports (including random ones) on each node, and nodes must be able to communicate with each other on many of these ports, which is unlikely to happen if the machines are not on the same LAN or VPN. Probably the solution would be to have all communication between the client (driver) and the head node happen via a single endpoint (which could be opened via SSH port forwarding, for instance) so that the head node acts like a proxy.
Am I correct? Is there any way I can contribute to this?
Thanks
I think I'll set up a VPN for the time being and use my local machine as a 0-resource node.
Yes. I guess it is the best solution for now. The prototype we are currently working on will also work only when bi-directional communication is allowed between your machine and Ray clusters.
head node acts like a proxy.
The biggest problem here is that Ray is strongly tied to the model where it runs fully within a cluster. That is, Ray workers need bi-directional communication (meaning the node with a worker should be able to receive requests from other nodes). So what you mentioned could solve some parts of the problem, but the fundamental problem still remains.
We are also planning to have a public roadmap in a few months, so you can definitely check the progress in the near future!
Hi! I was able to set up a Ray cluster on GCS and connect to it directly from my local Python process (Jupyter or shell) without SSH-ing into the remote machine.
In short:
I may post a more detailed recipe here if that would be interesting to anyone
Cheers
I'd love to see the proof of concept for this! Do you have that recipe that you could share?
Cc @barakmich
Here's the recipe:
I followed the instructions here. I have modified a few lines in the original yaml file, namely:
a) Installing the latest version from PyPI (which is what I use locally), not the wheel from AWS S3
pip install ray
b) Adding a rayvpn network tag to the head node that will later be used to match a firewall rule
...
head_node:
    machineType: n1-standard-2
    tags:
      items:
        # This tag is used to allow VPN connection (it matches a GCS Firewall rule)
        - rayvpn
...
After calling ray up, please be sure to take note of the head node IP and Redis password from the console logs.
Create a Firewall Rule on GCS as explained here. Go straight to the "Firewall Rules" part.
SSH to the head node to install OpenVPN (depending on your gcloud settings, you may need to specify the zone, etc.).
export RAY_HEAD=<HEAD_VM_HOSTNAME>
gcloud compute ssh ubuntu@$RAY_HEAD
sudo su
cd /home/ubuntu
curl -O https://raw.githubusercontent.com/angristan/openvpn-install/master/openvpn-install.sh
chmod +x openvpn-install.sh
AUTO_INSTALL=y APPROVE_IP=y PROTOCOL_CHOICE=2 PORT_CHOICE=1 DNS=3 CLIENT=openvpn-client ./openvpn-install.sh
exit
exit
Copy the configuration file to your local machine and start the OpenVPN session (the OpenVPN client must be installed)
gcloud compute scp ubuntu@$RAY_HEAD:/home/ubuntu/openvpn-client.ovpn ~/Downloads/
sudo openvpn --config ~/Downloads/openvpn-client.ovpn
Start a Ray node on your local computer with 0 resources (or more, if you wish)
ray start --address='<HEAD_NODE_IP>:6379' --redis-password='<REDIS_PASSWORD>' --num-cpus=0
Start the Ray API in Python
import ray
ray.init(address='auto', redis_password='<REDIS_PASSWORD>')
ray.nodes() # To confirm that you are actually using the cluster
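To double-check that tasks are really executed on the cluster nodes rather than locally, a small sketch (assumes the connection above succeeded; the 0-resource local node should not run any of these tasks):

import ray

@ray.remote
def where_am_i():
    # Returns the hostname of the node that actually ran this task.
    import socket
    return socket.gethostname()

print(ray.get([where_am_i.remote() for _ in range(4)]))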
I was NOT able to connect to the dashboard. If anybody can fix this, I'd appreciate it!
Enjoy
This issue was interesting, as I was trying to achieve something similar. I'd like to containerize Ray in such a way that the actual cluster is independent from my web server. Right now I'm starting a head node in the same container as a Flask web app using ray start --head --num-cpus=0, in order to make sure that this node won't execute any actual work, but it feels like a workaround.
I guess there's no way, for now, to connect to a remote cluster and call remote functions like a task queue (e.g. celery style)?
@Vysybyl tried your solution, and it worked! To access the dashboard, you can either do SSH port forwarding, ssh -L 8266:localhost:8266 <user>@<remote-head-ip>, and then access the dashboard at http://localhost:8266 locally.
Or use --dashboard-host=0.0.0.0 on the head node when you start Ray; then you can access the dashboard at http://<remote-head-openvpn-ip>:8266.
I am also interested in using Ray as a task queue to transparently offload heavy computations from local machines (e.g., laptops) to a server/cluster on the same network. I planned to do this as mentioned above, by connecting the laptops to the cluster by running ray start with --num-cpus=0 to ensure that no processing runs on the laptops.
My main doubt is: could the laptops run Ray on Windows, and the remote server (i.e., the cluster head) run Ray on Linux? Or are the two not compatible with each other? I could not find this in the documentation and cannot test it myself right now.
In case you want more details on my use case:
I develop an app which currently uses ray internally for parallel computing (heavy computations on large numpy arrays) on the local computer. I chose ray so that it can also scale well to large workstations and clusters.
This app has a GUI which is used by users without much technical knowledge. I would like to have a ray cluster running on the same network, and allow the users running my app on their laptops to offload their computations to this cluster as easily as possible. Ideally ray
would also handle the available hardware resources when several users are submitting tasks at the same time (but this is not a must).
I planned to try something similar to what was proposed above. I would be interested in feedback on whether this is feasible and whether it is a good idea:
1. Run ray start --address=ip --redis-password=xxxx --num-cpus=0 in the background. This ensures that no processing runs on the laptops, and all is offloaded to the cluster.
2. Call ray.init(address=ip, _redis_password=xxxx) to connect to the ray cluster (see the sketch below).
Thanks!
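For what it's worth, a rough sketch of what step 2 could look like from the app's side (the function and data here are made up for illustration; depending on your Ray version the keyword is redis_password or _redis_password):

import numpy as np
import ray

# Connect to the already-running cluster; the laptop node has 0 CPUs,
# so the tasks below are scheduled on the cluster's nodes.
ray.init(address="<head-node-ip>:6379", _redis_password="xxxx")

@ray.remote
def process_chunk(chunk):
    # Placeholder for the app's heavy numpy computation.
    return chunk.sum()

data = np.random.rand(8, 1_000_000)
results = ray.get([process_chunk.remote(chunk) for chunk in data])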
I'm also interested in this feature, to run remote functions from my laptop on an HPC cluster. I want to do this interactively from a Jupyter notebook. I don't have root access on the cluster, and a VPN is not an option.
I've already set up such a solution with dask.distributed. Two tunnels (one for the central scheduler and one for the dashboard) suffice to make it work. This is convenient, but I wanted to test the performance with Ray.
Hi guys,
In order to do what @Vysybyl proposed, the local machine has to have the exact same Ray and Python versions as installed on the remote cluster, otherwise it would throw a version mismatch exception. Is there an easy way to make it so that my local machine can connect to arbitrary Ray clusters (regardless of the Ray and Python minor versions)?
You can try Ray client, which is a bit more tolerant of Ray version skew: https://github.com/ray-project/ray/issues/13324
However, Python version still has to match due to serialization issues.
Oh that is perfect! Thanks for sharing the good news :)!
As Eric mentioned, this should be addressed with https://docs.ray.io/en/master/cluster/ray-client.html
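For reference, a connection via Ray Client looks roughly like this with a recent Ray version (older versions used ray.util.connect instead; 10001 is the default Ray Client server port, and the host is whatever address of the head node is reachable from your machine):

import ray

# Connect through the Ray Client server running on the head node.
ray.init("ray://<head-node-ip>:10001")

@ray.remote
def f(x):
    return x + 1

print(ray.get(f.remote(41)))  # Runs on the cluster; only the result comes back.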
To save anybody coming across this some time in setting up, I'll share my findings. Example with region us-west-2 and port 6379 (the default in the yaml examples):
ray up <your-ray-file>.yaml # extract public-ip and redis-password from output here
aws ec2 describe-instances --query 'Reservations[*].Instances[*].SecurityGroups[*]' --output text --region us-west-2
aws ec2 authorize-security-group-ingress --group-id '<sg-xxxxxxxxxxx>' --protocol tcp --port 6379 --cidr 0.0.0.0/0 --region us-west-2
ray start --address='<public-ip>:6379' --redis-password='<redis-password>'
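Once the local node has joined, the driver connects as usual (a quick sketch; depending on your Ray version the keyword is redis_password or _redis_password):

import ray

ray.init(address="auto", _redis_password="<redis-password>")
print(ray.nodes())  # Both the local machine and the AWS nodes should show up.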
To do in 2022 what H2O could do in 2016:) And in R too...
What's the use case of this [remote API server connections] feature?
And to become compatible with the modern microservices architecture, where the client is running in a separate container from the server, with its own IP and its own app (e.g. Jupyter Notebook).
Describe your feature request
It would be great if we can connect to a remote cluster from a local machine directly using ray.init(). For example, say the IP of the ray head node is 52.13.54.127 for my AWS cluster, I want to be able to call ray.init(address="52.13.54.127") and run remote functions from my laptop.
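For illustration, the requested workflow would look something like this (the IP is the example from above; today this requires one of the workarounds or the Ray Client approach discussed in this thread):

import ray

# Desired: connect straight from the laptop to the remote head node.
ray.init(address="52.13.54.127")

@ray.remote
def train():
    # The heavy work would run on the AWS cluster, not on the laptop.
    return "done"

print(ray.get(train.remote()))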