A use case of the Ray Cluster Launcher for deploying a Ray cluster on on-premise servers.
This project assumes that your client machine (e.g., a personal laptop) has SSH access to two or more on-premise servers.
The on-premise servers must have the following setup:
Clone this repository
git clone https://github.com/jacksonjacobs1/ray-cluster-launcher.git
Change directory to the repository
cd ray-cluster-launcher
Create a virtual environment and activate it
python3 -m venv venv
source venv/bin/activate
Install the dependencies
pip install -r requirements.txt
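To sanity-check the installation, you can confirm that the Ray CLI is available inside the virtual environment (the version shown will depend on requirements.txt):
ray --version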
Cluster setup is a user-level procedure, as opposed to a system-level procedure: each user must set up their cluster individually. The client machine is used to launch the cluster, so passwordless SSH login must be enabled between the client machine and all cluster nodes.
This can be done by generating a public/private key pair on the client machine and copying the public key to the cluster nodes. See here for more details. Here is an example of how this may be done:
On the client machine, generate a public/private key pair and press Enter through the prompts. It is important to leave the passphrase blank.
ssh-keygen -t rsa -b 4096 -C "<insert-key-identifier-here>"
Copy the public key to each node in the cluster.
ssh-copy-id -i ~/.ssh/id_rsa.pub <head-node-username>@<head-node-ip-address>
ssh-copy-id -i ~/.ssh/id_rsa.pub <worker-node-username>@<worker-node-ip-address>
Test that passwordless login works.
ssh -i ~/.ssh/id_rsa <head-node-username>@<head-node-ip-address>
The cluster configuration is defined in the local_cluster_config.yaml file. See here for more configuration examples.
You will need to modify the following parameters for your own on-premise cluster (a sketch of how they fit together follows the list):
head_ip: <ip-or-hostname-of-head-node>
worker_ips: [<ip-or-hostname-of-worker-node-1>, <ip-or-hostname-of-worker-node-2>, ...]
ssh_user: <username>
min_workers: <number of workers>
max_workers: <number of workers>
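For orientation, the snippet below is a minimal sketch of how these parameters are typically nested in a Ray local-provider config; your local_cluster_config.yaml may contain additional fields (for example a docker section), and min_workers and max_workers are usually both set to the number of worker IPs:
cluster_name: default
provider:
    type: local
    head_ip: <ip-or-hostname-of-head-node>
    worker_ips: [<ip-or-hostname-of-worker-node-1>, <ip-or-hostname-of-worker-node-2>]
auth:
    ssh_user: <username>
    ssh_private_key: ~/.ssh/id_rsa
min_workers: 2    # typically equal to the number of worker IPs
max_workers: 2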
To launch the cluster, run the following command:
ray up local_cluster_config.yaml
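If you prefer to skip the interactive confirmation prompt, ray up also accepts the -y flag:
ray up -y local_cluster_config.yaml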
To verify that the cluster is running, attach to the head node:
ray attach local_cluster_config.yaml
Check the status of the cluster:
(base) ray@SOMAI-SERV01:~$ ray status
======== Autoscaler status: 2023-12-08 14:29:16.383004 ========
Node status
---------------------------------------------------------------
Active:
2 local.cluster.node
Pending:
(no pending nodes)
Recent failures:
(no failures)
Resources
---------------------------------------------------------------
Usage:
0.0/64.0 CPU
0.0/4.0 GPU
0B/203.37GiB memory
0B/37.92GiB object_store_memory
Demands:
(no resource demands)
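Optionally, you can also confirm that tasks are scheduled across the cluster by running a short Ray script from the attached head-node session. The snippet below is a minimal sketch and is not part of this repository:
import ray
import socket

# Connect to the already-running cluster (run from the head node).
ray.init(address="auto")

@ray.remote
def node_hostname():
    # Each task reports the hostname of the node it ran on.
    return socket.gethostname()

# With multiple active nodes, more than one hostname should appear.
print(set(ray.get([node_hostname.remote() for _ in range(20)])))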
To tear down the cluster, run the following command from the client machine:
ray down local_cluster_config.yaml
ray down does not always terminate the worker nodes properly, as documented here. Incomplete termination can cause issues when launching subsequent clusters. To check whether the cluster has terminated properly, follow these steps:
SSH into a worker node.
ssh <worker-node-username>@<worker-node-ip-address>
Check for a hanging Docker container.
docker ps | grep ray_container
If the container is still running, stop it.
docker stop ray_container
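If more than one worker has a hanging container, a small loop run from the client machine can stop it on each node; the username and IP placeholders below match the earlier steps:
# Stop the leftover Ray container on each worker node.
for ip in <worker-node-1-ip-address> <worker-node-2-ip-address>; do
    ssh <worker-node-username>@"$ip" docker stop ray_container
done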