ray-project / ray

Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0
32.05k stars 5.46k forks

Enable Docker Restart Policy for Cloud VM Autoscaling Cluster #45246

Open cheehook opened 2 months ago

cheehook commented 2 months ago

Description

Would it make sense to let Ray users choose between a --restart policy and --rm when bringing up an autoscaling cluster?

Use case

I am setting up an autoscaling cluster on AWS. It works fine, but my VMs are subject to a regular system reboot policy, and each time the Ray head node VM is rebooted, the Ray container does not start again.

I tried to SSH into the head node instance and add a restart policy to the container with docker update --restart=always ray_container, but this is rejected because the container was created with --rm:

Error response from daemon: Cannot update container ac16b077780a170c4792a13f8c6513b15fcb534c69f54a9bca79e5c14a489d47: Restart policy cannot be updated because AutoRemove is enabled for the container

Similarly, if I add --restart=always to the docker run options in the autoscaler configuration file, ray up fails for the same reason.
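The conflict between --rm and a restart policy can be illustrated with a small pre-flight check. This is a minimal sketch, not Ray autoscaler code: the helper names restart_policy and conflicts_with_rm are hypothetical, the options are assumed to be a list of strings like the docker run_options list in the cluster YAML, and auto_remove=True models the behavior reported above, where the container is started with --rm.

```python
import shlex
from typing import List, Optional


def restart_policy(run_options: List[str]) -> Optional[str]:
    """Return the value passed to --restart, or None if it is not set.

    Each entry may hold several tokens (e.g. "--restart always"),
    so every entry is split with shlex before scanning.
    """
    tokens = [tok for opt in run_options for tok in shlex.split(opt)]
    policy = None
    for i, tok in enumerate(tokens):
        if tok.startswith("--restart="):
            policy = tok.split("=", 1)[1]
        elif tok == "--restart" and i + 1 < len(tokens):
            policy = tokens[i + 1]
    return policy


def conflicts_with_rm(run_options: List[str], auto_remove: bool = True) -> bool:
    """True if a non-trivial restart policy is combined with --rm.

    auto_remove=True is an assumption: it stands for a launcher that
    always adds --rm on its own, as described in this issue.
    """
    tokens = [tok for opt in run_options for tok in shlex.split(opt)]
    has_rm = auto_remove or "--rm" in tokens
    return has_rm and restart_policy(run_options) not in (None, "", "no")
```

Running such a check on the configuration before ray up would surface the problem early, for example conflicts_with_rm(["--restart=always"]) evaluates to True, while a policy of "no" does not conflict.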

As a result, every time the head node is rebooted, I have to run ray down and ray up manually, then update and restart the other services that use the Ray cluster so that they pick up the new Ray head IP.

rynewang commented 1 month ago

Would you mind sharing your use case? If we restart the container for Ray, it is likely to come back as a fresh, empty cluster with no Ray Jobs, which needs user hand-holding (e.g. submitting jobs) anyway.

cheehook commented 1 month ago

My organization has a weekly restart policy that restarts all active VMs, including the nodes of the Ray autoscaling cluster, so I am looking for an approach that handles this situation.

I have a Ray Client set up on another VM. The client exposes a REST API that receives job requests from users, and Ray Client then submits those requests to the Ray cluster. Therefore, as soon as the cluster is restarted, I need it to be ready to handle new job requests.
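On the client side, one way to cope with a restarting head node is to wait until its port accepts connections before forwarding a job request. The following is only a sketch under stated assumptions: wait_for_port is a hypothetical helper, not part of Ray, and an open port only shows that something is listening, not that the cluster is fully ready.

```python
import socket
import time


def wait_for_port(host: str, port: int,
                  timeout_s: float = 300.0,
                  interval_s: float = 5.0) -> bool:
    """Poll host:port until a TCP connection succeeds or timeout_s elapses.

    Returns True as soon as a connection is accepted, False on timeout.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            # A successful connect means a server is listening on the port.
            with socket.create_connection((host, port), timeout=interval_s):
                return True
        except OSError:
            pass  # Refused or timed out: the head node is not up yet.
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```

The REST service could call wait_for_port(head_ip, 10001) before reconnecting, where head_ip is a placeholder for the head node address and 10001 is assumed to be the Ray Client server port (its default unless configured otherwise).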