Open cheehook opened 2 months ago
Would you mind sharing your use case? If we restart a container for Ray, it's likely to restart as a fresh, empty cluster with no Ray Jobs, which needs user hand-holding (e.g. submitting jobs) anyway.
My organization has a weekly restart policy that restarts all active VMs, including the nodes of the Ray autoscaling cluster, so I am looking for an approach to handle this situation.
I have a Ray client set up on another VM. The client has a REST API that receives job requests from users, and the Ray client then submits those job requests to the Ray cluster. Therefore, as soon as the cluster is restarted, I need it to be ready to handle new job requests.
Description
Is it a good idea to allow Ray users to choose either a --restart or a --rm policy when bringing up the autoscaling cluster?
Use case
I am setting up an autoscaling cluster on AWS. It works fine, but my VM has a regular system reboot policy, and each time the Ray head node VM is rebooted, the ray container does not start again.
I tried to SSH into the head node instance and add a restart policy to the container with
docker update --restart=always ray_container
but this is forbidden because the container has the --rm policy:
Error response from daemon: Cannot update container ac16b077780a170c4792a13f8c6513b15fcb534c69f54a9bca79e5c14a489d47: Restart policy cannot be updated because AutoRemove is enabled for the container
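The error comes from the container having been created with --rm, which Docker records as HostConfig.AutoRemove in the container's metadata. As a minimal sketch, assuming the JSON printed by docker inspect for the container is available as a string, the flag can be checked before attempting docker update. The helper name restart_policy_updatable and the sample JSON are hypothetical and only for illustration.

```python
import json


def restart_policy_updatable(inspect_output: str) -> bool:
    """Return True if `docker update --restart=...` should be accepted.

    `inspect_output` is the JSON text printed by `docker inspect <container>`,
    which is a list with one object per inspected container.
    (Hypothetical helper for illustration; not part of Ray or Docker.)
    """
    containers = json.loads(inspect_output)
    if not containers:
        raise ValueError("no container found in inspect output")
    host_config = containers[0].get("HostConfig", {})
    # Containers created with --rm have AutoRemove set to true, and Docker
    # refuses to change the restart policy of such containers.
    return not host_config.get("AutoRemove", False)


# Trimmed-down, made-up example of what `docker inspect ray_container`
# might print for a container that was started with --rm.
sample = '[{"Name": "/ray_container", "HostConfig": {"AutoRemove": true}}]'
print(restart_policy_updatable(sample))
```

When this check returns False, the restart policy cannot be added in place; the container would have to be recreated without --rm.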
Similarly, if I add --restart=always to the docker run options in the autoscaler configuration file, ray up fails for the same reason.
Thus, every time the head node is rebooted, I have to manually run ray down and ray up, and then update and restart the other services that use the Ray cluster with the new Ray head IP.
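Until a restart policy can be configured, the manual recovery steps could be scripted. The following is a rough sketch rather than an official Ray workflow: it assumes the Ray CLI is on the PATH, takes the path of the autoscaler config file as an argument (cluster.yaml in the comment is a hypothetical name), and assumes that the last line printed by ray get-head-ip is the head node IP. Updating the dependent services is left to the caller.

```python
import subprocess
from typing import List


def recovery_commands(config_path: str) -> List[List[str]]:
    """Build the CLI calls that tear down and recreate the cluster."""
    return [
        ["ray", "down", "-y", config_path],  # -y skips the confirmation prompt
        ["ray", "up", "-y", config_path],
    ]


def recover_cluster(config_path: str) -> str:
    """Run the recovery commands and return the new head node IP."""
    for cmd in recovery_commands(config_path):
        subprocess.run(cmd, check=True)
    result = subprocess.run(
        ["ray", "get-head-ip", config_path],
        check=True,
        capture_output=True,
        text=True,
    )
    lines = result.stdout.strip().splitlines()
    if not lines:
        raise RuntimeError("ray get-head-ip printed no output")
    # Assumption: any log lines come first and the IP is on the last line.
    return lines[-1].strip()


# Example call (not executed here because it needs a live cloud setup):
#   new_head_ip = recover_cluster("cluster.yaml")
#   ...then point the REST API service at new_head_ip and restart it.
```

Keeping the command list in its own function makes it possible to review or log the exact commands before anything is torn down.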