Open andrewsykim opened 10 months ago
Thoughts @kevin85421 ?
The suspend feature in RayJob will issue a request to the Ray head Pod to halt the job before the RayCluster is deleted. For RayCluster, I prefer to avoid doing too many things on the data plane (i.e. Ray). If users want to suspend a RayCluster, they should make sure all jobs are stopped by themselves.
Btw, this is pertaining to deletion, not suspension.
@kevin85421 here's the use case I am thinking about: a user adds the finalizer ray.io/wait-for-job-completion to a RayCluster and then runs the delete command kubectl delete raycluster my-cluster. Note that the finalizer would be optional, and blocking deletion on job completion is not the default behavior. I agree with your previous comment that we don't need to cover this for suspension.
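To make the opt-in nature concrete, here is a minimal sketch of how a client might add the finalizer to a RayCluster manifest before creating it. The helper name with_wait_finalizer is hypothetical, and the finalizer name is only the one proposed in this thread, not something KubeRay is confirmed to recognize.

```python
import copy

# Finalizer name as proposed in this thread (a proposal, not an existing KubeRay feature).
FINALIZER = "ray.io/wait-for-job-completion"


def with_wait_finalizer(manifest: dict) -> dict:
    """Return a copy of a RayCluster manifest with the opt-in finalizer added.

    Hypothetical helper for illustration; the input manifest is not modified.
    """
    updated = copy.deepcopy(manifest)
    finalizers = updated.setdefault("metadata", {}).setdefault("finalizers", [])
    if FINALIZER not in finalizers:
        finalizers.append(FINALIZER)
    return updated


# A deliberately incomplete manifest: only the fields relevant to the finalizer.
cluster = {
    "apiVersion": "ray.io/v1",
    "kind": "RayCluster",
    "metadata": {"name": "my-cluster"},
}
opted_in = with_wait_finalizer(cluster)
```

Clusters created without the finalizer would keep today's behavior and be deleted immediately.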
We have a similar use case:
Description
I would like to introduce a finalizer that can be used with RayCluster to block deletion until all jobs in the Ray cluster are completed.
Use case
This feature would allow you to request deletion of a Ray cluster while jobs are still running. The finalizer will query the Ray head service and ensure that all jobs have completed before resources are cleaned up. This is handy when you want to automatically clean up resources immediately after a long-running training job, and it matters even more for larger jobs, where resources need to be cleaned up as soon as possible to save costs.
This can also be used as a safety measure to ensure RayClusters with running jobs can't be accidentally deleted.
While RayJob can be used for similar use-cases, it is not a viable option for longer-lived RayClusters that can accept multiple jobs before being deleted.
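The decision the finalizer would have to make can be sketched roughly as below. This is only an illustration under stated assumptions, not the KubeRay implementation: the function names are hypothetical, the job statuses are assumed to have already been fetched from the Ray head service, and the terminal states are assumed to be SUCCEEDED, FAILED, and STOPPED.

```python
# Finalizer name as proposed in this issue (a proposal, not an existing KubeRay feature).
FINALIZER = "ray.io/wait-for-job-completion"

# Assumption: job statuses that mean a Ray job will make no further progress.
TERMINAL_STATUSES = {"SUCCEEDED", "FAILED", "STOPPED"}


def all_jobs_finished(job_statuses: list[str]) -> bool:
    """True when every job is in a terminal state (also true when there are no jobs)."""
    return all(status in TERMINAL_STATUSES for status in job_statuses)


def reconcile_deletion(finalizers: list[str],
                       deletion_requested: bool,
                       job_statuses: list[str]) -> list[str]:
    """Return the finalizer list after one hypothetical reconcile pass.

    The finalizer is removed only when deletion was requested and every job
    has finished; otherwise the list is returned unchanged, which keeps the
    RayCluster object (and its resources) alive.
    """
    if not deletion_requested or FINALIZER not in finalizers:
        return list(finalizers)
    if all_jobs_finished(job_statuses):
        return [f for f in finalizers if f != FINALIZER]
    return list(finalizers)


# Deletion requested while a job is still running: the finalizer stays.
blocked = reconcile_deletion([FINALIZER], True, ["SUCCEEDED", "RUNNING"])
# All jobs finished: the finalizer is dropped and deletion can proceed.
released = reconcile_deletion([FINALIZER], True, ["SUCCEEDED", "FAILED"])
```

A real controller would re-run this check periodically until the jobs finish, and would need to decide what to do if the head service is unreachable; both are left out of the sketch.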
Related issues
No response