ray-project / kuberay

A toolkit to run Ray applications on Kubernetes
Apache License 2.0

[Feature] Finalizer to block deletion of RayCluster with running jobs #1740

Open andrewsykim opened 10 months ago

andrewsykim commented 10 months ago

Description

I would like to introduce a finalizer that can be used with RayCluster to block deletion until all jobs in the Ray cluster are completed.

Use case

This feature would let you issue the delete for a Ray cluster while jobs are still running. The finalizer would query the Ray head service and hold back resource cleanup until all jobs have completed. This is handy when you want resources cleaned up automatically as soon as a long-running training job finishes, and it matters even more for larger jobs, where resources need to be released as soon as possible to save costs.
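To make the proposed behavior concrete, here is a minimal sketch of the decision such a finalizer would make on each reconcile. It is written in Python for brevity and is not KubeRay code. The finalizer name `example.ray.io/wait-for-jobs`, the function names, and the returned action strings are all hypothetical. The job list is assumed to have already been fetched from the Ray head service, with each job carrying a `status` field set to one of Ray's job statuses (`PENDING`, `RUNNING`, `SUCCEEDED`, `FAILED`, `STOPPED`).

```python
from typing import Iterable, Mapping

# Assumption: hypothetical finalizer name, not one that KubeRay defines.
FINALIZER = "example.ray.io/wait-for-jobs"

# Ray job statuses after which a job makes no further progress.
TERMINAL_STATUSES = {"SUCCEEDED", "FAILED", "STOPPED"}


def has_running_jobs(jobs: Iterable[Mapping[str, str]]) -> bool:
    """Return True if any job is not yet in a terminal status.

    A job with a missing or unknown status is treated as still running,
    so the sketch errs on the side of blocking deletion.
    """
    return any(job.get("status") not in TERMINAL_STATUSES for job in jobs)


def reconcile_deletion(cluster: dict, jobs: Iterable[Mapping[str, str]]) -> str:
    """Decide what to do with a RayCluster-like object during reconcile.

    `cluster` mimics the Kubernetes object shape: a dict with a `metadata`
    key holding optional `deletionTimestamp` and `finalizers` entries.
    """
    metadata = cluster["metadata"]
    finalizers = metadata.get("finalizers", [])

    if metadata.get("deletionTimestamp") is None:
        return "not-deleting"       # nobody asked for deletion yet
    if FINALIZER not in finalizers:
        return "no-finalizer"       # the opt-in finalizer is absent
    if has_running_jobs(jobs):
        return "requeue"            # keep the finalizer, check again later

    # All jobs are done: drop the finalizer so Kubernetes can finish deleting.
    metadata["finalizers"] = [f for f in finalizers if f != FINALIZER]
    return "finalizer-removed"
```

Because the finalizer is only acted on when it is present in `metadata.finalizers`, clusters that never add it keep today's deletion behavior.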

This can also be used as a safety measure to ensure RayClusters with running jobs can't be accidentally deleted.

While RayJob can be used for similar use-cases, it is not a viable option for longer-lived RayClusters that can accept multiple jobs before being deleted.

Related issues

No response

andrewsykim commented 9 months ago

Thoughts @kevin85421 ?

kevin85421 commented 9 months ago

The suspend feature in RayJob will issue a request to the Ray head Pod to halt the job before the RayCluster is deleted. For RayCluster, I prefer to avoid doing too many things on the data plane (i.e. Ray). If users want to suspend a RayCluster, they should make sure all jobs are stopped themselves.

andrewsykim commented 9 months ago

Btw, this is pertaining to deletion, not suspension.

andrewsykim commented 9 months ago

@kevin85421 here's the use case I am thinking about:

Note that the finalizer would be optional, and blocking deletion on job completion would not be the default behavior. I agree with your previous comment that we don't need to cover this for suspension.

chenk008 commented 8 months ago

We have a similar use case:

  1. The RayCluster is shared by several tenants, who submit jobs to it.
  2. While the RayCluster is being deleted, pods should not be deleted until all jobs have completed.
  3. New jobs are rejected during that time.
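
The three points above can be modeled as a small drain gate. This is only an illustrative sketch in Python, not KubeRay code. The `SharedCluster` class and its methods are hypothetical, and job state is kept in memory rather than read from the Ray head.

```python
from dataclasses import dataclass, field

# Ray job statuses after which a job makes no further progress.
TERMINAL = {"SUCCEEDED", "FAILED", "STOPPED"}


@dataclass
class SharedCluster:
    """Hypothetical in-memory model of a multi-tenant RayCluster."""

    deleting: bool = False                     # set once deletion is requested
    jobs: dict = field(default_factory=dict)   # job id -> status

    def submit(self, job_id: str) -> bool:
        """Accept a job unless the cluster is draining (point 3)."""
        if self.deleting:
            return False
        self.jobs[job_id] = "RUNNING"
        return True

    def finish(self, job_id: str, status: str = "SUCCEEDED") -> None:
        """Record a terminal status for an existing job."""
        self.jobs[job_id] = status

    def can_delete_pods(self) -> bool:
        """Pods may go only once deletion was requested and all jobs ended (point 2)."""
        return self.deleting and all(s in TERMINAL for s in self.jobs.values())
```

For example, after a tenant submits job `a` and deletion is requested, `submit("b")` returns `False` and `can_delete_pods()` stays `False` until `finish("a")` is called.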