ray-project / kuberay

A toolkit to run Ray applications on Kubernetes
Apache License 2.0
963 stars 328 forks source link

[Bug] Fail the job, if the head node crashes #2161

Closed peterghaddad closed 3 weeks ago

peterghaddad commented 1 month ago

Search before asking

KubeRay Component

ray-operator

What happened + What you expected to happen

When the head node crashes due to OOM or other reasons, the cluster head node will respawn and the job will not gracefully resume when GCS fault tolerance is disabled. The job does not resume; however, the cluster recovers leaving lingering resources until the job is manually deleted.

Add in support for that when GCS is disabled, Ray Jobs fail when there are head node interruptions.

Reproduction script

Create a Ray job CRD with GCS fault tolerance disabled on the operator, delete the head node once the job begins. The job will stay in a running state until the job is manually deleted.

Anything else

No response

Are you willing to submit a PR?

kevin85421 commented 1 month ago

until the job is manually deleted.

peterghaddad commented 3 weeks ago

Yes, mean KubeRay CRD.

Using KubeRay v1.0.0 and upgrading to V1.1.1 resolved these problems.