Search before asking
[X] I searched the issues and found no similar issues.
KubeRay Component
ray-operator
What happened + What you expected to happen
When GCS fault tolerance is disabled and the head node crashes (due to OOM or any other reason), the head node respawns and the cluster recovers, but the job does not resume. The job is left in a running state, and its resources linger until the job is manually deleted.
Expected: add support so that, when GCS fault tolerance is disabled, Ray Jobs fail when the head node is interrupted.
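As a rough illustration of the requested behavior (a sketch only, not KubeRay's actual reconcile logic), the check could look like the following. The `JobObservation` type, its fields, and the idea of detecting an interruption by comparing the head pod's UID against the one seen at submission are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class JobObservation:
    """Hypothetical snapshot of what a controller might know about a job."""
    gcs_ft_enabled: bool       # whether GCS fault tolerance is enabled
    head_uid_at_submit: str    # head pod UID recorded when the job was submitted
    current_head_uid: str      # head pod UID observed now
    job_status: str            # "RUNNING", "SUCCEEDED", or "FAILED"


def next_job_status(obs: JobObservation) -> str:
    """Return the status the job should move to, given the observation."""
    # Terminal states are never changed.
    if obs.job_status != "RUNNING":
        return obs.job_status

    # Assumption: a different head pod UID means the head node was replaced.
    head_replaced = obs.head_uid_at_submit != obs.current_head_uid

    # Without GCS fault tolerance the job state is lost with the old head,
    # so the job cannot resume and should be failed instead of left running.
    if head_replaced and not obs.gcs_ft_enabled:
        return "FAILED"
    return obs.job_status
```

With this rule, a running job whose head node was replaced is marked failed only when GCS fault tolerance is off; with it on, the job is left alone so it can resume.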
Reproduction script
Create a RayJob custom resource with GCS fault tolerance disabled on the operator, then delete the head node once the job begins. The job stays in a running state until it is manually deleted.
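The steps above could be sketched roughly as follows. This builds a minimal RayJob manifest with no GCS fault tolerance settings and prints it as JSON. The job name, entrypoint, and image are placeholders chosen for this example, and the manifest is deliberately minimal, so it may need more fields (resources, worker groups) in a real cluster.

```python
import json


def build_rayjob_manifest(name: str, entrypoint: str, image: str) -> dict:
    """Build a minimal RayJob manifest without any GCS fault tolerance config."""
    head_container = {"name": "ray-head", "image": image}
    return {
        "apiVersion": "ray.io/v1",
        "kind": "RayJob",
        # No fault tolerance annotations or options are set here, which is
        # assumed to leave GCS fault tolerance disabled.
        "metadata": {"name": name},
        "spec": {
            "entrypoint": entrypoint,
            "rayClusterSpec": {
                "headGroupSpec": {
                    "rayStartParams": {},
                    "template": {"spec": {"containers": [head_container]}},
                },
            },
        },
    }


if __name__ == "__main__":
    # Placeholder values: a long-running entrypoint leaves time to delete the head.
    manifest = build_rayjob_manifest(
        name="head-crash-repro",
        entrypoint='python -c "import time; time.sleep(600)"',
        image="rayproject/ray:2.9.0",
    )
    print(json.dumps(manifest, indent=2))
```

The output can be piped to `kubectl apply -f -`. Once the job is running, deleting the head pod (for example with `kubectl delete pod -l ray.io/node-type=head`, assuming the usual KubeRay label) should reproduce the stuck running state.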
Anything else
No response
Are you willing to submit a PR?