sathyanarays opened 1 month ago
I will open a PR to change the `ray job submit` behavior. Currently, if we use the same submission ID for multiple `ray job submit` commands, only the first one succeeds while all subsequent attempts fail immediately. I will modify the behavior so that subsequent `ray job submit` commands tail the logs of the running Ray job instead of failing directly.
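The proposed change can be sketched as follows. This is a minimal, hypothetical model of the dedup-and-tail logic using an in-memory registry, not the real Ray job server or `JobSubmissionClient` API:

```python
# Sketch of the proposed `ray job submit` behavior: resubmitting with an
# existing submission ID attaches to the running job instead of failing.
# The dict below is an in-memory stand-in for the Ray job server (assumption).

jobs: dict[str, str] = {}  # submission_id -> status

def submit_or_tail(submission_id: str, entrypoint: str) -> str:
    """Submit a new job, or attach to an existing one with the same ID."""
    if submission_id in jobs:
        # Current behavior: fail immediately. Proposed: tail the running job.
        return f"tailing logs of existing job {submission_id}"
    jobs[submission_id] = "RUNNING"
    return f"submitted {submission_id}: {entrypoint}"

print(submit_or_tail("job-1", "python train.py"))  # first call submits
print(submit_or_tail("job-1", "python train.py"))  # second call tails logs
```

The key point is that a duplicate submission ID becomes an attach operation rather than an error, which makes retries of the same `ray job submit` command idempotent.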
Sorry for the delay on this patch! I plan to revisit https://github.com/ray-project/ray/pull/45498 this week.
Search before asking
KubeRay Component
ray-operator
What happened + What you expected to happen
Observed Behavior
When the driver pod fails, the RayJob is marked as Running even though the underlying Ray job has completed.
Expected Behavior
We expect the Kubernetes RayJob object to reflect the status of the Ray job as seen on the head node. In this case, we expect the RayJob to be marked as Complete because the underlying Ray job completes successfully.
Reproduction script
Steps to reproduce
1. Create a training RayJob
Create a long-running training job by creating a RayJob on a Kubernetes cluster.
The job used to reproduce this issue can be found here.
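The original reproduction manifest is linked above and not inlined here; a minimal RayJob along these lines stands in for it (the name, image tag, and entrypoint below are placeholders, not the original reproduction script):

```yaml
apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: long-running-training        # placeholder name
spec:
  entrypoint: python /home/ray/train.py   # placeholder entrypoint
  rayClusterSpec:
    headGroupSpec:
      rayStartParams: {}
      template:
        spec:
          containers:
            - name: ray-head
              image: rayproject/ray:2.9.0   # placeholder image tag
```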
2. Delete the driver pod
Wait for the Ray job status to reach the “RUNNING” state, then delete the driver pod.
This step simulates a driver pod failure; the driver pod could fail for multiple reasons, such as node failure, network interruptions, etc.
3. Observe the job status in the head node
Get a shell into the head node and check the status of the job. The job is still in the RUNNING state.
4. Wait for the Ray job to complete
Inside the head node, keep checking until the job completes. Once it completes, the status looks similar to the following snippet.
Exit the head node shell and check the RayJob status using `kubectl`.
Note that the RayJob is still in the Running state while its deployment status is “Failed”.
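The status propagation this report expects can be summarized as a small rule. The helper below is a hypothetical illustration, not the actual kuberay ray-operator code: the RayJob status should follow the job status reported by the head node, and a lost driver pod should not mask a job that has already finished.

```python
# Hypothetical sketch of the expected RayJob status derivation: the status
# should track the head node's view of the job, so `driver_pod_failed` is
# deliberately ignored for a job that already reached a terminal state.

def expected_rayjob_status(head_node_job_status: str, driver_pod_failed: bool) -> str:
    """Derive the RayJob status from the head node's job status."""
    terminal = {"SUCCEEDED", "FAILED", "STOPPED"}
    if head_node_job_status in terminal:
        # The driver pod's fate must not override a finished job.
        return head_node_job_status
    return "RUNNING"

# The observed bug: driver pod deleted, head node reports SUCCEEDED,
# yet the RayJob stays Running with deployment status Failed.
print(expected_rayjob_status("SUCCEEDED", driver_pod_failed=True))  # SUCCEEDED
```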
Anything else
Driver logs
Are you willing to submit a PR?