ray-project / kuberay

A toolkit to run Ray applications on Kubernetes
Apache License 2.0

[Bug] RayJob falsely marked as "Running" when driver fails #2154

Open sathyanarays opened 1 month ago

sathyanarays commented 1 month ago

KubeRay Component

ray-operator

What happened + What you expected to happen

Observed Behavior

When the driver pod fails, the RayJob remains marked as Running even after the underlying Ray job has completed.

Expected Behavior

We expect the Kubernetes RayJob object to reflect the status of the Ray job as seen on the head node. In this case, we expect the RayJob to be marked as complete once the underlying Ray job completes successfully.

Reproduction script

Steps to reproduce

1. Create a training RayJob

Create a long-running training job by creating a RayJob on a Kubernetes cluster.

kubectl apply -f rayjob.yaml

The job used to reproduce this issue can be found here.
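
For illustration, a minimal manifest along those lines might look like the sketch below. The entrypoint matches the job seen later in ray list jobs; the image tag, rayVersion, and cluster sizing are assumptions, not taken from the actual reproduction job.

apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: rayjob-sample
spec:
  # Entrypoint matches the submission shown in `ray list jobs` below.
  entrypoint: python /home/ray/samples/sample_code.py
  rayClusterSpec:
    rayVersion: '2.9.0'   # illustrative version
    headGroupSpec:
      rayStartParams:
        dashboard-host: '0.0.0.0'
      template:
        spec:
          containers:
          - name: ray-head
            image: rayproject/ray:2.9.0   # illustrative image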

2. Delete the driver pod

Wait for the Ray job status to reach the RUNNING state, then delete the driver pod.

kubectl delete pods rayjob-sample-qrppt

This step simulates a failure of the driver pod, which could happen for many reasons: node failure, network interruptions, and so on.
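
Since the driver (submitter) pod name carries a random suffix, one way to look it up, assuming the default Kubernetes-Job submission mode where KubeRay creates a submitter Job named after the RayJob, is the standard job-name pod label:

# Lists the submitter pod(s) created by the Job named rayjob-sample.
kubectl get pods -l job-name=rayjob-sample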

3. Observe the job status in the head node

Get a shell into the head node and check the status of the job. The job is still in the RUNNING state.

> ray list jobs

======== List: 2024-05-13 02:06:20.693535 ========
Stats:
------------------------------
Total: 1

Table:
------------------------------
      JOB_ID  SUBMISSION_ID        ENTRYPOINT                               TYPE        STATUS    MESSAGE                    ERROR_TYPE    DRIVER_INFO
 0  02000000  rayjob-sample-8b9r6  python /home/ray/samples/sample_code.py  SUBMISSION  RUNNING   Job is currently running.                id: '02000000'
                                                                                                                                           node_ip_address: 10.244.0.11
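
For reference, one way to open that shell on the head node, assuming KubeRay's standard ray.io/node-type label on head pods (a sketch, not taken from the reproduction):

# Exec into the first pod labeled as a Ray head node.
kubectl exec -it $(kubectl get pods -l ray.io/node-type=head -o name) -- bash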

4. Wait for the Ray job to complete

Inside the head node, keep checking for the job to complete. Once the job completes, the status should look similar to the following snippet:

> ray list jobs

======== List: 2024-05-13 02:33:21.384251 ========
Stats:
------------------------------
Total: 1

Table:
------------------------------
      JOB_ID  SUBMISSION_ID        ENTRYPOINT                               TYPE        STATUS     MESSAGE                     ERROR_TYPE    DRIVER_INFO
 0  02000000  rayjob-sample-8b9r6  python /home/ray/samples/sample_code.py  SUBMISSION  SUCCEEDED  Job finished successfully.                id: '02000000'
                                                                                                                                             node_ip_address: 10.244.0.11
                                                                                                                                             pid: '1688'
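
The same check also works without a shell on the head node, by port-forwarding the dashboard service (service name taken from the driver logs below; the local port choice is arbitrary) and pointing the job CLI at it:

# Forward the Ray dashboard port in the background, then query the job.
kubectl port-forward svc/rayjob-sample-raycluster-wn68n-head-svc 8265:8265 &
ray job status rayjob-sample-8b9r6 --address http://localhost:8265
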
5. Check the RayJob

Exit the head node shell and check the RayJob status using kubectl:

> kubectl get rayjob

NAME            JOB STATUS   DEPLOYMENT STATUS   START TIME             END TIME               AGE
rayjob-sample   RUNNING      Failed              2024-05-13T08:53:33Z   2024-05-13T09:05:18Z   40m

Note that the RayJob's job status is still RUNNING even though its deployment status is Failed.
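
The mismatch is also visible directly on the status subresource (jobStatus and jobDeploymentStatus are the field names in the KubeRay v1 CRD):

# Prints the two status fields that the kubectl columns above are derived from.
kubectl get rayjob rayjob-sample -o jsonpath='{.status.jobStatus}{" / "}{.status.jobDeploymentStatus}{"\n"}'
# RUNNING / Failed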

Anything else

Driver logs

2024-05-13 02:04:53,418 INFO cli.py:36 -- Job submission server address: http://rayjob-sample-raycluster-wn68n-head-svc.default.svc.cluster.local:8265
Traceback (most recent call last):
  File "/home/ray/anaconda3/bin/ray", line 8, in <module>
    sys.exit(main())
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/scripts/scripts.py", line 2498, in main
    return cli()
  File "/home/ray/anaconda3/lib/python3.8/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.8/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/home/ray/anaconda3/lib/python3.8/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/ray/anaconda3/lib/python3.8/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/ray/anaconda3/lib/python3.8/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/ray/anaconda3/lib/python3.8/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/dashboard/modules/job/cli_utils.py", line 54, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/autoscaler/_private/cli_logger.py", line 856, in wrapper
    return f(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/dashboard/modules/job/cli.py", line 272, in submit
    job_id = client.submit_job(
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/dashboard/modules/job/sdk.py", line 254, in submit_job
    self._raise_error(r)
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/dashboard/modules/dashboard_sdk.py", line 283, in _raise_error
    raise RuntimeError(
RuntimeError: Request failed with status code 500: Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/dashboard/modules/job/job_head.py", line 287, in submit_job
    resp = await job_agent_client.submit_job_internal(submit_request)
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/dashboard/modules/job/job_head.py", line 80, in submit_job_internal
    await self._raise_error(resp)
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/dashboard/modules/job/job_head.py", line 68, in _raise_error
    raise RuntimeError(f"Request failed with status code {status}: {error_text}.")
RuntimeError: Request failed with status code 400: Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/dashboard/modules/job/job_agent.py", line 45, in submit_job
    submission_id = await self.get_job_manager().submit_job(
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/dashboard/modules/job/job_manager.py", line 945, in submit_job
    raise ValueError(
ValueError: Job with submission_id rayjob-sample-8b9r6 already exists. Please use a different submission_id.
.
.

Are you willing to submit a PR?

kevin85421 commented 1 month ago

I will open a PR to change the ray job submit behavior. Currently, if we use the same submission ID for multiple ray job submit commands, only the first one succeeds while all subsequent attempts fail immediately. I will modify the behavior so that subsequent ray job submit commands can tail the logs of the running Ray job instead of failing directly.
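
Concretely, a sketch of today's behavior versus the intended one (the address and IDs here are illustrative):

# First submission with an explicit submission ID succeeds:
ray job submit --address http://localhost:8265 \
  --submission-id rayjob-sample-8b9r6 -- python /home/ray/samples/sample_code.py

# Re-running the same command currently fails with HTTP 400
# ("Job with submission_id ... already exists"). Under the proposed change,
# the retry would tail the running job's logs instead, roughly what this
# manual step does today:
ray job logs rayjob-sample-8b9r6 --follow --address http://localhost:8265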

kevin85421 commented 1 month ago

https://github.com/ray-project/ray/pull/45498

andrewsykim commented 2 weeks ago

Sorry for the delay on this patch! I plan to revisit https://github.com/ray-project/ray/pull/45498 this week