ray-project / kuberay

A toolkit to run Ray applications on Kubernetes
Apache License 2.0

[Bug] RayJob does not enter `Complete` state after job application failure #1233

Closed architkulkarni closed 9 months ago

architkulkarni commented 1 year ago


KubeRay Component

ray-operator

What happened + What you expected to happen

Start the KubeRay operator, then run `kubectl apply -f /Users/archit/kuberay/ray-operator/config/samples/ray_v1alpha1_rayjob.yaml` after modifying this YAML to include a typo or some other bug in the `entrypoint` field (to cause an application error).

Then `kubectl describe rayjob rayjob-sample` will show:

  Job Deployment Status:  Running
  Job Id:                 rayjob-sample-prbts
  Job Status:             FAILED
  Message:                Job failed due to an application error, last available logs (truncated to 20,000 chars):
python: can't open file '/home/ray/samples/sample_code.ppy': [Errno 2] No such file or directory

The expected behavior is that the Job Deployment Status is `Complete`, not `Running`.

Likely related: the KubeRay operator log prints the following in a loop every 3 seconds:

2023-07-11T17:58:08.015Z        INFO    controllers.RayJob      reconciling RayJob  {"NamespacedName": "default/rayjob-sample"}
2023-07-11T17:58:08.015Z        INFO    controllers.RayJob      Found associated RayCluster for RayJob       {"rayjob": "rayjob-sample", "raycluster": "default/rayjob-sample-raycluster-2h7ds"}
2023-07-11T17:58:08.016Z        INFO    controllers.RayJob      K8s job successfully retrieved       {"RayJob": "rayjob-sample", "jobId": "rayjob-sample"}

Reproduction script

Above

Anything else

Happens every time


architkulkarni commented 1 year ago

cc @kevin85421, I think I need to fix this before the release.

architkulkarni commented 1 year ago

Actually @kevin85421 I think this behavior is intended. Here's the relevant code:

https://github.com/ray-project/kuberay/blob/a0e59be1a0c94507d8a32ee059b92effeb67f09a/ray-operator/controllers/ray/rayjob_controller.go#L128-L154

If shutdownAfterJobFinishes is false (the default), then the cluster remains running, so the JobDeploymentStatus is still Running, regardless of the status of the underlying Ray Job.

If shutdownAfterJobFinishes is true, then the JobDeploymentStatus transitions to Complete as expected.
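
The two cases above can be sketched roughly as follows. This is a simplified illustration, not the controller code at the link above: `nextDeploymentStatus`, `isTerminal`, and the string constants are hypothetical names, only the two deployment statuses discussed here are modeled, and the sketch assumes that SUCCEEDED, FAILED, and STOPPED are the terminal Ray job statuses.

```go
package main

import "fmt"

// Hypothetical stand-ins for the two JobDeploymentStatus values discussed here.
const (
	deploymentRunning  = "Running"
	deploymentComplete = "Complete"
)

// isTerminal reports whether a Ray job status is final.
// Assumption: SUCCEEDED, FAILED, and STOPPED are the terminal statuses.
func isTerminal(jobStatus string) bool {
	switch jobStatus {
	case "SUCCEEDED", "FAILED", "STOPPED":
		return true
	}
	return false
}

// nextDeploymentStatus models the behavior described above: the deployment
// status only becomes Complete when the job has finished AND the cluster is
// configured to shut down afterwards.
func nextDeploymentStatus(jobStatus string, shutdownAfterJobFinishes bool) string {
	if isTerminal(jobStatus) && shutdownAfterJobFinishes {
		return deploymentComplete
	}
	return deploymentRunning
}

func main() {
	// The case from this issue: the job failed, but the cluster is kept alive.
	fmt.Println(nextDeploymentStatus("FAILED", false))
	// Same failure with shutdownAfterJobFinishes enabled.
	fmt.Println(nextDeploymentStatus("FAILED", true))
}
```

Under this model, the FAILED job in the original report keeps a Running deployment status only because the cluster is left up.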

In this case, I think we should leave this behavior the way it is in 0.6.0 instead of making a breaking change. Later, we can consider defining the JobDeploymentStatus in a better way (I think the current definition is a bit confusing, but I'm not sure. Maybe k8s users would expect the deployment status to be "Running" for as long as the RayCluster is up.)

What do you think?

kevin85421 commented 1 year ago

Can you explain the difference between Job Deployment Status and Job Status?

architkulkarni commented 1 year ago

JobStatus is the Ray Job Status pulled directly from Ray. See https://docs.ray.io/en/latest/cluster/running-applications/job-submission/doc/ray.job_submission.JobStatus.html and https://github.com/ray-project/kuberay/blob/384a921c2e2be7a84b5fc55f832a487186475eb7/ray-operator/apis/ray/v1alpha1/rayjob_types.go#L11-L21

JobDeploymentStatus is broader and can include statuses outside the scope of Ray, for example "FailedToGetOrCreateRayCluster", or "Suspended": https://github.com/ray-project/kuberay/blob/384a921c2e2be7a84b5fc55f832a487186475eb7/ray-operator/apis/ray/v1alpha1/rayjob_types.go#L32-L46
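
To make the distinction concrete, here is a minimal sketch of the two fields as independent values. The type `rayJobStatus` and its `summary` method are hypothetical and are not the definitions in rayjob_types.go; the status strings used are only the ones mentioned in this thread.

```go
package main

import "fmt"

// JobStatus is the status reported by Ray for the submitted job.
type JobStatus string

// JobDeploymentStatus describes the RayJob resource as a whole and can take
// values that have nothing to do with Ray's own job state.
type JobDeploymentStatus string

// rayJobStatus is a hypothetical, simplified view that holds both fields.
type rayJobStatus struct {
	JobStatus           JobStatus
	JobDeploymentStatus JobDeploymentStatus
}

// summary renders both fields on one line.
func (s rayJobStatus) summary() string {
	return fmt.Sprintf("Job Deployment Status: %s, Job Status: %s",
		s.JobDeploymentStatus, s.JobStatus)
}

func main() {
	// The combination from this issue: Ray says FAILED, the deployment is Running.
	reported := rayJobStatus{JobStatus: "FAILED", JobDeploymentStatus: "Running"}
	fmt.Println(reported.summary())

	// A deployment-level failure where Ray never reported any job status.
	noCluster := rayJobStatus{JobDeploymentStatus: "FailedToGetOrCreateRayCluster"}
	fmt.Println(noCluster.summary())
}
```

Because the fields are set independently, a combination such as FAILED with Running is representable, which is what `kubectl describe` showed above.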

aybidi commented 1 year ago

@architkulkarni I'm using Flyte to run a Ray task. It creates a RayJob CR on my cluster with shutdownAfterJobFinishes set to true. However, I see that the Job Deployment Status stays at RUNNING for a long time even when the Job Status is SUCCEEDED or FAILED (and hence the cluster doesn't get deleted after the job finishes).

The CR also has ttlSecondsAfterFinished set to 3600. Can you explain how ttlSecondsAfterFinished and shutdownAfterJobFinishes relate to each other? Does the Job Deployment Status transition to COMPLETE after ttlSecondsAfterFinished or is it supposed to transition immediately after the Job Status concludes?