Closed architkulkarni closed 9 months ago
cc @kevin85421, I think I need to fix this before the release.
Actually @kevin85421 I think this behavior is intended. Here's the relevant code:
If shutdownAfterJobFinishes is false (the default), then the cluster remains running, so the JobDeploymentStatus is still Running, regardless of the status of the underlying Ray Job.
If shutdownAfterJobFinishes is true, then the JobDeploymentStatus transitions to Complete as expected.
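To make the two cases above concrete, here is a minimal sketch in Python. It only models the behavior as described in this comment; it is not KubeRay's actual reconcile code, and the function name and the set of "finished" statuses are assumptions chosen for illustration.

```python
# Hypothetical helper: models the behavior described in this comment,
# not the operator's real implementation.
FINISHED_RAY_JOB_STATUSES = {"SUCCEEDED", "FAILED"}  # assumption: statuses named in this thread


def job_deployment_status(ray_job_status: str, shutdown_after_job_finishes: bool) -> str:
    """Return the JobDeploymentStatus one would expect to see on the RayJob."""
    job_finished = ray_job_status in FINISHED_RAY_JOB_STATUSES
    if job_finished and shutdown_after_job_finishes:
        # The cluster is torn down, so the deployment is reported as done.
        return "Complete"
    # Otherwise the RayCluster is still up, so the deployment stays "Running",
    # even if the underlying Ray job has already failed or succeeded.
    return "Running"
```

Under this model, a failed job with shutdownAfterJobFinishes left at its default still reports Running, which matches what the issue describes.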
In this case, I think we should leave this behavior the way it is in 0.6.0 instead of making a breaking change. Later, we can consider defining the JobDeploymentStatus in a better way (I think the current definition is a bit confusing, but I'm not sure. Maybe k8s users would expect the deployment status to be "Running" for as long as the RayCluster is up.)
What do you think?
Can you explain the difference between Job Deployment Status and Job Status?
JobStatus is the Ray Job Status pulled directly from Ray. https://docs.ray.io/en/latest/cluster/running-applications/job-submission/doc/ray.job_submission.JobStatus.html https://github.com/ray-project/kuberay/blob/384a921c2e2be7a84b5fc55f832a487186475eb7/ray-operator/apis/ray/v1alpha1/rayjob_types.go#L11-L21
JobDeploymentStatus is broader and can include statuses outside the scope of Ray, for example "FailedToGetOrCreateRayCluster" or "Suspended": https://github.com/ray-project/kuberay/blob/384a921c2e2be7a84b5fc55f832a487186475eb7/ray-operator/apis/ray/v1alpha1/rayjob_types.go#L32-L46
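As a rough illustration of the distinction, the two statuses can be thought of as separate enums that vary independently. The sketch below is in Python and is not the operator's Go type definitions. The JobStatus values follow Ray's documented job statuses; the JobDeploymentStatus list is deliberately partial and contains only the values mentioned in this thread. The RayJobStatusView class is hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class JobStatus(str, Enum):
    """Status of the Ray job itself, as reported by Ray."""
    PENDING = "PENDING"
    RUNNING = "RUNNING"
    STOPPED = "STOPPED"
    SUCCEEDED = "SUCCEEDED"
    FAILED = "FAILED"


class JobDeploymentStatus(str, Enum):
    """Status of the RayJob deployment on Kubernetes (partial list, see lead-in)."""
    RUNNING = "Running"
    COMPLETE = "Complete"
    SUSPENDED = "Suspended"
    FAILED_TO_GET_OR_CREATE_RAY_CLUSTER = "FailedToGetOrCreateRayCluster"


@dataclass
class RayJobStatusView:
    """Hypothetical container holding both statuses side by side."""
    deployment_status: JobDeploymentStatus
    job_status: Optional[JobStatus] = None  # None when Ray never reported a status


# The Ray job failed, but the cluster (and hence the deployment) is still up.
failed_but_running = RayJobStatusView(JobDeploymentStatus.RUNNING, JobStatus.FAILED)

# The cluster could not be created, so there is no Ray-level status at all.
no_cluster = RayJobStatusView(JobDeploymentStatus.FAILED_TO_GET_OR_CREATE_RAY_CLUSTER)
```

The second example shows why the deployment status needs values that Ray itself cannot report.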
@architkulkarni I'm using Flyte to run a Ray task. It creates a RayJob CR on my cluster with shutdownAfterJobFinishes set to true. However, I see that the Job Deployment Status stays at RUNNING for a long time even when the Job Status is either SUCCEEDED or FAILED (and hence the cluster doesn't get deleted after the job finishes).
The CR also has ttlSecondsAfterFinished set to 3600. Can you explain how ttlSecondsAfterFinished and shutdownAfterJobFinishes relate to each other? Does the Job Deployment Status transition to COMPLETE after ttlSecondsAfterFinished, or is it supposed to transition immediately after the Job Status concludes?
Search before asking
KubeRay Component
ray-operator
What happened + What you expected to happen
Start the kuberay operator, and then run
kubectl apply -f /Users/archit/kuberay/ray-operator/config/samples/ray_v1alpha1_rayjob.yaml
after modifying this YAML to include a typo or some other bug in the entrypoint field (to cause an application error). Then
kubectl describe rayjob rayjob-sample
will show the status as Running. The expected behavior is that the status is Complete, not Running.
Likely related, the kuberay operator log prints the following in a loop every 3 seconds:
Reproduction script
Above
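The manual check from the steps above could be scripted roughly as follows. This is a sketch in Python, not part of the original report. It assumes kubectl is on the PATH and that the RayJob status exposes the fields jobStatus and jobDeploymentStatus; both function names are hypothetical.

```python
import json
import subprocess

# Assumption: the Ray job statuses treated as "finished" in this thread.
FINISHED_RAY_JOB_STATUSES = {"SUCCEEDED", "FAILED"}


def fetch_rayjob(name: str = "rayjob-sample") -> dict:
    """Fetch the RayJob object as a dict via kubectl (requires cluster access)."""
    result = subprocess.run(
        ["kubectl", "get", "rayjob", name, "-o", "json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(result.stdout)


def shows_reported_behavior(rayjob: dict) -> bool:
    """True if the Ray job has finished but the deployment is still 'Running'."""
    status = rayjob.get("status", {})
    # Field names below are assumed, not confirmed by the report.
    job_finished = status.get("jobStatus") in FINISHED_RAY_JOB_STATUSES
    still_running = status.get("jobDeploymentStatus") == "Running"
    return job_finished and still_running
```

Calling shows_reported_behavior(fetch_rayjob()) after the job with the broken entrypoint has failed would return True if the behavior described in this issue is present.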
Anything else
Happens every time
Are you willing to submit a PR?