Open andrewsykim opened 2 weeks ago
I just had a quick glance, and I think we should reuse the code path of the existing state machine. To elaborate, we can add a new state, rayv1.JobDeploymentStatusRestarting, which goes through a path similar to rayv1.JobDeploymentStatusSuspending. In addition, all state machine transitions should happen inside the switch rayJobInstance.Status.JobDeploymentStatus statement. Currently, restartRayJobOnFailure is not in the switch statement.
case rayv1.JobDeploymentStatusSuspending, rayv1.JobDeploymentStatusRestarting:
// Delete RayCluster and submitter K8s Job
...
// Reset the RayCluster and Ray job related status.
...
// Reset the JobStatus to JobStatusNew
...
// Transition the JobDeploymentStatus to `Suspended` if the status is `Suspending`.
// Transition the JobDeploymentStatus to `New` if the status is `Restarting`.
rayJobInstance.Status.JobDeploymentStatus = ...
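To make the shared branch concrete, here is a minimal, self-contained Go sketch of the transition logic only. The types, the constant values, and the resetForNextRun helper are hypothetical stand-ins rather than KubeRay's actual API, and the deletion of the RayCluster and submitter Job is only marked by a comment.

```go
package main

// JobDeploymentStatus is a hypothetical stand-in for the rayv1 type of the same name.
type JobDeploymentStatus string

// The string values are illustrative assumptions; the "new" values are
// assumed to be empty strings for this sketch.
const (
	StatusNew        JobDeploymentStatus = ""
	StatusRunning    JobDeploymentStatus = "Running"
	StatusSuspending JobDeploymentStatus = "Suspending"
	StatusSuspended  JobDeploymentStatus = "Suspended"
	StatusRestarting JobDeploymentStatus = "Restarting"
)

// JobStatusNew is a stand-in for the reset value of the Ray job status.
const JobStatusNew = ""

// status is a trimmed-down, hypothetical version of the RayJob status.
type status struct {
	JobDeploymentStatus JobDeploymentStatus
	JobStatus           string
	RayClusterName      string
	JobID               string
}

// resetForNextRun handles the shared Suspending/Restarting branch.
// It returns the reset status and true when a transition happened,
// or the unchanged status and false for any other state.
func resetForNextRun(s status) (status, bool) {
	var next JobDeploymentStatus
	switch s.JobDeploymentStatus {
	case StatusSuspending:
		next = StatusSuspended
	case StatusRestarting:
		next = StatusNew
	default:
		return s, false
	}
	// A real controller would delete the RayCluster and the submitter
	// K8s Job here, before clearing the status fields below.
	return status{
		JobDeploymentStatus: next,
		JobStatus:           JobStatusNew,
	}, true
}
```

The only difference between the two states in this sketch is the target status, which is why a single case can serve both.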
@kevin85421 the reason I didn't use suspend is that I don't think it will play nicely with Kueue: suspending means each retry goes to the back of the job queue instead of being retried immediately on failure.
Actually, I think I misunderstood what you said. We can use the same state machine logic as suspend without actually suspending the RayJob; I will update the PR to do that. I'll also move restartRayJobOnFailure into the switch statement.
@kevin85421 I incorporated your feedback. If the overall approach looks good to you, I can polish the PR and add tests.
So far I've manually verified this works:
$ kubectl get rayjob -w
NAME            JOB STATUS   DEPLOYMENT STATUS   RAY CLUSTER NAME                 START TIME             END TIME               AGE
rayjob-sample                Initializing        rayjob-sample-raycluster-qhdx2   2024-06-13T14:42:08Z                          6s
rayjob-sample                Running             rayjob-sample-raycluster-qhdx2   2024-06-13T14:42:08Z                          76s
rayjob-sample   PENDING      Running             rayjob-sample-raycluster-qhdx2   2024-06-13T14:42:08Z                          78s
rayjob-sample   FAILED       Failed              rayjob-sample-raycluster-qhdx2   2024-06-13T14:42:08Z   2024-06-13T14:43:32Z   84s
rayjob-sample   FAILED       Restarting          rayjob-sample-raycluster-qhdx2   2024-06-13T14:42:08Z   2024-06-13T14:43:32Z   84s
rayjob-sample                                                                     2024-06-13T14:42:08Z   2024-06-13T14:43:32Z   84s
rayjob-sample                Initializing        rayjob-sample-raycluster-mqt2l   2024-06-13T14:43:32Z   2024-06-13T14:43:32Z   84s
rayjob-sample                Running             rayjob-sample-raycluster-mqt2l   2024-06-13T14:43:32Z   2024-06-13T14:43:32Z   110s
rayjob-sample   PENDING      Running             rayjob-sample-raycluster-mqt2l   2024-06-13T14:43:32Z   2024-06-13T14:43:32Z   112s
rayjob-sample   FAILED       Failed              rayjob-sample-raycluster-mqt2l   2024-06-13T14:43:32Z   2024-06-13T14:44:09Z   2m1s
rayjob-sample   FAILED       Restarting          rayjob-sample-raycluster-mqt2l   2024-06-13T14:43:32Z   2024-06-13T14:44:09Z   2m1s
rayjob-sample                                                                     2024-06-13T14:43:32Z   2024-06-13T14:44:09Z   2m1s
rayjob-sample                Initializing        rayjob-sample-raycluster-x576d   2024-06-13T14:44:09Z   2024-06-13T14:44:09Z   2m1s
rayjob-sample                Running             rayjob-sample-raycluster-x576d   2024-06-13T14:44:09Z   2024-06-13T14:44:09Z   2m27s
rayjob-sample   PENDING      Running             rayjob-sample-raycluster-x576d   2024-06-13T14:44:09Z   2024-06-13T14:44:09Z   2m28s
rayjob-sample   RUNNING      Running             rayjob-sample-raycluster-x576d   2024-06-13T14:44:09Z   2024-06-13T14:44:09Z   2m34s
rayjob-sample   FAILED       Failed              rayjob-sample-raycluster-x576d   2024-06-13T14:44:09Z   2024-06-13T14:44:45Z   2m37s
rayjob-sample   FAILED       Restarting          rayjob-sample-raycluster-x576d   2024-06-13T14:44:09Z   2024-06-13T14:44:45Z   2m37s
rayjob-sample                                                                     2024-06-13T14:44:09Z   2024-06-13T14:44:45Z   2m37s
rayjob-sample                Initializing        rayjob-sample-raycluster-89284   2024-06-13T14:44:45Z   2024-06-13T14:44:45Z   2m37s
rayjob-sample                Running             rayjob-sample-raycluster-89284   2024-06-13T14:44:45Z   2024-06-13T14:44:45Z   3m3s
rayjob-sample   PENDING      Running             rayjob-sample-raycluster-89284   2024-06-13T14:44:45Z   2024-06-13T14:44:45Z   3m5s
rayjob-sample   FAILED       Failed              rayjob-sample-raycluster-89284   2024-06-13T14:44:45Z   2024-06-13T14:45:22Z   3m14s
rayjob-sample   FAILED       Restarting          rayjob-sample-raycluster-89284   2024-06-13T14:44:45Z   2024-06-13T14:45:22Z   3m14s
rayjob-sample                                                                     2024-06-13T14:44:45Z   2024-06-13T14:45:22Z   3m14s
rayjob-sample                Initializing        rayjob-sample-raycluster-xdgv9   2024-06-13T14:45:22Z   2024-06-13T14:45:22Z   3m14s
Why are these changes needed?
Add a new field, spec.backOffLimit, to RayJob for retrying failed jobs. A retry involves deleting and recreating the RayCluster.
Marking WIP since I haven't added tests yet.
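As a rough illustration of how the retry decision could work, the Go sketch below assumes the limit counts retries in the same way as a Kubernetes Job's backoffLimit, so a limit of 2 allows up to three runs in total. The shouldRestart helper, its parameters, and the "nil means no retries" default are assumptions for illustration and are not taken from the PR.

```go
package main

// shouldRestart reports whether a RayJob whose latest run just failed may be
// retried. failed is the number of runs that have ended in failure so far,
// including the latest one. backOffLimit is the maximum number of retries;
// nil is assumed here to mean that retries are disabled.
func shouldRestart(failed int32, backOffLimit *int32) bool {
	if backOffLimit == nil {
		return false
	}
	// The first failure consumes no retry budget yet, so a retry is
	// allowed while the failure count does not exceed the limit.
	return failed <= *backOffLimit
}
```

With a limit of 2, the first and second failures would move the job to Restarting, and the third would leave it in Failed.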
Related issue number
Fixes https://github.com/ray-project/kuberay/issues/1902
Checks