ray-project / kuberay

A toolkit to run Ray applications on Kubernetes
Apache License 2.0

[RayJob] Add spec.backoffLimit for retrying RayJobs with new clusters #2192

Open andrewsykim opened 2 weeks ago

andrewsykim commented 2 weeks ago

Why are these changes needed?

Add a new field spec.backoffLimit to RayJob for retrying failed jobs. A retry involves deleting and recreating the RayCluster.
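
For illustration, here is a minimal sketch of how a retry limit like this could be checked. It is not the code in this PR: the types RayJobSpec and RayJobStatus below are trimmed-down stand-ins, the Failed counter and the shouldRetry helper are hypothetical names, and the semantics (the limit counts retries after the first attempt, and an unset field means no retries) are assumptions modeled on the Kubernetes Job field of the same name.

```go
package main

import "fmt"

// RayJobSpec is a hypothetical, trimmed-down stand-in for the real spec type.
type RayJobSpec struct {
	// BackoffLimit is the assumed Go form of spec.backoffLimit.
	// A nil pointer means the field was not set.
	BackoffLimit *int32
}

// RayJobStatus is a hypothetical stand-in holding only what the sketch needs.
type RayJobStatus struct {
	// Failed is an assumed counter of failed attempts so far.
	Failed int32
}

// shouldRetry reports whether another attempt is allowed.
// Assumption: the limit counts retries after the first attempt, so a limit
// of N permits up to N+1 attempts in total.
func shouldRetry(spec RayJobSpec, status RayJobStatus) bool {
	if spec.BackoffLimit == nil {
		return false // assumed default: no retries when the field is unset
	}
	return status.Failed <= *spec.BackoffLimit
}

func main() {
	limit := int32(2)
	spec := RayJobSpec{BackoffLimit: &limit}
	for failed := int32(1); failed <= 4; failed++ {
		retry := shouldRetry(spec, RayJobStatus{Failed: failed})
		fmt.Printf("failed=%d retry=%t\n", failed, retry)
	}
}
```

Under these assumptions, each allowed retry would correspond to one delete-and-recreate cycle of the RayCluster.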

Marking this as WIP since I haven't added tests yet.

Related issue number

Fixes https://github.com/ray-project/kuberay/issues/1902

kevin85421 commented 2 weeks ago

I just had a quick glance, and I think we should reuse the code path of the existing state machine. To elaborate, we can add a new state rayv1.JobDeploymentStatusRestarting, which goes through a path similar to that of rayv1.JobDeploymentStatusSuspending. In addition, all state machine transitions should be handled in the switch rayJobInstance.Status.JobDeploymentStatus statement. Currently, restartRayJobOnFailure is not in the switch statement.

case rayv1.JobDeploymentStatusSuspending, rayv1.JobDeploymentStatusRestarting:
    // Delete RayCluster and submitter K8s Job
    ...
    // Reset the RayCluster and Ray job related status.
    ...
    // Reset the JobStatus to JobStatusNew
    ...
    // Transition the JobDeploymentStatus to `Suspended` if the status is `Suspending`.
    // Transition the JobDeploymentStatus to `New` if the status is `Restarting`.
    rayJobInstance.Status.JobDeploymentStatus = ...

andrewsykim commented 2 weeks ago

@kevin85421 the reason I didn't use suspend is that I don't think it will play nicely with Kueue: suspending means each retry goes to the back of the job queue instead of being retried immediately on failure.

andrewsykim commented 2 weeks ago

Actually, I think I misunderstood what you said. We can use the same state machine logic as suspend without actually suspending the RayJob; I will update the PR to do that. I'll also move restartRayJobOnFailure into the switch statement.
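
The shared code path discussed above can be sketched roughly as follows. This is a self-contained illustration, not the controller code: the RayJob struct, the RetriesLeft field, the step function, and the constant values are all hypothetical stand-ins for the rayv1 types, and the cluster deletion is reduced to clearing a field.

```go
package main

import "fmt"

// JobDeploymentStatus is a stand-in for the rayv1 type of the same name.
// The string values here are illustrative, not the real constants.
type JobDeploymentStatus string

const (
	StatusNew        JobDeploymentStatus = "New"
	StatusFailed     JobDeploymentStatus = "Failed"
	StatusSuspending JobDeploymentStatus = "Suspending"
	StatusSuspended  JobDeploymentStatus = "Suspended"
	StatusRestarting JobDeploymentStatus = "Restarting"
)

// JobStatusNew is a stand-in for the reset value of the Ray job status.
const JobStatusNew = ""

// RayJob is a hypothetical, minimal model of the fields this sketch touches.
type RayJob struct {
	DeploymentStatus JobDeploymentStatus
	RayClusterName   string
	JobStatus        string
	RetriesLeft      int // assumed bookkeeping for remaining retries
}

// step performs one reconcile-style transition. Every transition lives in
// the switch, including the decision to restart after a failure.
func step(job *RayJob) {
	switch job.DeploymentStatus {
	case StatusFailed:
		// Only move to Restarting while retries remain.
		if job.RetriesLeft > 0 {
			job.RetriesLeft--
			job.DeploymentStatus = StatusRestarting
		}
	case StatusSuspending, StatusRestarting:
		// Stand-in for deleting the RayCluster and submitter Kubernetes Job.
		job.RayClusterName = ""
		// Reset the Ray job status.
		job.JobStatus = JobStatusNew
		// Suspending ends in Suspended; Restarting goes back to New so a
		// fresh cluster is created without suspending the RayJob.
		if job.DeploymentStatus == StatusSuspending {
			job.DeploymentStatus = StatusSuspended
		} else {
			job.DeploymentStatus = StatusNew
		}
	}
}

func main() {
	job := &RayJob{
		DeploymentStatus: StatusFailed,
		RayClusterName:   "cluster-a",
		JobStatus:        "FAILED",
		RetriesLeft:      1,
	}
	step(job)
	fmt.Println("after first step:", job.DeploymentStatus)
	step(job)
	fmt.Println("after second step:", job.DeploymentStatus)
}
```

Because Restarting returns to the initial state rather than to Suspended, this shape would avoid the Kueue requeue concern raised earlier, assuming the retry never sets the suspend flag.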

andrewsykim commented 2 weeks ago

@kevin85421 I incorporated your feedback. If the overall approach looks good to you, I can polish the PR and add tests.

So far I've manually verified this works:

$ kubectl get rayjob -w
NAME            JOB STATUS   DEPLOYMENT STATUS   RAY CLUSTER NAME                 START TIME             END TIME   AGE
rayjob-sample                Initializing        rayjob-sample-raycluster-qhdx2   2024-06-13T14:42:08Z              6s
rayjob-sample                Running             rayjob-sample-raycluster-qhdx2   2024-06-13T14:42:08Z              76s
rayjob-sample   PENDING      Running             rayjob-sample-raycluster-qhdx2   2024-06-13T14:42:08Z              78s
rayjob-sample   FAILED       Failed              rayjob-sample-raycluster-qhdx2   2024-06-13T14:42:08Z   2024-06-13T14:43:32Z   84s
rayjob-sample   FAILED       Restarting          rayjob-sample-raycluster-qhdx2   2024-06-13T14:42:08Z   2024-06-13T14:43:32Z   84s
rayjob-sample                                                                     2024-06-13T14:42:08Z   2024-06-13T14:43:32Z   84s
rayjob-sample                Initializing        rayjob-sample-raycluster-mqt2l   2024-06-13T14:43:32Z   2024-06-13T14:43:32Z   84s
rayjob-sample                Running             rayjob-sample-raycluster-mqt2l   2024-06-13T14:43:32Z   2024-06-13T14:43:32Z   110s
rayjob-sample   PENDING      Running             rayjob-sample-raycluster-mqt2l   2024-06-13T14:43:32Z   2024-06-13T14:43:32Z   112s
rayjob-sample   FAILED       Failed              rayjob-sample-raycluster-mqt2l   2024-06-13T14:43:32Z   2024-06-13T14:44:09Z   2m1s
rayjob-sample   FAILED       Restarting          rayjob-sample-raycluster-mqt2l   2024-06-13T14:43:32Z   2024-06-13T14:44:09Z   2m1s
rayjob-sample                                                                     2024-06-13T14:43:32Z   2024-06-13T14:44:09Z   2m1s
rayjob-sample                Initializing        rayjob-sample-raycluster-x576d   2024-06-13T14:44:09Z   2024-06-13T14:44:09Z   2m1s
rayjob-sample                Running             rayjob-sample-raycluster-x576d   2024-06-13T14:44:09Z   2024-06-13T14:44:09Z   2m27s
rayjob-sample   PENDING      Running             rayjob-sample-raycluster-x576d   2024-06-13T14:44:09Z   2024-06-13T14:44:09Z   2m28s
rayjob-sample   RUNNING      Running             rayjob-sample-raycluster-x576d   2024-06-13T14:44:09Z   2024-06-13T14:44:09Z   2m34s
rayjob-sample   FAILED       Failed              rayjob-sample-raycluster-x576d   2024-06-13T14:44:09Z   2024-06-13T14:44:45Z   2m37s
rayjob-sample   FAILED       Restarting          rayjob-sample-raycluster-x576d   2024-06-13T14:44:09Z   2024-06-13T14:44:45Z   2m37s
rayjob-sample                                                                     2024-06-13T14:44:09Z   2024-06-13T14:44:45Z   2m37s
rayjob-sample                Initializing        rayjob-sample-raycluster-89284   2024-06-13T14:44:45Z   2024-06-13T14:44:45Z   2m37s
rayjob-sample                Running             rayjob-sample-raycluster-89284   2024-06-13T14:44:45Z   2024-06-13T14:44:45Z   3m3s
rayjob-sample   PENDING      Running             rayjob-sample-raycluster-89284   2024-06-13T14:44:45Z   2024-06-13T14:44:45Z   3m5s
rayjob-sample   FAILED       Failed              rayjob-sample-raycluster-89284   2024-06-13T14:44:45Z   2024-06-13T14:45:22Z   3m14s
rayjob-sample   FAILED       Restarting          rayjob-sample-raycluster-89284   2024-06-13T14:44:45Z   2024-06-13T14:45:22Z   3m14s
rayjob-sample                                                                     2024-06-13T14:44:45Z   2024-06-13T14:45:22Z   3m14s
rayjob-sample                Initializing        rayjob-sample-raycluster-xdgv9   2024-06-13T14:45:22Z   2024-06-13T14:45:22Z   3m14s