ray-project / kuberay

A toolkit to run Ray applications on Kubernetes
Apache License 2.0

[RayJob] Unified checkBackoffLimitAndUpdateStatusIfNeeded codepath and add an e2e test for retry #2215

Closed. kevin85421 closed this 1 day ago

kevin85421 commented 3 days ago

Why are these changes needed?

This PR unifies the code path that calls checkBackoffLimitAndUpdateStatusIfNeeded, so that any new case added in the future that fails the RayJob cannot forget to call the function.
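For readers unfamiliar with the pattern, here is a minimal sketch of what a unified code path can look like. It is not the actual KubeRay implementation: `Status`, `detector`, `reconcile`, and the simplified `checkBackoffLimit` are hypothetical stand-ins, and the `failed < backoffLimit+1` comparison is an assumption made for illustration. The point is only that the backoff check runs at one call site after all failure cases, rather than inside each of them.

```go
package main

import "fmt"

// Status is a hypothetical, simplified stand-in for the RayJob status fields.
type Status struct {
	Deployment string // "Running", "Failed", or "Retrying"
	Reason     string
	Failed     int32
}

// detector is a hypothetical failure case; it may mark the job as Failed.
type detector func(s *Status)

// checkBackoffLimit is a simplified stand-in for
// checkBackoffLimitAndUpdateStatusIfNeeded (assumed logic, not the real one):
// it counts the failure and moves the job to Retrying while attempts remain.
func checkBackoffLimit(s *Status, backoffLimit int32) {
	if s.Deployment != "Failed" {
		return
	}
	s.Failed++
	if s.Failed < backoffLimit+1 {
		s.Deployment = "Retrying"
	}
}

// reconcile runs the failure cases in order, then performs the backoff check
// at a single call site, so a newly added detector cannot skip it.
func reconcile(s *Status, backoffLimit int32, detectors ...detector) {
	for _, detect := range detectors {
		detect(s)
		if s.Deployment == "Failed" {
			break
		}
	}
	checkBackoffLimit(s, backoffLimit)
}

func main() {
	// A hypothetical failure case that marks the job as failed.
	appFailed := func(s *Status) { s.Deployment, s.Reason = "Failed", "AppFailed" }

	s := &Status{Deployment: "Running"}
	reconcile(s, 2, appFailed)
	fmt.Println(s.Deployment, s.Failed)
}
```

With this shape, adding another failure case means adding another detector; the backoff bookkeeping stays in one place.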

Related issue number

Checks

I manually tested the case in which a RayJob exceeds ActiveDeadlineSeconds. The controller logged:

{"level":"info","ts":"2024-07-02T17:02:22.504Z","logger":"controllers.RayJob","msg":"RayJob is not eligible for retry due to failure with DeadlineExceeded","RayJob":{"name":"rayjob-sample","namespace":"default"},"reconcileID":"6d29d711-507a-4f21-b8de-d5a45bf394cf","backoffLimit":2,"succeeded":0,"failed":1}
[Screenshot: 2024-07-02 at 10:07:20 AM]
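The decision reflected in the log line above can be sketched roughly as follows. This is an illustration under stated assumptions, not the operator's code: `eligibleForRetry` is a hypothetical helper, and the `failed < backoffLimit+1` comparison is assumed. Only the DeadlineExceeded exclusion and the `backoffLimit`/`failed` values come from the log.

```go
package main

import "fmt"

// eligibleForRetry is a hypothetical helper. A failure caused by
// DeadlineExceeded is never retried; any other failure is retried while the
// failure count is still within the backoff limit (assumed comparison).
func eligibleForRetry(reason string, failed, backoffLimit int32) bool {
	if reason == "DeadlineExceeded" {
		return false
	}
	return failed < backoffLimit+1
}

func main() {
	// Values taken from the log line: backoffLimit=2, failed=1.
	fmt.Println(eligibleForRetry("DeadlineExceeded", 1, 2))
	// Same counters with a different, illustrative failure reason.
	fmt.Println(eligibleForRetry("AppFailed", 1, 2))
}
```

The first call mirrors the manual test: attempts remain under the backoff limit, but the deadline failure still rules out a retry.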
kevin85421 commented 3 days ago

cc @andrewsykim would you mind taking a look? Thanks!

kevin85421 commented 2 days ago

Updated the PR description with more details about the tests.