ray-project / kuberay

A toolkit to run Ray applications on Kubernetes
Apache License 2.0

Make k8s job backoff limit configurable for RayJob #2091

Closed jjyao closed 5 months ago

jjyao commented 5 months ago

Why are these changes needed?

Allow users to specify BackoffLimit of the submitter k8s job of a RayJob
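
For context, Kubernetes `batch/v1` Jobs already expose a `spec.backoffLimit` field (default 6) that caps how many times a failed pod is retried before the Job is marked failed. The fragment below is a minimal sketch of a submitter Job with that field set explicitly. The Job name, image, address, and command are placeholders chosen for illustration and are not taken from this PR.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: rayjob-sample-submitter          # hypothetical name
spec:
  backoffLimit: 2                        # retries before the Job is marked failed (Kubernetes default: 6)
  template:
    spec:
      restartPolicy: Never               # Jobs only allow Never or OnFailure
      containers:
        - name: submitter
          image: rayproject/ray:latest   # placeholder image
          # placeholder command and address, for illustration only
          command: ["ray", "job", "submit", "--address", "http://rayjob-sample-head-svc:8265", "--", "python", "script.py"]
```

This only shows the underlying Job field. How the RayJob spec surfaces the value is what the new configs in this PR define.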

Related issue number

Closes #2058

jjyao commented 5 months ago

cc @andrewsykim @kevin85421: I updated the PR based on our discussion. Could you take a look at the new configs? If they look good, I'll polish the PR.

kevin85421 commented 5 months ago

The new CRD looks good to me.

andrewsykim commented 5 months ago

@kevin85421 @jjyao what do you think about this API? It also addresses some of the feature requests in https://github.com/ray-project/kuberay/issues/1902:

spec:
  retryConfig:
    policy: RetryWithSameSubmissionID # future values: RetryWithNewSubmissionID and RetryWithNewCluster 
    backOffLimit: 2

(or something like this)