ray-project / kuberay

A toolkit to run Ray applications on Kubernetes
Apache License 2.0

[Feature] Support k8s job backoff limit configuration for KubeRay jobs #2058

Closed: peterghaddad closed this issue 2 months ago

peterghaddad commented 3 months ago

Description

Currently, when KubeRay creates a new Ray Job, the k8s job that it creates allows the pod to run up to three times (the initial attempt plus two retries).

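The field that should be configurable is backoffLimit in the k8s job spec, which is currently set to 2.

Making it configurable would allow jobs to fail fast, for example by setting it to 0 so that the pod is never retried.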
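
To make the relationship between backoffLimit and the number of pod attempts concrete, here is a minimal sketch. It assumes the k8s job's pod template uses restartPolicy: Never, so every failure creates a new pod. The helper name max_pod_attempts is made up for illustration and is not part of KubeRay or Kubernetes.

```python
def max_pod_attempts(backoff_limit: int) -> int:
    """Return how many pods a k8s job may run before it is marked Failed.

    Assumption: restartPolicy is Never, so each failure counts as one
    failed pod. The job fails once retries are exhausted, i.e. after
    the initial attempt plus `backoff_limit` retries.
    """
    if backoff_limit < 0:
        raise ValueError("backoffLimit must be non-negative")
    return backoff_limit + 1


# 0 = fail fast, 2 = the value seen in this issue, 6 = the Kubernetes default.
for limit in (0, 2, 6):
    print(f"backoffLimit={limit}: up to {max_pod_attempts(limit)} pod attempt(s)")
```

With backoffLimit: 2 this gives three attempts, which matches the failed: 3 count reported in the job status.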

Use case

Allow users to configure the retry logic for K8s jobs. In the example below, the job fails because each retry reuses the same job submission name, so the retries cannot succeed.

kind: Job
apiVersion: batch/v1
spec:
  parallelism: 1
  completions: 1
  backoffLimit: 2
status:
  conditions:
    - type: Failed
      status: 'True'
      lastProbeTime: '2024-04-01T11:34:18Z'
      lastTransitionTime: '2024-04-01T11:34:18Z'
      reason: BackoffLimitExceeded
      message: Job has reached the specified backoff limit
  startTime: '2024-04-01T11:29:25Z'
  failed: 3
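
For anyone who wants to detect this state programmatically, the following is a rough sketch that checks a job object for the BackoffLimitExceeded condition. The job is represented as a plain dict mirroring the YAML above. The function name hit_backoff_limit is hypothetical, and fetching the object from the cluster is out of scope here.

```python
from typing import Any


def hit_backoff_limit(job: dict[str, Any]) -> bool:
    """Return True if the job has a Failed condition caused by BackoffLimitExceeded."""
    conditions = (job.get("status") or {}).get("conditions") or []
    return any(
        c.get("type") == "Failed"
        and c.get("status") == "True"
        and c.get("reason") == "BackoffLimitExceeded"
        for c in conditions
    )


# The job from this issue, reduced to the fields that matter for the check.
job = {
    "kind": "Job",
    "apiVersion": "batch/v1",
    "spec": {"parallelism": 1, "completions": 1, "backoffLimit": 2},
    "status": {
        "conditions": [
            {
                "type": "Failed",
                "status": "True",
                "reason": "BackoffLimitExceeded",
                "message": "Job has reached the specified backoff limit",
            }
        ],
        "failed": 3,
    },
}

print(hit_backoff_limit(job))
```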
