Closed peterghaddad closed 2 months ago
Currently, when KubeRay creates a new Ray Job, the k8s Job it creates lets the pod run up to three times (the initial attempt plus two retries).
The field that should be configurable is backoffLimit in the k8s Job spec, which is currently set to 2:
backoffLimit: 2
Making this value configurable would let jobs fail fast.
Allow users to configure the retry logic for k8s Jobs. The example below shows a Job that failed because every retry reused the same job submission name.
kind: Job
apiVersion: batch/v1
spec:
  parallelism: 1
  completions: 1
  backoffLimit: 2
status:
  conditions:
    - type: Failed
      status: 'True'
      lastProbeTime: '2024-04-01T11:34:18Z'
      lastTransitionTime: '2024-04-01T11:34:18Z'
      reason: BackoffLimitExceeded
      message: Job has reached the specified backoff limit
  startTime: '2024-04-01T11:29:25Z'
  failed: 3
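To make the failure mode concrete, here is a rough Python sketch of how the retries play out. It is a toy model, not KubeRay or Kubernetes code: FakeRayCluster, run_k8s_job, and the submission name "my-job" are hypothetical names invented for illustration, and the sketch assumes the first attempt fails for some unrelated reason after its submission name has been registered. It shows why backoffLimit: 2 ends with failed: 3, and why a limit of 0 would fail fast.

```python
from dataclasses import dataclass, field


@dataclass
class FakeRayCluster:
    """Hypothetical stand-in for a cluster that rejects duplicate submission names."""

    submission_ids: set = field(default_factory=set)

    def submit(self, submission_id: str) -> bool:
        # A retry that reuses a registered name is rejected outright.
        if submission_id in self.submission_ids:
            return False
        # First attempt: the name is registered, then the run is assumed
        # to fail for an unrelated reason (assumption of this sketch).
        self.submission_ids.add(submission_id)
        return False


def run_k8s_job(backoff_limit: int, cluster: FakeRayCluster, submission_id: str) -> dict:
    """Retry the submission until it succeeds or failures exceed backoff_limit."""
    failed = 0
    while True:
        if cluster.submit(submission_id):
            return {"succeeded": True, "failed": failed}
        failed += 1
        # The Job is marked failed once failures exceed backoffLimit.
        if failed > backoff_limit:
            return {
                "succeeded": False,
                "failed": failed,
                "reason": "BackoffLimitExceeded",
            }


if __name__ == "__main__":
    print(run_k8s_job(2, FakeRayCluster(), "my-job"))  # three failed pods
    print(run_k8s_job(0, FakeRayCluster(), "my-job"))  # fails fast after one pod
```

In this model the two retries can never succeed, since the name is already taken, so they only delay the BackoffLimitExceeded condition. That is the motivation for letting users lower the limit.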