kubeflow / common

Common APIs and libraries shared by other Kubeflow operator repositories.
Apache License 2.0

When `restartPolicy` is `ExitCode` and a pod is deleted (137), the entire TFJob will still be marked as failed. #186

Closed. rllin closed this issue 1 year ago.

rllin commented 2 years ago

Expectation: if a replica has `restartPolicy: ExitCode`, then deleting a pod (which produces exit code 137) should cause that pod to restart without failing the TFJob. Reality: the entire TFJob is marked as failed.

With `OnFailure`, however, the behavior is correct: the pod is restarted without the entire job failing.
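
(As I understand it, `ExitCode` is documented to treat exit codes 1-127 as permanent errors and 128-255 as retryable; a deleted pod exits with 137 = 128 + SIGKILL, so a restart is the expected outcome. Below is a minimal sketch of that rule, using a hypothetical helper name rather than the operator's actual code.)

```go
package main

import "fmt"

// isRetryableExitCode sketches the documented ExitCode-policy rule:
// exit codes 1-127 are treated as permanent errors (no restart), while
// 128-255 (terminations by signal) are treated as retryable. The helper
// name is hypothetical, not taken from the operator's source.
func isRetryableExitCode(exitCode int32) bool {
	return exitCode >= 128
}

func main() {
	// 137 = 128 + 9 (SIGKILL), which is what deleting the pod produces.
	fmt.Println(isRetryableExitCode(137)) // true  -> just restart the pod
	fmt.Println(isRetryableExitCode(1))   // false -> treat as permanent failure
}
```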

How to reproduce:

  1. Apply the spec below [0].
  2. Delete the Evaluator pod.
  3. Observe that the entire job fails.
  4. Change the Evaluator's `restartPolicy` to `OnFailure`, repeat steps 1-2, and observe that the pod restarts without failing the job.

Suspected issue (may not be core issue):

[0] TFJob spec.

apiVersion: kubeflow.org/v1
kind: TFJob
spec:
  tfReplicaSpecs:
    Chief:
      replicas: 1
      restartPolicy: ExitCode
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: "false"
        spec:
          affinity: {}
          containers:
          - command: [ "/bin/bash", "-c", "--" ]
            args: [ "while true; do sleep 30; done;" ]
            image: busybox
            imagePullPolicy: Always
            name: tensorflow
            ports:
            - containerPort: 2222
              name: tfjob-port
              protocol: TCP
            resources:
              limits:
                cpu: "2"
                memory: 10Gi
    Evaluator:
      replicas: 1
      restartPolicy: ExitCode
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: "false"
        spec:
          containers:
          - command: [ "/bin/bash", "-c", "--" ]
            args: [ "while true; do sleep 30; done;" ]
            image: busybox
            imagePullPolicy: Always
            name: tensorflow
            ports:
            - containerPort: 2222
              name: tfjob-port
              protocol: TCP
            resources:
              limits:
                cpu: "2"
                memory: 10Gi
rllin commented 2 years ago

Extra information:

Chief works as expected:

  1. set restartPolicy of Chief to ExitCode
  2. delete chief pod
  3. job is still running and chief pod comes back as intended

Worker, however, does not work either; like Evaluator, deleting a Worker pod fails the whole job.

gaocegege commented 2 years ago

/cc @zw0610 @kubeflow/wg-training-leads

cheimu commented 2 years ago

Hi @rllin,


For busybox, I use `/bin/sh` to make it run; otherwise the pod gets a `ContainerCannotRun` error.

It is unrelated to the main issue.

rllin commented 2 years ago

@cheimu sorry, that's my bad. I had to replace an internal command with the sleep loop and hadn't double-checked the command properly.

cheimu commented 2 years ago

Well, it's a very tricky problem. It took me a very long time to figure it out.

The call chain is as follows:

  1. `jc.Controller.ReconcilePods(metaObject, &jobStatus, pods, rtype, spec, replicas)` -> `commonutil.UpdateJobConditions()`: at this point the TFJob has a Restarting condition.
  2. `jc.Controller.UpdateJobStatus(job, replicas, &jobStatus)` -> `commonutil.UpdateJobConditions()`: HERE WE HIT THE PROBLEM.

There is a for loop over the replica types here. For @rllin's case it iterates twice, Chief first and then Evaluator. At iteration 1 the existing conditions are Created and Restarting, and rtype is Chief, so it "appends" a new Running condition to the job, as shown in the screenshot. HOWEVER, in this case it is not really an append at all.

[screenshot omitted]
  3. `commonutil.UpdateJobConditions()` -> `setCondition(jobStatus, condition)` -> `newConditions := filterOutCondition(status.Conditions, condition.Type)`: in our case the existing conditions are Created and Restarting and the new one is Running, so the Restarting condition hits the `continue` branch (as shown in the screenshot) and is filtered out. The job conditions then become Created and Running. At iteration 2 (for the Evaluator), because there is a failed pod while the job condition is Running, the job status becomes Failed as well (see the sketch further down).

Here the job is set to failed.

[screenshot omitted]

https://github.com/kubeflow/training-operator/blob/master/pkg/controller.v1/tensorflow/tfjob_controller.go#L526, and the whole job is failed....
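
To make the sequence above concrete, here is a minimal, self-contained sketch of the condition bookkeeping as described (simplified types and function names, not the actual kubeflow/common code): setting a Running condition filters out an existing Restarting condition, so when the loop reaches the Evaluator the job no longer looks like it is restarting and the failed pod flips the whole job to Failed.

```go
package main

import "fmt"

// Simplified stand-ins for the job condition types discussed above;
// illustrative only, not the real kubeflow/common definitions.
type ConditionType string

const (
	JobCreated    ConditionType = "Created"
	JobRunning    ConditionType = "Running"
	JobRestarting ConditionType = "Restarting"
	JobFailed     ConditionType = "Failed"
)

// filterOut mimics the behavior attributed to filterOutCondition():
// duplicates of the condition being set are dropped, and Running and
// Restarting filter each other out.
func filterOut(conds []ConditionType, newType ConditionType) []ConditionType {
	var kept []ConditionType
	for _, c := range conds {
		if c == newType {
			continue
		}
		// This is the branch where the Restarting condition is lost
		// when a Running condition is being set.
		if newType == JobRunning && c == JobRestarting {
			continue
		}
		if newType == JobRestarting && c == JobRunning {
			continue
		}
		kept = append(kept, c)
	}
	return kept
}

// setCond mimics setCondition(): filter the existing conditions, then append.
func setCond(conds []ConditionType, newType ConditionType) []ConditionType {
	return append(filterOut(conds, newType), newType)
}

func main() {
	// After ReconcilePods the job carries Created + Restarting.
	conds := []ConditionType{JobCreated, JobRestarting}

	// UpdateJobStatus, loop iteration 1 (Chief): a Running condition is
	// set, which silently drops Restarting.
	conds = setCond(conds, JobRunning)
	fmt.Println(conds) // [Created Running] -> Restarting is gone

	// Loop iteration 2 (Evaluator): the job looks Running but has a
	// failed pod, so the whole job is marked Failed.
	hasFailedPod := true
	if hasFailedPod && conds[len(conds)-1] == JobRunning {
		conds = setCond(conds, JobFailed)
	}
	fmt.Println(conds) // [Created Running Failed] -> entire TFJob fails
}
```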

@zw0610 @gaocegege Hi experts, am I correct about `filterOutCondition()`? I find it confusing...

pavanky commented 2 years ago

This is in line with what we think the issue is, but we weren't sure whether the fix should go in `UpdateJobConditions` or whether line 504 in the second loop needs to check for `JobRunning`.

cheimu commented 2 years ago

> This is in line with what we think the issue is, but we weren't sure whether the fix should go in `UpdateJobConditions` or whether line 504 in the second loop needs to check for `JobRunning`.

Yeah, probably. I tried recording the Restarting condition first and then appending it back; it passed the unit tests and worked for this situation. Please discuss whether there is a more elegant way to handle it! :D
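
A rough sketch of that workaround, reusing the simplified helpers from the earlier snippet (the actual change may differ): remember whether a Restarting condition existed before setting Running, and put it back afterwards so it is not silently dropped.

```go
// setCondPreservingRestart sketches the workaround described above. It
// builds on the ConditionType helpers from the previous sketch and is
// illustrative only, not the actual patch.
func setCondPreservingRestart(conds []ConditionType, newType ConditionType) []ConditionType {
	wasRestarting := false
	for _, c := range conds {
		if c == JobRestarting {
			wasRestarting = true
		}
	}
	conds = setCond(conds, newType)
	if newType == JobRunning && wasRestarting {
		// Re-append directly; going through setCond would drop Running again.
		conds = append(conds, JobRestarting)
	}
	return conds
}
```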

gaocegege commented 2 years ago

/cc @jian-he

It seems the code is from @jian-he; would you mind having a look at this issue?

rllin commented 2 years ago

@cheimu thanks for the quick turnaround!

johnugeorge commented 1 year ago

Fixed by https://github.com/kubeflow/training-operator/pull/1562