kubeflow / common

Common APIs and libraries shared by other Kubeflow operator repositories.
Apache License 2.0

When `restartPolicy` is `ExitCode` and a pod is deleted (137), the entire TFJob will still be marked as failed. #186

Closed. rllin closed this issue 1 year ago.

rllin commented 2 years ago

Expectation: if a replica has `restartPolicy: ExitCode`, then deleting a pod (which produces exit code 137) should cause that pod to restart without failing the TFJob. Reality: the entire TFJob is marked as failed.

With `OnFailure`, however, the behavior is correct: the pod is restarted without the entire job failing.
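
(As I understand it, `ExitCode` is documented to treat exit codes 1-127 as permanent errors and 128-255 as retryable; a deleted pod exits with 137 = 128 + SIGKILL, so a restart is the expected outcome. Below is a minimal sketch of that rule, using a hypothetical helper name rather than the operator's actual code.)

```go
package main

import "fmt"

// isRetryableExitCode sketches the documented ExitCode-policy rule:
// exit codes 1-127 are treated as permanent errors (no restart), while
// 128-255 (terminations by signal) are treated as retryable. The helper
// name is hypothetical, not taken from the operator's source.
func isRetryableExitCode(exitCode int32) bool {
	return exitCode >= 128
}

func main() {
	// 137 = 128 + 9 (SIGKILL), which is what deleting the pod produces.
	fmt.Println(isRetryableExitCode(137)) // true  -> just restart the pod
	fmt.Println(isRetryableExitCode(1))   // false -> treat as permanent failure
}
```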

How to reproduce:

  1. Apply the spec below [0].
  2. Delete the Evaluator pod.
  3. Observe that the entire job fails.
  4. Change the Evaluator's `restartPolicy` to `OnFailure`, repeat steps 1-2, and observe that the pod restarts without failing the job.

Suspected issue (may not be core issue):

[0] TFJob spec.

apiVersion: kubeflow.org/v1
kind: TFJob
spec:
  tfReplicaSpecs:
    Chief:
      replicas: 1
      restartPolicy: ExitCode
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: "false"
        spec:
          affinity: {}
          containers:
          - command: [ "/bin/bash", "-c", "--" ]
            args: [ "while true; do sleep 30; done;" ]
            image: busybox
            imagePullPolicy: Always
            name: tensorflow
            ports:
            - containerPort: 2222
              name: tfjob-port
              protocol: TCP
            resources:
              limits:
                cpu: "2"
                memory: 10Gi
    Evaluator:
      replicas: 1
      restartPolicy: ExitCode
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: "false"
        spec:
          containers:
          - command: [ "/bin/bash", "-c", "--" ]
            args: [ "while true; do sleep 30; done;" ]
            image: busybox
            imagePullPolicy: Always
            name: tensorflow
            ports:
            - containerPort: 2222
              name: tfjob-port
              protocol: TCP
            resources:
              limits:
                cpu: "2"
                memory: 10Gi
rllin commented 2 years ago

Extra information:

Chief works as expected:

  1. set restartPolicy of Chief to ExitCode
  2. delete chief pod
  3. job is still running and chief pod comes back as intended

Worker, however, does not work either; like Evaluator, deleting a Worker pod fails the whole job.

gaocegege commented 2 years ago

/cc @zw0610 @kubeflow/wg-training-leads

cheimu commented 2 years ago

Hi @rllin,


For busybox, I use `/bin/sh` to make it run; otherwise the pod gets a `ContainerCannotRun` error.

It is unrelated to the main issue.

rllin commented 2 years ago

@cheimu sorry, that's my bad. I had to replace an internal command with the sleep loop and hadn't double-checked the command properly.

cheimu commented 2 years ago

Well, it's a very tricky problem. It took me a very long time to figure it out.

The call chain is as follows:

  1. `jc.Controller.ReconcilePods(metaObject, &jobStatus, pods, rtype, spec, replicas)` -> `commonutil.UpdateJobConditions()`: at this point the TFJob has a Restarting condition.
  2. `jc.Controller.UpdateJobStatus(job, replicas, &jobStatus)` -> `commonutil.UpdateJobConditions()`: HERE WE HIT THE PROBLEM.

There is a for loop over the replica types here. For @rllin's case it iterates twice, Chief first and then Evaluator. At iteration 1 the existing conditions are Created and Restarting, and rtype is Chief, so it "appends" a new Running condition to the job, as shown in the screenshot. HOWEVER, in this case it is not really an append at all.

[screenshot omitted]
  3. `commonutil.UpdateJobConditions()` -> `setCondition(jobStatus, condition)` -> `newConditions := filterOutCondition(status.Conditions, condition.Type)`: in our case the existing conditions are Created and Restarting and the new one is Running, so the Restarting condition hits the `continue` branch (as shown in the screenshot) and is filtered out. The job conditions then become Created and Running. At iteration 2 (for the Evaluator), because there is a failed pod while the job condition is Running, the job status becomes Failed as well (see the sketch further down).

Here the job is set to failed.

[screenshot omitted]

https://github.com/kubeflow/training-operator/blob/master/pkg/controller.v1/tensorflow/tfjob_controller.go#L526, and the whole job is failed....
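
To make the sequence above concrete, here is a minimal, self-contained sketch of the condition bookkeeping as described (simplified types and function names, not the actual kubeflow/common code): setting a Running condition filters out an existing Restarting condition, so when the loop reaches the Evaluator the job no longer looks like it is restarting and the failed pod flips the whole job to Failed.

```go
package main

import "fmt"

// Simplified stand-ins for the job condition types discussed above;
// illustrative only, not the real kubeflow/common definitions.
type ConditionType string

const (
	JobCreated    ConditionType = "Created"
	JobRunning    ConditionType = "Running"
	JobRestarting ConditionType = "Restarting"
	JobFailed     ConditionType = "Failed"
)

// filterOut mimics the behavior attributed to filterOutCondition():
// duplicates of the condition being set are dropped, and Running and
// Restarting filter each other out.
func filterOut(conds []ConditionType, newType ConditionType) []ConditionType {
	var kept []ConditionType
	for _, c := range conds {
		if c == newType {
			continue
		}
		// This is the branch where the Restarting condition is lost
		// when a Running condition is being set.
		if newType == JobRunning && c == JobRestarting {
			continue
		}
		if newType == JobRestarting && c == JobRunning {
			continue
		}
		kept = append(kept, c)
	}
	return kept
}

// setCond mimics setCondition(): filter the existing conditions, then append.
func setCond(conds []ConditionType, newType ConditionType) []ConditionType {
	return append(filterOut(conds, newType), newType)
}

func main() {
	// After ReconcilePods the job carries Created + Restarting.
	conds := []ConditionType{JobCreated, JobRestarting}

	// UpdateJobStatus, loop iteration 1 (Chief): a Running condition is
	// set, which silently drops Restarting.
	conds = setCond(conds, JobRunning)
	fmt.Println(conds) // [Created Running] -> Restarting is gone

	// Loop iteration 2 (Evaluator): the job looks Running but has a
	// failed pod, so the whole job is marked Failed.
	hasFailedPod := true
	if hasFailedPod && conds[len(conds)-1] == JobRunning {
		conds = setCond(conds, JobFailed)
	}
	fmt.Println(conds) // [Created Running Failed] -> entire TFJob fails
}
```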

@zw0610 @gaocegege Hi experts, am I correct about `filterOutCondition()`? I find it confusing...

pavanky commented 2 years ago

This is in line with what we think the issue is, but we weren't sure whether the fix should go in `UpdateJobConditions` or whether line 504 in the second loop needs to check for `JobRunning`.

cheimu commented 2 years ago

> This is in line with what we think the issue is, but we weren't sure whether the fix should go in `UpdateJobConditions` or whether line 504 in the second loop needs to check for `JobRunning`.

Yeah, probably. I tried recording the Restarting condition first and then appending it back; it passed the unit tests and worked for this situation. Please discuss whether there is a more elegant way to handle it! :D
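
A rough sketch of that workaround, reusing the simplified helpers from the earlier snippet (the actual change may differ): remember whether a Restarting condition existed before setting Running, and put it back afterwards so it is not silently dropped.

```go
// setCondPreservingRestart sketches the workaround described above. It
// builds on the ConditionType helpers from the previous sketch and is
// illustrative only, not the actual patch.
func setCondPreservingRestart(conds []ConditionType, newType ConditionType) []ConditionType {
	wasRestarting := false
	for _, c := range conds {
		if c == JobRestarting {
			wasRestarting = true
		}
	}
	conds = setCond(conds, newType)
	if newType == JobRunning && wasRestarting {
		// Re-append directly; going through setCond would drop Running again.
		conds = append(conds, JobRestarting)
	}
	return conds
}
```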

gaocegege commented 2 years ago

/cc @jian-he

It seems the code is from @jian-he; would you mind having a look at this issue?

rllin commented 2 years ago

@cheimu thanks for the quick turnaround!

johnugeorge commented 1 year ago

Fixed by https://github.com/kubeflow/training-operator/pull/1562