Fix deletion of failed jobs

artem-zinnatullin commented 4 years ago

Noticed that controller was not deleting failed jobs in our environment.

This PR fixes finish time lookup for failed jobs.

Few examples of failed job statuses:

"status": {
    "conditions": [
        {
            "lastProbeTime": "2020-06-04T02:23:44Z",
            "lastTransitionTime": "2020-06-04T02:23:44Z",
            "message": "Job has reached the specified backoff limit",
            "reason": "BackoffLimitExceeded",
            "status": "True",
            "type": "Failed"
        }
    ],
    "failed": 1,
    "startTime": "2020-06-04T02:22:30Z"
}

"status": {
        "conditions": [
            {
                "lastProbeTime": "2020-06-03T03:51:11Z",
                "lastTransitionTime": "2020-06-03T03:51:11Z",
                "message": "Job was active longer than specified deadline",
                "reason": "DeadlineExceeded",
                "status": "True",
                "type": "Failed"
            }
        ],
        "startTime": "2020-06-03T03:49:44Z"
    }

Confirmed that it deleted all failed jobs in our cluster after this patch.

artem-zinnatullin commented 4 years ago

@lwolf do you mind taking a look at this one? Would love to get off the fork in our setup :)

lwolf commented 4 years ago

yes, sorry for the delay, wanted to test it first on my cluster.

lwolf commented 4 years ago

The change makes sense, but I don't understand why current version works for me.

I created 2 jobs, default pi and pi with intentionally broken command. Statuses:

status:
  completionTime: "2020-06-08T18:48:50Z"
  conditions:
  - lastProbeTime: "2020-06-08T18:48:50Z"
    lastTransitionTime: "2020-06-08T18:48:50Z"
    status: "True"
    type: Complete
  startTime: "2020-06-08T18:48:36Z"
  succeeded: 1

status:
  conditions:
  - lastProbeTime: "2020-06-08T18:52:06Z"
    lastTransitionTime: "2020-06-08T18:52:06Z"
    message: Job has reached the specified backoff limit
    reason: BackoffLimitExceeded
    status: "True"
    type: Failed
  failed: 4
  startTime: "2020-06-08T18:50:11Z"

Meantime cleanup-operator in dry-run mode prints

2020/06/08 18:50:47 Listening for changes...
2020/06/08 18:51:47 dry-run: Job 'default:pi' would have been deleted
2020/06/08 18:51:47 dry-run: Pod 'default:pi-failed-x5jl6' would have been deleted
2020/06/08 18:51:47 dry-run: Pod 'default:pi-zl7h5' would have been deleted
2020/06/08 18:51:47 dry-run: Pod 'default:pi-failed-zdlrx' would have been deleted
2020/06/08 18:51:47 dry-run: Pod 'default:pi-failed-p9xmm' would have been deleted
2020/06/08 18:51:47 dry-run: Pod 'default:pi-failed-9khhp' would have been deleted
2020/06/08 18:52:47 dry-run: Job 'default:pi' would have been deleted

lwolf commented 4 years ago

do you mind sharing your k8s version?

artem-zinnatullin commented 4 years ago

Sure, I'm running AWS EKS 1.15

From your example I don't see completionTime in the failed job, so I'm not sure how the code without this patch would handle deletion hmm. It would bail out on the if condition that checks if completionTime is not zero I think?

lwolf commented 4 years ago

Thanks.

That's what surprising me as well. I will try to debug it tonight and post the update.

It would bail out on the if condition that checks if completionTime is not zero I think?

yes it should.

artem-zinnatullin commented 4 years ago

@lwolf can you please post your failed jobs as JSON? Just to make sure that what controller works with is the same as what you see via kubectl

(I'm sure you're well aware on how to do it, I'm just really confused on how it works in your case…)

This is how I dumped my examples:

kubectl get job -n mynamespace myjob -o json > ~/Desktop/fail.json

lwolf commented 4 years ago

Sorry for the confusion, old code didn't work properly. Output of dry-run was cleaning only failed pods, not the job.

Merging it

artem-zinnatullin commented 4 years ago

Oh, indeed, I also didn't notice that your log had only deletion of the failed pods, not the jobs 😅

Thanks for merging and releasing a new version!

lwolf / kube-cleanup-operator

Fix deletion of failed jobs #48