lwolf / kube-cleanup-operator

Kubernetes Operator to automatically delete completed Jobs and their Pods
MIT License

Delete hanging jobs (DeadlineExceeded, Active == 1 but actually completed) #55

Closed artem-zinnatullin closed 4 years ago

artem-zinnatullin commented 4 years ago

I've found that we have 2 kinds of jobs kept hanging for weeks in our cluster:

1) Jobs that exceeded their deadline (DeadlineExceeded). The problem in this case is that k8s does not set job.Status.Failed, so the solution is to check whether the job has a condition of type Failed with status True.

Real example of a Job JSON stuck in this state:

{
    "status": {
        "conditions": [
            {
                "lastProbeTime": "2020-07-23T17:50:10Z",
                "lastTransitionTime": "2020-07-23T17:50:10Z",
                "message": "Job was active longer than specified deadline",
                "reason": "DeadlineExceeded",
                "status": "True",
                "type": "Failed"
            }
        ],
        "startTime": "2020-07-23T17:40:10Z"
    }
}
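The check described in case 1 can be sketched in Go roughly as follows. This is only an illustration under stated assumptions, not the code from this PR: the `job`, `jobStatus` and `jobCondition` structs and the `hasFailedCondition` and `parseJob` helpers are hypothetical local stand-ins that mirror just the fields visible in the JSON above, so the example runs with the standard library alone instead of the real Kubernetes API types.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Hypothetical stand-ins for the Job status fields shown in the JSON above.
// They are NOT the real Kubernetes API types.
type jobCondition struct {
	Type   string `json:"type"`
	Status string `json:"status"`
	Reason string `json:"reason"`
}

type jobStatus struct {
	Active     int            `json:"active"`
	Failed     int            `json:"failed"`
	Conditions []jobCondition `json:"conditions"`
}

type job struct {
	Status jobStatus `json:"status"`
}

// The stuck Job from the example above, reduced to the fields used here.
const deadlineExceededJob = `{
    "status": {
        "conditions": [
            {
                "reason": "DeadlineExceeded",
                "status": "True",
                "type": "Failed"
            }
        ],
        "startTime": "2020-07-23T17:40:10Z"
    }
}`

// hasFailedCondition reports whether any condition has type "Failed"
// and status "True", regardless of the Failed counter.
func hasFailedCondition(s jobStatus) bool {
	for _, c := range s.Conditions {
		if c.Type == "Failed" && c.Status == "True" {
			return true
		}
	}
	return false
}

// parseJob decodes a Job JSON document into the stand-in struct.
func parseJob(raw string) (job, error) {
	var j job
	err := json.Unmarshal([]byte(raw), &j)
	return j, err
}

func main() {
	j, err := parseJob(deadlineExceededJob)
	if err != nil {
		panic(err)
	}
	// The counter is absent from the JSON, so it decodes to 0,
	// while the condition still marks the Job as failed.
	fmt.Println("failed counter:", j.Status.Failed)
	fmt.Println("failed condition:", hasFailedCondition(j.Status))
}
```

With the JSON above, the counter stays at 0 while the condition check returns true, which is why a cleanup rule based only on the counter would skip this Job.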

2) Jobs that have finished successfully but whose job.Status.Active is still 1 for some reason, even though there are no related pods. I suspect it's a bug in k8s, maybe a race between job completion and the pod being deleted by this controller (just a guess). The solution here is to rely on jobFinishTime rather than Status.Active: jobFinishTime is non-zero if and only if the Job either completed successfully or failed.

Real example of a Job JSON stuck in this state:

{
    "status": {
        "active": 1,
        "completionTime": "2020-06-19T02:43:35Z",
        "conditions": [
            {
                "lastProbeTime": "2020-06-19T02:43:35Z",
                "lastTransitionTime": "2020-06-19T02:43:35Z",
                "status": "True",
                "type": "Complete"
            }
        ],
        "startTime": "2020-06-19T02:17:58Z",
        "succeeded": 1
    }
}

I've verified a build with these changes in our cluster, first in dry-run and then in prod mode: it successfully deleted the hanging jobs without touching the running ones.

lwolf commented 4 years ago

Thanks for the PR. New day, new edge-case :exploding_head: Can you please add a test case for this?

artem-zinnatullin commented 4 years ago

Added tests, sorry if my Go programming is a bit off, it's not my daily language :)

artem-zinnatullin commented 4 years ago

Added another test case to cover your very reasonable question, and improved the test names to better reflect what's going on.

lwolf commented 4 years ago

great, thank you!

artem-zinnatullin commented 4 years ago

Thanks for the review and a quick release! :)