artem-zinnatullin closed this pull request 4 years ago
@lwolf do you mind taking a look at this one? Would love to get off the fork in our setup :)
Yes, sorry for the delay; I wanted to test it first on my cluster.
The change makes sense, but I don't understand why the current version works for me.
I created two jobs: the default pi, and a pi with an intentionally broken command. Their statuses:
```yaml
status:
  completionTime: "2020-06-08T18:48:50Z"
  conditions:
  - lastProbeTime: "2020-06-08T18:48:50Z"
    lastTransitionTime: "2020-06-08T18:48:50Z"
    status: "True"
    type: Complete
  startTime: "2020-06-08T18:48:36Z"
  succeeded: 1
```
```yaml
status:
  conditions:
  - lastProbeTime: "2020-06-08T18:52:06Z"
    lastTransitionTime: "2020-06-08T18:52:06Z"
    message: Job has reached the specified backoff limit
    reason: BackoffLimitExceeded
    status: "True"
    type: Failed
  failed: 4
  startTime: "2020-06-08T18:50:11Z"
```
Meanwhile, the cleanup-operator in dry-run mode prints:
```
2020/06/08 18:50:47 Listening for changes...
2020/06/08 18:51:47 dry-run: Job 'default:pi' would have been deleted
2020/06/08 18:51:47 dry-run: Pod 'default:pi-failed-x5jl6' would have been deleted
2020/06/08 18:51:47 dry-run: Pod 'default:pi-zl7h5' would have been deleted
2020/06/08 18:51:47 dry-run: Pod 'default:pi-failed-zdlrx' would have been deleted
2020/06/08 18:51:47 dry-run: Pod 'default:pi-failed-p9xmm' would have been deleted
2020/06/08 18:51:47 dry-run: Pod 'default:pi-failed-9khhp' would have been deleted
2020/06/08 18:52:47 dry-run: Job 'default:pi' would have been deleted
```
do you mind sharing your k8s version?
Sure, I'm running AWS EKS 1.15
From your example I don't see a completionTime in the failed job, so I'm not sure how the code without this patch would handle deletion, hmm. It would bail out on the if condition that checks that completionTime is non-zero, I think?
Thanks.
That's what's surprising me as well. I'll try to debug it tonight and post an update.
> It would bail out on the if condition that checks that completionTime is non-zero, I think?
yes it should.
@lwolf can you please post your failed jobs as JSON? Just to make sure that what the controller works with is the same as what you see via kubectl.
(I'm sure you're well aware of how to do it, I'm just really confused about how it works in your case…)
This is how I dumped my examples:
```shell
kubectl get job -n mynamespace myjob -o json > ~/Desktop/fail.json
```
Sorry for the confusion, the old code didn't work properly after all. The dry-run output was cleaning only the failed pods, not the job itself.
Merging it
Oh, indeed, I also hadn't noticed that your log showed deletion of only the failed pods, not the jobs 😅
Thanks for merging and releasing a new version!
Noticed that the controller was not deleting failed jobs in our environment.
This PR fixes the finish time lookup for failed jobs.
A few examples of failed job statuses:
Confirmed that it deleted all failed jobs in our cluster after this patch.