Closed by artem-zinnatullin 4 years ago
Thanks for the PR. New day, new edge case :exploding_head: Can you please add a test case for this?
Added tests; sorry if my Go programming is a bit off, it's not my daily language :)
Added another test case to cover your very reasonable question, and improved the test naming to better reflect what's going on.
great, thank you!
Thanks for the review and a quick release! :)
I've found that we have 2 kinds of jobs kept hanging for weeks in our cluster:

1) The ones that exceeded their deadline (`DeadlineExceeded`). The problem in this case is that k8s does not set `job.Status.Failed`, so the solution is to check for a job condition of type `Failed` with status `true` (see the sketch after this list).

Real example of a Job JSON stuck in this state:

2) The ones that have finished successfully, but for some reason their `job.Status.Active` is still `1`, even though there are no related pods. I suspect it's a bug in k8s, maybe a race between job completion and the pod being deleted by this controller (just a guess). The solution here is to rely on `jobFinishTime` rather than `Status.Active`: `jobFinishTime` is non-zero if and only if the Job either completed successfully or failed (also sketched after this list).

Real example of a Job JSON stuck in this state:
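To make the first check concrete, here is a minimal Go sketch (not the actual PR code; the helper name `jobFailed` and the package name are placeholders of mine) that looks for a `Failed` condition with status `True` using the standard `k8s.io/api/batch/v1` types:

```go
package cleanup

import (
	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
)

// jobFailed reports whether the Job carries a condition of type Failed
// with status True. This catches jobs killed by activeDeadlineSeconds
// (reason DeadlineExceeded), where job.Status.Failed can stay at 0.
func jobFailed(job *batchv1.Job) bool {
	for _, c := range job.Status.Conditions {
		if c.Type == batchv1.JobFailed && c.Status == corev1.ConditionTrue {
			return true
		}
	}
	return false
}
```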
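And a sketch of the second idea: deriving a finish time from the Job's terminal conditions instead of trusting `Status.Active`. This only illustrates the approach; the actual `jobFinishTime` in the PR may be implemented differently:

```go
package cleanup

import (
	"time"

	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
)

// jobFinishTime returns the time at which the Job reached a terminal
// condition (Complete or Failed), or the zero time if it hasn't finished.
// A non-zero result means the Job is done, regardless of what
// job.Status.Active claims.
func jobFinishTime(job *batchv1.Job) time.Time {
	for _, c := range job.Status.Conditions {
		isTerminal := c.Type == batchv1.JobComplete || c.Type == batchv1.JobFailed
		if isTerminal && c.Status == corev1.ConditionTrue {
			return c.LastTransitionTime.Time
		}
	}
	return time.Time{}
}
```

The cleanup loop can then treat a Job as finished once `!jobFinishTime(job).IsZero()` and its age exceeds the configured threshold.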
I've verified a build with these changes, first in dry-run and then in prod mode in our cluster; it successfully deleted the hanging jobs while not touching the running ones.