kubernetes / kubernetes

Production-Grade Container Scheduling and Management
https://kubernetes.io
Apache License 2.0

Cronjobs - failedJobsHistoryLimit not reaping state `Error` #53331

Open civik opened 6 years ago

civik commented 6 years ago

/kind bug /sig apps

Cronjob limits were defined in #52390; however, it doesn't appear that failedJobsHistoryLimit will reap cronjob pods that end up in a state of Error.

 kubectl get pods --show-all | grep cronjob | grep Error | wc -l
 566

Cronjob had failedJobsHistoryLimit set to 2
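For context, the relevant part of the CronJob spec looks roughly like this (illustrative values, not the exact manifest from this report):

spec:
  schedule: "*/5 * * * *"          # illustrative
  successfulJobsHistoryLimit: 3
  failedJobsHistoryLimit: 2        # expectation: only the 2 most recent failed Jobs (and their pods) are kept
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never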

Environment:

Client Version: version.Info{Major:"1", Minor:"7", GitVersion:"v1.7.6", GitCommit:"4bc5e7f9a6c25dc4c03d4d656f2cefd21540e28c", GitTreeState:"clean", BuildDate:"2017-09-15T08:51:09Z", GoVersion:"go1.9", Compiler:"gc", Platform:"darwin/amd64"} Server Version: version.Info{Major:"1", Minor:"6", GitVersion:"v1.6.4", GitCommit:"d6f433224538d4f9ca2f7ae19b252e6fcb66a3ae", GitTreeState:"clean", BuildDate:"2017-05-19T18:33:17Z", GoVersion:"go1.7.5", Compiler:"gc", Platform:"linux/amd64"}

Centos7.3

4.4.83-1.el7.elrepo.x86_64 #1 SMP Thu Aug 17 09:03:51 EDT 2017 x86_64 x86_64 x86_64 GNU/Linux

civik commented 6 years ago

/sig apps

dims commented 6 years ago

cc @soltysh

soltysh commented 6 years ago

@civik @imiskolee can you folks provide a situation where your pod failed in a cronjob? I'm specifically interested in the phase of the pod (see the official docs). The only one that comes to mind off the top of my head is when I specify a wrong image, for example. In that case the pod is not failed but pending, which means neither of the controllers (job nor cronjob) can qualify the execution as a failed one and do anything about it. So no removal can actually happen.

There are a few possible approaches to this problem:

  1. Set activeDeadlineSeconds for a job, which will fail the job after it has exceeded that duration.
  2. Ensure backoffLimit is set, which controls the number of retries after which a job is failed. Although in the particular example I gave (with the wrong pullspec), this won't help.

Personally, I usually try to combine the two for tighter control.
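For example, a rough sketch combining both (hypothetical names and values, not anyone's exact manifest; batch/v1beta1 was current at the time, batch/v1 on today's clusters):

apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: bounded-cronjob            # hypothetical
spec:
  schedule: "0 * * * *"
  failedJobsHistoryLimit: 2
  jobTemplate:
    spec:
      activeDeadlineSeconds: 600   # fail the whole Job (and kill its pods) after 10 minutes
      backoffLimit: 3              # mark the Job failed after 3 retries
      template:
        spec:
          restartPolicy: Never
          containers:
          - name: task
            image: busybox
            args: ["/bin/false"]   # placeholder workload that always fails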

soltysh commented 6 years ago

I've also created https://github.com/kubernetes/kubernetes/issues/58384 to discuss the start timeout for a job.

civik commented 6 years ago

@soltysh Thanks for the update. I'm thinking the issues I'm seeing are due to jobs that create another pod with restartPolicy set to OnFailure or Always, which then goes into CrashLoopBackOff. The job will happily keep stamping out pods that sit in a restart loop. Is there some sort of timer that could be set on the parent job that would kill anything it created on a failure?

soltysh commented 6 years ago

@civik iiuc your job is creating another pod, in which case there's no controller owning that pod. In that case you have two options:

  1. set an OwnerRef pointing at the Job (see the sketch below), but that will only remove the pod when the owning pod/job is removed
  2. manually clean up your pods
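A rough sketch of option 1 (every name and the uid here are placeholders; the uid must be the owning Job's actual UID). With this set, the garbage collector deletes the pod when the owning Job is deleted:

apiVersion: v1
kind: Pod
metadata:
  name: worker-pod                              # placeholder: the pod your job creates
  ownerReferences:
  - apiVersion: batch/v1
    kind: Job
    name: parent-job                            # placeholder: the owning Job's name
    uid: 00000000-0000-0000-0000-000000000000   # placeholder: must be the Job's real UID
spec:
  restartPolicy: Never
  containers:
  - name: worker
    image: busybox
    command: ["sleep", "3600"]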

mcronce commented 6 years ago

I'm seeing this happen as well (1.7.3) - successfulJobsHistoryLimit (set to 2) works fine, but failedJobsHistoryLimit (set to 5) ends up with hundreds of pods in CrashLoopBackOff until it eventually hits my nodes' resource limits, and then they just stack up in Pending.

soltysh commented 6 years ago

Pending pods are not failed ones and thus the controller won't be able to clean them.

KIVagant commented 6 years ago

Same problem for me: I've got ~8000 pods in state "Error" when failedJobsHistoryLimit was set to 5. The cronjob had a wrong environment variable, so the containers failed while trying to start at the application level. From the K8s side the configuration was OK, but an internal application error led to this situation.

mcronce commented 6 years ago

@soltysh Correct - however, it should be reaping the ones in Error and CrashLoopBackOff. If it does that correctly, the cluster's resource limits aren't exhausted and they never stack up in Pending.

soltysh commented 6 years ago

@KIVagant @mcronce can you give me the yaml of the pod status you're seeing in the Error state? CrashLoopBackOff is specific, but unfortunately it does not give a definite answer that the pod failed. If you look carefully through the pod status you'll see it's in the waiting state: scheduled, initialized, and waiting for further actions. Nowhere in the code do we have any special casing for situations such as this one, and I'm hesitant to add that to the job controller as well. I'll try to bring this discussion to the next sig-apps meeting and see what the outcome is.

mcronce commented 6 years ago

@soltysh Right now I don't have any; I've been manually clearing them with a little bash one-liner for a while. Next time I experience it, though, I'll grab the YAML and paste it here. Thanks!
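Something along these lines (not my exact one-liner; it assumes the default kubectl get pods columns, where STATUS is the third field):

# delete every pod currently showing STATUS "Error" in the current namespace
kubectl get pods --no-headers | awk '$3 == "Error" {print $1}' | xargs -r kubectl delete pod

# on newer clusters, selecting by phase works too
kubectl delete pods --field-selector=status.phase=Failed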

KIVagant commented 6 years ago

Same for me. I've already fixed the root cause of the failed pods and cleared all of them. I can reproduce the situation, but right now I have a much bigger problem with the cluster and kops, so maybe later.

garethlewin commented 6 years ago

@soltysh here are the results of describe and get -o yaml with a bunch of stuff removed (tried to keep just what is relevant)

Status:         Failed
Containers:
  kollector:
    State:          Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Tue, 17 Apr 2018 04:50:26 -0700
      Finished:     Tue, 17 Apr 2018 04:50:27 -0700
    Ready:          False
    Restart Count:  0
Conditions:
  Type           Status
  Initialized    True 
  Ready          False 
  PodScheduled   True 
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     node.alpha.kubernetes.io/notReady:NoExecute for 300s
                 node.alpha.kubernetes.io/unreachable:NoExecute for 300s

apiVersion: v1
kind: Pod
metadata:
  ownerReferences:
  - apiVersion: batch/v1
    blockOwnerDeletion: true
    controller: true
    kind: Job
spec:
  containers:
  restartPolicy: Never
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: 2018-04-17T11:50:25Z
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: 2018-04-17T11:50:25Z
    message: 'containers with unready status: [kollector]'
    reason: ContainersNotReady
    status: "False"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: 2018-04-17T11:50:25Z
    status: "True"
    type: PodScheduled
  containerStatuses:
  - containerID: docker://7f9ca3488d4e714f1264620b2385cbf2b8ced40de26e6f5a0ec22e73385701ed
    lastState: {}
    name: kollector
    ready: false
    restartCount: 0
    state:
      terminated:
        containerID: docker://7f9ca3488d4e714f1264620b2385cbf2b8ced40de26e6f5a0ec22e73385701ed
        exitCode: 1
        finishedAt: 2018-04-17T11:50:27Z
        reason: Error
        startedAt: 2018-04-17T11:50:26Z
  phase: Failed
  qosClass: Burstable
  startTime: 2018-04-17T11:50:25Z

soltysh commented 6 years ago

Apparently there's https://github.com/kubernetes/kubernetes/issues/62382 which I fixed in https://github.com/kubernetes/kubernetes/pull/63650. Maybe you're hitting that?

fejta-bot commented 6 years ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle stale

civik commented 6 years ago

/remove-lifecycle stale

I think this might still be an active issue impacting operators. Can anyone confirm whether this was fixed by #63650? I don't have an environment in which to test this right now.

soltysh commented 6 years ago

@civik nope, the linked PR is for handling backoffs, not to address problems with Error state, which is far more complicated, like I said before.

sidewinder12s commented 6 years ago

Yeah, I think we've run into this issue as well, with pods in the Error state (3000+). The error we'd see (this was a developer's cron as opposed to something from the operator side) is that the cronjob config within the container would be incorrect, so the image would pull, start, and then error out.

I'm also not sure if activeDeadlineSeconds would help with this case. The general pattern we've seen is that the containers start, the cronjob fails and enters the failed state, and then more get spun up. The containers are in a failed state so they don't get cleaned up, and we run out of pods on our nodes.

We're looking at adding a Pod limit to the namespace to contain the issue in the meantime.
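Roughly something like this (illustrative names and numbers only):

apiVersion: v1
kind: ResourceQuota
metadata:
  name: cronjob-pod-cap        # illustrative
  namespace: batch-workloads   # illustrative
spec:
  hard:
    pods: "50"                 # no more than 50 pods may exist in this namespace at once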

soltysh commented 6 years ago

activeDeadlineSeconds should kill the job upon exceeding this time, in other words failing it and making it available for removal.

sidewinder12s commented 6 years ago

@soltysh Thanks for the tip, does activeDeadlineSeconds behave any differently between how it's defined in batch spec vs pod spec?

I'll give it a try adding that to our batch spec to see if it solves the issue of 1000s of pods for us.

It would appear this isn't very intuitive and has burned quite a few people; should this perhaps have a defined default to keep people from shooting themselves in the foot? If not, the way the various timeouts/reaping/etc. for cronjobs interact appears to be very confusing either way.

Also here is our job spec that goes into error on 1.10.4.

apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: routing-process-csv--stage
spec:
  # Every 0700 UTC (0000 Pacific Time)
  schedule: "0 7 * * *"
  successfulJobsHistoryLimit: 1
  suspend: false
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: routing-process-csv--stage
            image: <URL>:2012/atm-cron-jobs-base
            args:
            - curl -v -X GET <API Endpoint/webhook>
            imagePullPolicy: Always
          imagePullSecrets:
            - name: <redacted>-2012
          restartPolicy: Never
      backoffLimit: 3

fejta-bot commented 5 years ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle stale

fejta-bot commented 5 years ago

Stale issues rot after 30d of inactivity. Mark the issue as fresh with /remove-lifecycle rotten. Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle rotten

fejta-bot commented 5 years ago

Rotten issues close after 30d of inactivity. Reopen the issue with /reopen. Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /close

k8s-ci-robot commented 5 years ago

@fejta-bot: Closing this issue.

madsonic commented 5 years ago

Hmm, this seems to still be happening. I have a container that doesn't exit and another getting exit code 1, resulting in Error, but these pods don't get reaped.

Client Version: version.Info{Major:"1", Minor:"10", GitVersion:"v1.10.3", GitCommit:"2bba0127d85d5a46ab4b778548be28623b32d0b0", GitTreeState:"clean", BuildDate:"2018-05-21T09:17:39Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"12+", GitVersion:"v1.12.7-gke.25", GitCommit:"d4c79083ab6dea5d26ef4ed8d50b145268349bc3", GitTreeState:"clean", BuildDate:"2019-06-22T16:10:31Z", GoVersion:"go1.10.8b4", Compiler:"gc", Platform:"linux/amd64"}
mrak commented 5 years ago

/reopen

We are still seeing this on 1.13 and 1.14

k8s-ci-robot commented 5 years ago

@mrak: You can't reopen an issue/PR unless you authored it or you are a collaborator.

2rs2ts commented 5 years ago

/reopen

Seeing this on 1.14.6 ATM

k8s-ci-robot commented 5 years ago

@2rs2ts: You can't reopen an issue/PR unless you authored it or you are a collaborator.

2rs2ts commented 5 years ago

@civik can you reopen this?

gitnik commented 4 years ago

Hey guys, what's the state of this issue? Since it wasn't reopened I am assuming it was maybe fixed? But we are still seeing this issue in 1.14

2rs2ts commented 4 years ago

It was probably not fixed, people just ghost on their own issues :/

vincent-pli commented 4 years ago

Seems the issue still there: https://github.com/kubernetes/kubernetes/blob/7766e65a1bc2f85da79c6a30936137dfbaf37fb7/pkg/controller/cronjob/cronjob_controller.go#L162-L168

2rs2ts commented 4 years ago

Should I file a duplicate issue since the OP has not reopened the issue?

alejandrox1 commented 4 years ago

Reopening this because I see a lot of attempts to do so (only org members can use prow commands). /reopen

k8s-ci-robot commented 4 years ago

@alejandrox1: Reopened this issue.

alejandrox1 commented 4 years ago

Going to freeze this until someone volunteers to work on it. /lifecycle frozen

ohthehugemanatee commented 4 years ago

I thought I ran into this with an easy-to-reproduce example... but in the end it validates that .spec.backoffLimit works as intended. I note that the other examples with enough information to reproduce all happened before the default .spec.backoffLimit was introduced.

apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: curl
spec:
  schedule: "0 * * * *"
  concurrencyPolicy: Forbid
  successfulJobsHistoryLimit: 1
  failedJobsHistoryLimit: 1
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: curl
            image: buildpack-deps:curl
            args:
            - /bin/sh
            - -ec
            - curl http://some-service
          restartPolicy: Never

I made a mistake and forgot that some-service is listening on port 3000, not 80, so curl fails to connect and times out. I came back in the morning and had 5 empty pods in status Error. It looks like the default .spec.backoffLimit value worked just fine for me. I suspect that addition is why we see a sharp dropoff in interest in this issue.

For future developers who feel that they've run into this problem:

2rs2ts commented 4 years ago

The backoffLimit has definitely helped mitigate this, but my company has 768 cronjobs in one of our production clusters :) It's a not-too-uncommon occurrence that we get support requests for cronjobs that haven't fired in a while because of this bug. We're on 1.17.8 now and we still get these requests from time to time.

soltysh commented 4 years ago

The problem with the Error state as presented in kubectl is that these are usually jobs that are still running. It's hard for the controller to speculate whether such an error state is permanent or temporary. Unless there's a clear Failed signal, the controller won't be able to differentiate between the two. So this is not quite a bug.

2rs2ts commented 3 years ago

We have jobs that don't restart when they get an error and they don't get reaped sometimes. So it does seem like a bug to me.

soltysh commented 3 years ago

Do you have an example yaml of such a failed pod?

2rs2ts commented 3 years ago

@soltysh if I find a repro case I will share it, however it'll be pretty heavily redacted (company secrets and all that) so I'm not sure how much help that'll be.

gtorre commented 3 years ago

This is an issue for us as well:

provisioner-supervise-1607355000-jpn2q          0/1     Error       0          44d
provisioner-supervise-1607355000-lj6lp          0/1     Error       0          44d
provisioner-supervise-1607355000-pjnkr          0/1     Error       0          44d
provisioner-supervise-1607355000-szlpd          0/1     Error       0          44d
provisioner-supervise-1607355000-vfh9z          0/1     Error       0          44d
provisioner-supervise-1607355000-z4rsx          0/1     Error       0          44d
provisioner-supervise-1607355000-zh9vx          0/1     Error       0          44d
provisioner-supervise-1608060600-2vcsd          0/1     Error       0          35d
provisioner-supervise-1608060600-kckfl          0/1     Error       0          35d
provisioner-supervise-1608060600-mdqgp          0/1     Error       0          35d
provisioner-supervise-1608060600-nlgsg          0/1     Error       0          35d
provisioner-supervise-1608060600-zbws7          0/1     Error       0          35d
provisioner-supervise-1608060600-zvgmc          0/1     Error       0          35d
provisioner-supervise-1611159000-dss9j          0/1     Completed   0          9m3s

Our cronjob spec looks like this:

spec:
  schedule: "*/10 * * * *"
  successfulJobsHistoryLimit: 1
  failedJobsHistoryLimit: 2

Per @soltysh's request in a previous comment, here is the json output of a failed pod:

{
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {
        "creationTimestamp": "2020-12-07T15:31:31Z",
        "generateName": "provisioner-supervise-1607355000-",
        "labels": {
            "controller-uid": "8aa58562-fc22-4782-b94e-a2dcb6071328",
            "job-name": "provisioner-supervise-1607355000"
        },
        "managedFields": [
            {
                "apiVersion": "v1",
                "fieldsType": "FieldsV1",
                "fieldsV1": {
                    "f:metadata": {
                        "f:generateName": {},
                        "f:labels": {
                            ".": {},
                            "f:controller-uid": {},
                            "f:job-name": {}
                        },
                        "f:ownerReferences": {
                            ".": {},
                            "k:{\"uid\":\"8aa58562-fc22-4782-b94e-a2dcb6071328\"}": {
                                ".": {},
                                "f:apiVersion": {},
                                "f:blockOwnerDeletion": {},
                                "f:controller": {},
                                "f:kind": {},
                                "f:name": {},
                                "f:uid": {}
                            }
                        }
                    },
                    "f:spec": {
                        "f:containers": {
                            "k:{\"name\":\"provisioner-supervise\"}": {
                                ".": {},
                                "f:args": {},
                                "f:image": {},
                                "f:imagePullPolicy": {},
                                "f:name": {},
                                "f:resources": {
                                    ".": {},
                                    "f:limits": {
                                        ".": {},
                                        "f:cpu": {},
                                        "f:memory": {}
                                    },
                                    "f:requests": {
                                        ".": {},
                                        "f:cpu": {},
                                        "f:memory": {}
                                    }
                                },
                                "f:terminationMessagePath": {},
                                "f:terminationMessagePolicy": {}
                            }
                        },
                        "f:dnsPolicy": {},
                        "f:enableServiceLinks": {},
                        "f:restartPolicy": {},
                        "f:schedulerName": {},
                        "f:securityContext": {},
                        "f:terminationGracePeriodSeconds": {}
                    }
                },
                "manager": "kube-controller-manager",
                "operation": "Update",
                "time": "2020-12-07T15:31:31Z"
            },
            {
                "apiVersion": "v1",
                "fieldsType": "FieldsV1",
                "fieldsV1": {
                    "f:status": {
                        "f:conditions": {
                            "k:{\"type\":\"ContainersReady\"}": {
                                ".": {},
                                "f:lastProbeTime": {},
                                "f:lastTransitionTime": {},
                                "f:message": {},
                                "f:reason": {},
                                "f:status": {},
                                "f:type": {}
                            },
                            "k:{\"type\":\"Initialized\"}": {
                                ".": {},
                                "f:lastProbeTime": {},
                                "f:lastTransitionTime": {},
                                "f:status": {},
                                "f:type": {}
                            },
                            "k:{\"type\":\"Ready\"}": {
                                ".": {},
                                "f:lastProbeTime": {},
                                "f:lastTransitionTime": {},
                                "f:message": {},
                                "f:reason": {},
                                "f:status": {},
                                "f:type": {}
                            }
                        },
                        "f:containerStatuses": {},
                        "f:hostIP": {},
                        "f:phase": {},
                        "f:podIP": {},
                        "f:podIPs": {
                            ".": {},
                            "k:{\"ip\":\"some_ip"}": {
                                ".": {},
                                "f:ip": {}
                            }
                        },
                        "f:startTime": {}
                    }
                },
                "manager": "kubelet",
                "operation": "Update",
                "time": "2020-12-07T15:31:42Z"
            }
        ],
        "name": "provisioner-supervise-1607355000-szlpd",
        "namespace": "some_namespace",
        "ownerReferences": [
            {
                "apiVersion": "batch/v1",
                "blockOwnerDeletion": true,
                "controller": true,
                "kind": "Job",
                "name": "provisioner-supervise-1607355000",
                "uid": "8aa58562-fc22-4782-b94e-a2dcb6071328"
            }
        ],
        "resourceVersion": "453999483",
        "selfLink": "/api/v1/namespaces/some_namespace/pods/provisioner-supervise-1607355000-szlpd",
        "uid": "9dab634b-e100-4847-b371-9125c65b615d"
    },
    "spec": {
        "containers": [
            {
                "args": [
                    "/bin/sh",
                    "-c",
                    "wget -SO - https://provisioner.example.net/endpoint"
                ],
                "image": "busybox",
                "imagePullPolicy": "Always",
                "name": "provisioner-supervise",
                "resources": {
                    "limits": {
                        "cpu": "500m",
                        "memory": "512Mi"
                    },
                    "requests": {
                        "cpu": "500m",
                        "memory": "512Mi"
                    }
                },
                "terminationMessagePath": "/dev/termination-log",
                "terminationMessagePolicy": "File",
                "volumeMounts": [
                    {
                        "mountPath": "/var/run/secrets/kubernetes.io/serviceaccount",
                        "name": "some-token",
                        "readOnly": true
                    }
                ]
            }
        ],
        "dnsPolicy": "ClusterFirst",
        "enableServiceLinks": true,
        "nodeName": "kubernetes.example.net",
        "priority": 0,
        "restartPolicy": "Never",
        "schedulerName": "default-scheduler",
        "securityContext": {},
        "serviceAccount": "default",
        "serviceAccountName": "default",
        "terminationGracePeriodSeconds": 30,
        "tolerations": [
            {
                "effect": "NoExecute",
                "key": "node.kubernetes.io/not-ready",
                "operator": "Exists",
                "tolerationSeconds": 300
            },
            {
                "effect": "NoExecute",
                "key": "node.kubernetes.io/unreachable",
                "operator": "Exists",
                "tolerationSeconds": 300
            }
        ],
        "volumes": [
            {
                "name": "some-token",
                "secret": {
                    "defaultMode": 420,
                    "secretName": "some-token"
                }
            }
        ]
    },
    "status": {
        "conditions": [
            {
                "lastProbeTime": null,
                "lastTransitionTime": "2020-12-07T15:31:31Z",
                "status": "True",
                "type": "Initialized"
            },
            {
                "lastProbeTime": null,
                "lastTransitionTime": "2020-12-07T15:31:31Z",
                "message": "containers with unready status: [provisioner-supervise]",
                "reason": "ContainersNotReady",
                "status": "False",
                "type": "Ready"
            },
            {
                "lastProbeTime": null,
                "lastTransitionTime": "2020-12-07T15:31:31Z",
                "message": "containers with unready status: [provisioner-supervise]",
                "reason": "ContainersNotReady",
                "status": "False",
                "type": "ContainersReady"
            },
            {
                "lastProbeTime": null,
                "lastTransitionTime": "2020-12-07T15:31:31Z",
                "status": "True",
                "type": "PodScheduled"
            }
        ],
        "containerStatuses": [
            {
                "containerID": "docker://e19bfd01a16a63761b4e3370752c54af2854ef4a9e0a4af6fb94a0bd85befa43",
                "image": "busybox:latest",
                "imageID": "docker-pullable://busybox@sha256:bde48e1751173b709090c2539fdf12d6ba64e88ec7a4301591227ce925f3c678",
                "lastState": {},
                "name": "provisioner-supervise",
                "ready": false,
                "restartCount": 0,
                "started": false,
                "state": {
                    "terminated": {
                        "containerID": "docker://e19bfd01a16a63761b4e3370752c54af2854ef4a9e0a4af6fb94a0bd85befa43",
                        "exitCode": 1,
                        "finishedAt": "2020-12-07T15:31:41Z",
                        "reason": "Error",
                        "startedAt": "2020-12-07T15:31:41Z"
                    }
                }
            }
        ],
        "hostIP": "some_ip",
        "phase": "Failed",
        "podIP": "some_ip",
        "podIPs": [
            {
                "ip": "some_ip"
            }
        ],
        "qosClass": "Guaranteed",
        "startTime": "2020-12-07T15:31:31Z"
    }
}

soltysh commented 3 years ago

I think we have a problem in the job controller, not the cronjob controller. A somewhat similar situation to the one described here is in https://github.com/kubernetes/kubernetes/issues/93783. In both cases the job controller will indefinitely try to complete a job, but either due to an error in the pod or other issues (quota, wrong pull spec, etc.) the pod will not start or will always fail. We would need a safety mechanism in the job controller which would eventually fail or pause a job that is permanently stuck.

soltysh commented 3 years ago

Hmm... I just tried with an explicitly failing job:

apiVersion: batch/v1
kind: CronJob
metadata:
  name: my-job
spec:
  jobTemplate:
    metadata:
      name: my-job
    spec:
      template:
        metadata:
        spec:
          containers:
          - image: busybox
            name: my-job
            args:
            - "/bin/false"
          restartPolicy: OnFailure
  schedule: '*/1 * * * *'
  successfulJobsHistoryLimit: 1
  failedJobsHistoryLimit: 1

It does take a somewhat longer wait, but eventually the job controller fails the job; it just takes a significant amount of time until a pod reaches the error state.

What did the job look like in your situation, where the pod wasn't counted as failed?

Rkapoor1707 commented 2 years ago

I am facing the same issue: the cronjob pod errors out into CrashLoopBackOff due to some issue, and the subsequent pods just go into the Pending state. I was able to resolve the CrashLoopBackOff itself, but I have to manually delete all the cron jobs to terminate the pods stuck in Pending. It would be good to have these pods either not created in the first place (because the previous jobs are stuck rather than failing) or terminated after a certain amount of time, instead of new ones being spun up.

I tried setting both .spec.activeDeadlineSeconds and .spec.progressDeadlineSeconds in the cronjob, but neither worked. I have backoffLimit set to 0, but that does not terminate any pods.

Has anyone been able to successfully test using another cron job to delete such stuck pods?
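Something along these lines is what I have in mind (purely a sketch; the image and ServiceAccount names are examples, and the ServiceAccount needs RBAC permission to list and delete pods in the namespace):

apiVersion: batch/v1
kind: CronJob
metadata:
  name: failed-pod-reaper              # example name
spec:
  schedule: "0 * * * *"
  successfulJobsHistoryLimit: 1
  failedJobsHistoryLimit: 1
  jobTemplate:
    spec:
      activeDeadlineSeconds: 120       # don't let the reaper itself linger
      template:
        spec:
          serviceAccountName: pod-reaper   # example; must be bound to a Role allowing list/delete on pods
          restartPolicy: Never
          containers:
          - name: reaper
            image: bitnami/kubectl         # any image that ships kubectl would do
            command: ["kubectl", "delete", "pods", "--field-selector=status.phase=Failed"]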

alculquicondor commented 2 years ago

> I tried setting both .spec.activeDeadlineSeconds and .spec.progressDeadlineSeconds in the cronjob, but neither worked.

Can you elaborate? Those are fields of the Job spec, so you have to put them under .jobTemplate.spec.

jdnurmi commented 1 year ago

Just for anyone else who runs across this and is confused: those fields apply to the Job. If you care about cleaning up the pods, it's probably easiest to set ttlSecondsAfterFinished on the jobTemplate.
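For example (just a sketch with hypothetical values): ttlSecondsAfterFinished sits on the Job spec, i.e. under .spec.jobTemplate.spec in a CronJob, and removes the finished Job together with its pods once the TTL expires, on clusters where the TTL-after-finished feature is available.

apiVersion: batch/v1
kind: CronJob
metadata:
  name: ttl-example                   # hypothetical
spec:
  schedule: "*/10 * * * *"
  jobTemplate:
    spec:
      ttlSecondsAfterFinished: 3600   # delete the Job (and its pods) an hour after it finishes, whether it succeeded or failed
      backoffLimit: 3
      template:
        spec:
          restartPolicy: Never
          containers:
          - name: task
            image: busybox
            command: ["/bin/true"]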