Open civik opened 6 years ago
/sig apps
cc @soltysh
@civik @imiskolee can you folks provide a situation where your pod failed in a cronjob? I'm specifically interested in the phase of the pod (see the official docs). The only one that comes to mind off the top of my head is when I specify a wrong image, for example. In that case the pod is not failed, but pending, which means neither of the controllers (job nor cronjob) can qualify this execution as a failed one and do anything about it. So no removal can actually happen.
There are a few possible approaches to this problem:
- activeDeadlineSeconds for a job, which will fail a job after it has exceeded its duration
- backoffLimit, which sets the number of retries after which a job is failed. Although in the particular example I gave (with the wrong pullspec) this won't help.
Personally, I usually combine the two for tighter control.
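For reference, a minimal sketch of a Job spec combining the two controls (activeDeadlineSeconds and backoffLimit are real Job API fields; the name, image, and command are placeholders):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: example-job   # placeholder name
spec:
  # Fail the whole Job if it hasn't finished within 10 minutes,
  # even if pods are stuck pending or retrying.
  activeDeadlineSeconds: 600
  # Fail the Job after 3 failed pod attempts.
  backoffLimit: 3
  template:
    spec:
      containers:
      - name: worker
        image: busybox                          # placeholder image
        args: ["/bin/sh", "-c", "do-the-work"]  # placeholder command
      restartPolicy: Never
```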
I've also created https://github.com/kubernetes/kubernetes/issues/58384 to discuss the start timeout for a job.
@soltysh Thanks for the update. I'm thinking the issues I'm seeing are due to jobs that will create another pod with restartPolicy set to OnFailure or Always that go into CrashBackLoop. The job will happily keep stamping out pods that sit in a restart loop. Is there some sort of timer that could be set on the parent job that could kill anything it created on a failure?
@civik if I understand correctly, your job is creating another pod, in which case there's no controller owning your pod. In that case you have two options:
I'm seeing this happen as well (1.7.3) - successfulJobsHistoryLimit (set to 2) works fine, but failedJobsHistoryLimit (set to 5) will end up with hundreds of pods in CrashLoopBackOff until eventually it hits my nodes' resource limits, and then they just stack up in Pending.
Pending pods are not failed ones and thus the controller won't be able to clean them.
Same problem for me: I had ~8000 pods in state "Error" when failedJobsHistoryLimit was set to 5. The cronjob had a wrong environment variable, so the containers failed to start at the application level. From the K8s side the configuration was OK, but an internal application error led to this situation.
@soltysh Correct - however, it should be reaping the ones in Error and CrashLoopBackOff. If it does that correctly, the cluster's resource limits aren't exhausted and they never stack up in Pending.
@KIVagant @mcronce can you give me the yaml of the pod status you're seeing in the Error state? CrashLoopBackOff is specific, but unfortunately it does not give a definite answer that the pod failed. If you look carefully through the pod status you'll see it's in the waiting state: scheduled, initialized, and waiting for further actions. Nowhere in the code do we have any kind of special casing for situations such as this one, and I'm hesitant to add that to the job controller as well. I'll try to bring this discussion to the next sig-apps meeting and see what the outcome is.
@soltysh Right now I don't have any, I've been manually clearing them with a little bash one-liner for a while. Next time I experience it, though, I'll grab the YAML and paste it here. Thanks!
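(For anyone stuck in the same spot, a cleanup one-liner along these lines works; `--field-selector` is a real kubectl flag, and the namespace is a placeholder. It needs a live cluster, so treat it as a sketch:)

```shell
# Delete all pods in phase Failed in a given namespace (placeholder: my-namespace).
kubectl delete pods --field-selector=status.phase==Failed -n my-namespace
```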
Same for me, I've already fixed the root cause for the failed pods and cleared all of them. I can reproduce the situation, but right now I have much bigger problem with cluster and kops, so maybe later.
@soltysh here are the results of describe
and get -o yaml
with a bunch of stuff removed (tried to keep just what is relevant)
Status: Failed
Containers:
kollector:
State: Terminated
Reason: Error
Exit Code: 1
Started: Tue, 17 Apr 2018 04:50:26 -0700
Finished: Tue, 17 Apr 2018 04:50:27 -0700
Ready: False
Restart Count: 0
Conditions:
Type Status
Initialized True
Ready False
PodScheduled True
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: node.alpha.kubernetes.io/notReady:NoExecute for 300s
node.alpha.kubernetes.io/unreachable:NoExecute for 300s
apiVersion: v1
kind: Pod
metadata:
ownerReferences:
- apiVersion: batch/v1
blockOwnerDeletion: true
controller: true
kind: Job
spec:
containers:
restartPolicy: Never
status:
conditions:
- lastProbeTime: null
lastTransitionTime: 2018-04-17T11:50:25Z
status: "True"
type: Initialized
- lastProbeTime: null
lastTransitionTime: 2018-04-17T11:50:25Z
message: 'containers with unready status: [kollector]'
reason: ContainersNotReady
status: "False"
type: Ready
- lastProbeTime: null
lastTransitionTime: 2018-04-17T11:50:25Z
status: "True"
type: PodScheduled
containerStatuses:
- containerID: docker://7f9ca3488d4e714f1264620b2385cbf2b8ced40de26e6f5a0ec22e73385701ed
lastState: {}
name: kollector
ready: false
restartCount: 0
state:
terminated:
containerID: docker://7f9ca3488d4e714f1264620b2385cbf2b8ced40de26e6f5a0ec22e73385701ed
exitCode: 1
finishedAt: 2018-04-17T11:50:27Z
reason: Error
startedAt: 2018-04-17T11:50:26Z
phase: Failed
qosClass: Burstable
startTime: 2018-04-17T11:50:25Z
Apparently there's https://github.com/kubernetes/kubernetes/issues/62382 which I fixed in https://github.com/kubernetes/kubernetes/pull/63650. Maybe you're hitting that?
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale
.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close
.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle stale
/remove-lifecycle stale
I think this might still be an active issue impacting operators. Can anyone confirm whether this was fixed by #63650? I don't have an environment in which to test this right now.
@civik nope, the linked PR is for handling backoffs, not to address problems with the Error state, which is far more complicated, like I said before.
Ya I think we've run into this issue as well, with pods in the Error state (3000+). I think the error we'd see (this was a developer's cron as opposed to something from the operator side) is that the cronjob config within the container would be incorrect, so the image would pull, get started, and then error out.
I'm also not sure if activeDeadlineSeconds would help with this case. The general pattern we've seen is that the containers would start, the cronjob would fail and enter the failed state, and then more would get spun up. The containers are in a failed state so they wouldn't get cleaned up, and we'd run out of pods on our nodes.
We're looking at adding a Pod limit to the namespace to contain the issue in the meantime.
activeDeadlineSeconds should kill the job upon exceeding this time, in other words failing it and making it available to be removed.
@soltysh Thanks for the tip. Does activeDeadlineSeconds behave any differently between how it's defined in the batch spec vs the pod spec?
I'll try adding it to our batch spec to see if it solves the issue of 1000s of pods for us.
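For what it's worth, the field exists at both levels and they behave differently; a sketch (both activeDeadlineSeconds placements are real API fields, everything else is a placeholder):

```yaml
apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: example   # placeholder name
spec:
  schedule: "0 7 * * *"
  jobTemplate:
    spec:
      # Job-level: fails the Job (and terminates its active pods) after 600s,
      # making it eligible for failedJobsHistoryLimit cleanup.
      activeDeadlineSeconds: 600
      template:
        spec:
          # Pod-level: kills just this pod after 300s; the Job may still
          # create replacement pods up to backoffLimit.
          activeDeadlineSeconds: 300
          containers:
          - name: example
            image: busybox   # placeholder image
          restartPolicy: Never
```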
It would appear this isn't very intuitive and has burned quite a few people. Should this perhaps have a defined default to keep people from shooting themselves in the foot? If not, then the whole way the various timeouts/reaping/etc. for cronjobs work is very confusing either way.
Also here is our job spec that goes into error on 1.10.4.
apiVersion: batch/v1beta1
kind: CronJob
metadata:
name: routing-process-csv--stage
spec:
# Every 0700 UTC (0000 Pacific Time)
schedule: "0 7 * * *"
successfulJobsHistoryLimit: 1
suspend: false
jobTemplate:
spec:
template:
spec:
containers:
- name: routing-process-csv--stage
image: <URL>:2012/atm-cron-jobs-base
args:
- curl -v -X GET <API Endpoint/webhook>
imagePullPolicy: Always
imagePullSecrets:
- name: <redacted>-2012
restartPolicy: Never
backoffLimit: 3
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale
.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close
.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle stale
Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten
.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close
.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle rotten
Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen
.
Mark the issue as fresh with /remove-lifecycle rotten
.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /close
@fejta-bot: Closing this issue.
hmm, seems to be still happening. I have a container that doesn't exit and another getting exit code 1 resulting in Error, but these pods don't get reaped.
Client Version: version.Info{Major:"1", Minor:"10", GitVersion:"v1.10.3", GitCommit:"2bba0127d85d5a46ab4b778548be28623b32d0b0", GitTreeState:"clean", BuildDate:"2018-05-21T09:17:39Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"12+", GitVersion:"v1.12.7-gke.25", GitCommit:"d4c79083ab6dea5d26ef4ed8d50b145268349bc3", GitTreeState:"clean", BuildDate:"2019-06-22T16:10:31Z", GoVersion:"go1.10.8b4", Compiler:"gc", Platform:"linux/amd64"}
/reopen
We are still seeing this on 1.13 and 1.14.
@mrak: You can't reopen an issue/PR unless you authored it or you are a collaborator.
/reopen
Seeing this on 1.14.6 ATM
@2rs2ts: You can't reopen an issue/PR unless you authored it or you are a collaborator.
@civik can you reopen this?
Hey guys,
what's the state of this issue? Since it wasn't reopened I am assuming it was maybe fixed? But we are still seeing this issue in 1.14
It was probably not fixed, people just ghost on their own issues :/
Should I file a duplicate issue since the OP has not reopened the issue?
Reopening this because I see a lot of attempts to do so (only org members can use prow commands). /reopen
@alejandrox1: Reopened this issue.
gonna freeze this until someone wants to volunteer to work on this /lifecycle frozen
I thought I ran into this with an easy-to-reproduce example... but in the end it validates that .spec.backoffLimit works as intended. I note that the other examples with information to reproduce all happened before the default .spec.backoffLimit was introduced.
apiVersion: batch/v1beta1
kind: CronJob
metadata:
name: curl
spec:
schedule: "0 * * * *"
concurrencyPolicy: Forbid
successfulJobsHistoryLimit: 1
failedJobsHistoryLimit: 1
jobTemplate:
spec:
template:
spec:
containers:
- name: curl
image: buildpack-deps:curl
args:
- /bin/sh
- -ec
- curl http://some-service
restartPolicy: Never
I made a mistake and forgot that some-service is listening on port 3000, not 80. So curl fails to connect and times out. I came back in the morning and had 5 empty pods in status Error. Looks like the default .spec.backoffLimit value worked just fine for me. I suspect that addition is why we see a sharp dropoff in interest in this issue.
For future developers who feel that they've run into this problem: the job resource has good debugging/logging information for you. Please include the output of kubectl describe job my-job-1596704400 in your post.
The backoffLimit has definitely helped mitigate this, but my company has 768 cronjobs in one of our production clusters :) It is a not-too-uncommon occurrence that we get support requests for cronjobs that haven't fired in a while because of this bug. We're on 1.17.8 now and we still get these requests from time to time.
The problem with the Error state as presented in kubectl is that these are usually jobs that are still running. It's hard for the controller to speculate whether such an error state is permanent or temporary. Unless there's a clear Failed signal, the controller won't be able to differentiate between those. So this is not quite a bug.
We have jobs that don't restart when they get an error and they don't get reaped sometimes. So it does seem like a bug to me.
Do you have an example yaml of such a failed pod?
@soltysh if I find a repro case I will share it, however it'll be pretty heavily redacted (company secrets and all that) so I'm not sure how much help that'll be.
This is an issue for us as well:
provisioner-supervise-1607355000-jpn2q 0/1 Error 0 44d
provisioner-supervise-1607355000-lj6lp 0/1 Error 0 44d
provisioner-supervise-1607355000-pjnkr 0/1 Error 0 44d
provisioner-supervise-1607355000-szlpd 0/1 Error 0 44d
provisioner-supervise-1607355000-vfh9z 0/1 Error 0 44d
provisioner-supervise-1607355000-z4rsx 0/1 Error 0 44d
provisioner-supervise-1607355000-zh9vx 0/1 Error 0 44d
provisioner-supervise-1608060600-2vcsd 0/1 Error 0 35d
provisioner-supervise-1608060600-kckfl 0/1 Error 0 35d
provisioner-supervise-1608060600-mdqgp 0/1 Error 0 35d
provisioner-supervise-1608060600-nlgsg 0/1 Error 0 35d
provisioner-supervise-1608060600-zbws7 0/1 Error 0 35d
provisioner-supervise-1608060600-zvgmc 0/1 Error 0 35d
provisioner-supervise-1611159000-dss9j 0/1 Completed 0 9m3s
Our cronjob spec looks like this:
spec:
schedule: "*/10 * * * *"
successfulJobsHistoryLimit: 1
failedJobsHistoryLimit: 2
Per @soltysh's request in a previous comment, here is the json output of a failed pod:
{
"apiVersion": "v1",
"kind": "Pod",
"metadata": {
"creationTimestamp": "2020-12-07T15:31:31Z",
"generateName": "provisioner-supervise-1607355000-",
"labels": {
"controller-uid": "8aa58562-fc22-4782-b94e-a2dcb6071328",
"job-name": "provisioner-supervise-1607355000"
},
"managedFields": [
{
"apiVersion": "v1",
"fieldsType": "FieldsV1",
"fieldsV1": {
"f:metadata": {
"f:generateName": {},
"f:labels": {
".": {},
"f:controller-uid": {},
"f:job-name": {}
},
"f:ownerReferences": {
".": {},
"k:{\"uid\":\"8aa58562-fc22-4782-b94e-a2dcb6071328\"}": {
".": {},
"f:apiVersion": {},
"f:blockOwnerDeletion": {},
"f:controller": {},
"f:kind": {},
"f:name": {},
"f:uid": {}
}
}
},
"f:spec": {
"f:containers": {
"k:{\"name\":\"provisioner-supervise\"}": {
".": {},
"f:args": {},
"f:image": {},
"f:imagePullPolicy": {},
"f:name": {},
"f:resources": {
".": {},
"f:limits": {
".": {},
"f:cpu": {},
"f:memory": {}
},
"f:requests": {
".": {},
"f:cpu": {},
"f:memory": {}
}
},
"f:terminationMessagePath": {},
"f:terminationMessagePolicy": {}
}
},
"f:dnsPolicy": {},
"f:enableServiceLinks": {},
"f:restartPolicy": {},
"f:schedulerName": {},
"f:securityContext": {},
"f:terminationGracePeriodSeconds": {}
}
},
"manager": "kube-controller-manager",
"operation": "Update",
"time": "2020-12-07T15:31:31Z"
},
{
"apiVersion": "v1",
"fieldsType": "FieldsV1",
"fieldsV1": {
"f:status": {
"f:conditions": {
"k:{\"type\":\"ContainersReady\"}": {
".": {},
"f:lastProbeTime": {},
"f:lastTransitionTime": {},
"f:message": {},
"f:reason": {},
"f:status": {},
"f:type": {}
},
"k:{\"type\":\"Initialized\"}": {
".": {},
"f:lastProbeTime": {},
"f:lastTransitionTime": {},
"f:status": {},
"f:type": {}
},
"k:{\"type\":\"Ready\"}": {
".": {},
"f:lastProbeTime": {},
"f:lastTransitionTime": {},
"f:message": {},
"f:reason": {},
"f:status": {},
"f:type": {}
}
},
"f:containerStatuses": {},
"f:hostIP": {},
"f:phase": {},
"f:podIP": {},
"f:podIPs": {
".": {},
"k:{\"ip\":\"some_ip"}": {
".": {},
"f:ip": {}
}
},
"f:startTime": {}
}
},
"manager": "kubelet",
"operation": "Update",
"time": "2020-12-07T15:31:42Z"
}
],
"name": "provisioner-supervise-1607355000-szlpd",
"namespace": "some_namespace",
"ownerReferences": [
{
"apiVersion": "batch/v1",
"blockOwnerDeletion": true,
"controller": true,
"kind": "Job",
"name": "provisioner-supervise-1607355000",
"uid": "8aa58562-fc22-4782-b94e-a2dcb6071328"
}
],
"resourceVersion": "453999483",
"selfLink": "/api/v1/namespaces/some_namespace/pods/provisioner-supervise-1607355000-szlpd",
"uid": "9dab634b-e100-4847-b371-9125c65b615d"
},
"spec": {
"containers": [
{
"args": [
"/bin/sh",
"-c",
"wget -SO - https://provisioner.example.net/endpoint"
],
"image": "busybox",
"imagePullPolicy": "Always",
"name": "provisioner-supervise",
"resources": {
"limits": {
"cpu": "500m",
"memory": "512Mi"
},
"requests": {
"cpu": "500m",
"memory": "512Mi"
}
},
"terminationMessagePath": "/dev/termination-log",
"terminationMessagePolicy": "File",
"volumeMounts": [
{
"mountPath": "/var/run/secrets/kubernetes.io/serviceaccount",
"name": "some-token",
"readOnly": true
}
]
}
],
"dnsPolicy": "ClusterFirst",
"enableServiceLinks": true,
"nodeName": "kubernetes.example.net",
"priority": 0,
"restartPolicy": "Never",
"schedulerName": "default-scheduler",
"securityContext": {},
"serviceAccount": "default",
"serviceAccountName": "default",
"terminationGracePeriodSeconds": 30,
"tolerations": [
{
"effect": "NoExecute",
"key": "node.kubernetes.io/not-ready",
"operator": "Exists",
"tolerationSeconds": 300
},
{
"effect": "NoExecute",
"key": "node.kubernetes.io/unreachable",
"operator": "Exists",
"tolerationSeconds": 300
}
],
"volumes": [
{
"name": "some-token",
"secret": {
"defaultMode": 420,
"secretName": "some-token"
}
}
]
},
"status": {
"conditions": [
{
"lastProbeTime": null,
"lastTransitionTime": "2020-12-07T15:31:31Z",
"status": "True",
"type": "Initialized"
},
{
"lastProbeTime": null,
"lastTransitionTime": "2020-12-07T15:31:31Z",
"message": "containers with unready status: [provisioner-supervise]",
"reason": "ContainersNotReady",
"status": "False",
"type": "Ready"
},
{
"lastProbeTime": null,
"lastTransitionTime": "2020-12-07T15:31:31Z",
"message": "containers with unready status: [provisioner-supervise]",
"reason": "ContainersNotReady",
"status": "False",
"type": "ContainersReady"
},
{
"lastProbeTime": null,
"lastTransitionTime": "2020-12-07T15:31:31Z",
"status": "True",
"type": "PodScheduled"
}
],
"containerStatuses": [
{
"containerID": "docker://e19bfd01a16a63761b4e3370752c54af2854ef4a9e0a4af6fb94a0bd85befa43",
"image": "busybox:latest",
"imageID": "docker-pullable://busybox@sha256:bde48e1751173b709090c2539fdf12d6ba64e88ec7a4301591227ce925f3c678",
"lastState": {},
"name": "provisioner-supervise",
"ready": false,
"restartCount": 0,
"started": false,
"state": {
"terminated": {
"containerID": "docker://e19bfd01a16a63761b4e3370752c54af2854ef4a9e0a4af6fb94a0bd85befa43",
"exitCode": 1,
"finishedAt": "2020-12-07T15:31:41Z",
"reason": "Error",
"startedAt": "2020-12-07T15:31:41Z"
}
}
}
],
"hostIP": "some_ip",
"phase": "Failed",
"podIP": "some_ip",
"podIPs": [
{
"ip": "some_ip"
}
],
"qosClass": "Guaranteed",
"startTime": "2020-12-07T15:31:31Z"
}
}
I think we have a problem in the job controller, not the cronjob controller. A somewhat similar situation is being described in https://github.com/kubernetes/kubernetes/issues/93783. In both cases the job controller will indefinitely try to complete a job, but either due to an error in the pod or other issues (quota, wrong pull spec, etc.) the pod will not start or will always fail. We would need a safety mechanism in the job controller which would eventually fail or pause a job that is in a perma-stuck state.
Hmm... I just tried with an explicitly failing job:
apiVersion: batch/v1
kind: CronJob
metadata:
name: my-job
spec:
jobTemplate:
metadata:
name: my-job
spec:
template:
metadata:
spec:
containers:
- image: busybox
name: my-job
args:
- "/bin/false"
restartPolicy: OnFailure
schedule: '*/1 * * * *'
successfulJobsHistoryLimit: 1
failedJobsHistoryLimit: 1
It does take a somewhat longer wait, but eventually the job controller fails the job; it just takes a significant amount of time until a pod reaches the error state.
What did the job look like in your situation, where the pod wasn't counted as failed?
I am facing the same issue: the cronjob pod errors out into CrashLoopBackOff due to some issue, and the following pods just go into the Pending state. I was able to resolve the issue behind the CrashLoopBackOff, but I have to manually delete all cron jobs to terminate the pods stuck in Pending. It would be good to have these pods either not created in the first place, because the previous jobs are erroring out/stuck, or terminated after a certain amount of time instead of new ones being spun up.
I tried setting both .spec.activeDeadlineSeconds and .spec.progressDeadlineSeconds in the cronjob but both did not work. I have backoffLimit set to 0 but that does not terminate any pods.
Has anyone been able to successfully test using another cron job to delete such stuck pods?
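As one possibility (not something endorsed in this thread): a cleanup CronJob along these lines can work, assuming an in-cluster image that provides kubectl (bitnami/kubectl here is an assumption) and a service account bound to RBC permissions to list and delete pods (the pod-cleaner account is hypothetical):

```yaml
apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: failed-pod-cleaner
spec:
  schedule: "*/30 * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: pod-cleaner   # hypothetical account with delete-pods RBAC
          containers:
          - name: cleaner
            image: bitnami/kubectl          # assumed image whose entrypoint is kubectl
            args:
            - delete
            - pods
            - --field-selector=status.phase==Failed
          restartPolicy: Never
```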
I tried setting both .spec.activeDeadlineSeconds and .spec.progressDeadlineSeconds in the cronjob but both did not work.
Can you elaborate? Those are fields of the Job spec, so you have to put them under .jobTemplate.spec.
Just for someone else who runs across this and is confused - those fields apply to the job; if you care about cleaning up the pod, it's probably easiest to set ttlSecondsAfterFinished on the jobTemplate.
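A minimal sketch of that placement (ttlSecondsAfterFinished is a real Job field, though gated behind the TTLAfterFinished feature gate on older clusters; the name and image are placeholders):

```yaml
apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: example   # placeholder name
spec:
  schedule: "*/10 * * * *"
  jobTemplate:
    spec:
      # Delete the finished Job (and its pods) 1 hour after it completes,
      # whether it succeeded or failed.
      ttlSecondsAfterFinished: 3600
      template:
        spec:
          containers:
          - name: example
            image: busybox   # placeholder image
          restartPolicy: Never
```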
/kind bug /sig apps
Cronjob limits were defined in #52390 - however, it doesn't appear that failedJobsHistoryLimit will reap cronjob pods that end up in a state of Error.
The cronjob had failedJobsHistoryLimit set to 2.
Environment:
Kubernetes version (kubectl version):
Client Version: version.Info{Major:"1", Minor:"7", GitVersion:"v1.7.6", GitCommit:"4bc5e7f9a6c25dc4c03d4d656f2cefd21540e28c", GitTreeState:"clean", BuildDate:"2017-09-15T08:51:09Z", GoVersion:"go1.9", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"6", GitVersion:"v1.6.4", GitCommit:"d6f433224538d4f9ca2f7ae19b252e6fcb66a3ae", GitTreeState:"clean", BuildDate:"2017-05-19T18:33:17Z", GoVersion:"go1.7.5", Compiler:"gc", Platform:"linux/amd64"}
OS: CentOS 7.3
Kernel (uname -a): 4.4.83-1.el7.elrepo.x86_64 #1 SMP Thu Aug 17 09:03:51 EDT 2017 x86_64 x86_64 x86_64 GNU/Linux