I've investigated the first failure (`Pods should be restarted with a docker exec "cat /tmp/health" liveness probe [Conformance]`) and the logs are pretty useless: `By(...)` messages in tests show the time when the test ended, so you can't see when each step ran. I've sent #20440 to help with the first issue (for this test).
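For context, the pod this test creates looks roughly like the following. This is a minimal sketch reconstructed from the kubelet log excerpt further down in this thread, written against the current k8s.io/api/core/v1 types rather than the actual test code, so field names, the probe delay, and other details are illustrative:

```go
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func livenessExecPod() *v1.Pod {
	return &v1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: "liveness-exec"},
		Spec: v1.PodSpec{
			Containers: []v1.Container{{
				Name:  "liveness",
				Image: "gcr.io/google_containers/busybox",
				// The container creates /tmp/health, removes it after 10s,
				// then sleeps, so the probe below starts failing mid-test.
				Command: []string{"/bin/sh", "-c",
					"echo ok >/tmp/health; sleep 10; rm -rf /tmp/health; sleep 600"},
				LivenessProbe: &v1.Probe{
					ProbeHandler: v1.ProbeHandler{
						Exec: &v1.ExecAction{Command: []string{"cat", "/tmp/health"}},
					},
					InitialDelaySeconds: 15, // illustrative; not taken from the logs
				},
				ImagePullPolicy: v1.PullAlways,
			}},
		},
	}
}

func main() {
	fmt.Println(livenessExecPod().Name)
}
```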
Most of the remaining tests failed because the test pod didn't start within 5 minutes. For the last few, there are errors saying that the kubelet isn't ready: `INFO: Condition Ready of node gke-jkns-gke-e2e-ci-6ab6c4e9-node-g4b1 is false instead of true. Reason: KubeletNotReady, message: ConfigureCBR0 requested, but PodCIDR not set. Will not configure CBR0 right now`
For build 206, docker seems to be in a bad state and got restarted repeatedly (kubernetes-e2e-gke/206/artifacts/104.154.19.60:22-supervisord.log).
For builds 186 and 243, the kubelet sync loop seems to be stuck and not responsive at all. This could be caused by a `docker ps` hang or by an internal kubelet issue. I am leaning towards the former, since we've encountered that in the soak cluster pretty often. We should change `docker-checker.sh` to use `docker ps` (instead of `docker version`) to surface this problem. The script performs the health check every 10s, so the performance impact should be limited. I can also add some more logging to help us better diagnose the problem (although the logging level of a GKE test cluster is set to only 2). /cc @dchen1107
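To illustrate the idea (this is not the actual `docker-checker.sh` change, just a rough Go sketch with illustrative timeout values): running `docker ps` under a hard timeout makes a wedged daemon show up as a failed check, whereas `docker version` apparently keeps succeeding and doesn't surface the problem.

```go
package main

import (
	"context"
	"log"
	"os/exec"
	"time"
)

// checkDocker runs "docker ps" under a hard timeout so that a wedged docker
// daemon shows up as a failed health check instead of a silent hang.
func checkDocker(timeout time.Duration) error {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()
	// "docker ps" exercises the container-listing path that hangs here;
	// "docker version" can succeed even when that path is stuck.
	return exec.CommandContext(ctx, "docker", "ps", "-q").Run()
}

func main() {
	// Illustrative 10s interval, mirroring the check frequency mentioned above.
	for {
		if err := checkDocker(10 * time.Second); err != nil {
			log.Printf("docker health check failed: %v", err)
			// supervisord (or equivalent) would restart docker at this point.
		}
		time.Sleep(10 * time.Second)
	}
}
```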
@yujuhong - Did you send a PR to change the docker healthcheck?
Not yet. Will do it soon.
My PR has been merged, assigning to @yujuhong to follow up after her PR goes in.
ref: #9896. We should add timeouts to all blocking (docker) calls. This feature is not yet provided by go-dockerclient.
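Until the client library supports per-call timeouts, a caller-side workaround is to run the blocking call in a goroutine and stop waiting after a deadline. A minimal sketch (the wrapper name and timeout are illustrative; note the caveat that the underlying call keeps running in the background, so this only unblocks the caller, e.g. the kubelet sync loop):

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// ErrTimeout is returned when the wrapped call does not finish in time.
var ErrTimeout = errors.New("operation timed out")

// withTimeout runs op in a goroutine and gives up waiting after d. The
// underlying call is not cancelled; the goroutine is leaked until it returns.
func withTimeout(d time.Duration, op func() error) error {
	done := make(chan error, 1)
	go func() { done <- op() }()
	select {
	case err := <-done:
		return err
	case <-time.After(d):
		return ErrTimeout
	}
}

func main() {
	// Stand-in for a blocking docker client call such as listing containers.
	slowCall := func() error { time.Sleep(3 * time.Second); return nil }
	fmt.Println(withTimeout(1*time.Second, slowCall))
}
```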
Build kubernetes-e2e-gke/318/ exhibited the same symptom: http://kubekins.dls.corp.google.com/view/Critical%20Builds/job/kubernetes-e2e-gke/318/consoleFull The supervisord log shows that docker failed its health check long before kubelet failed. https://pantheon.corp.google.com/m/cloudstorage/b/kubernetes-jenkins/o/logs/kubernetes-e2e-gke/318/artifacts/104.197.223.136%3A22-supervisord.log
2016-02-03 12:00:49,920 CRIT Supervisor running as root (no user in config file)
2016-02-03 12:00:50,015 INFO RPC interface 'supervisor' initialized
2016-02-03 12:00:50,015 WARN cElementTree not installed, using slower XML parser for XML-RPC
2016-02-03 12:00:50,016 CRIT Server 'unix_http_server' running without any HTTP authentication checking
2016-02-03 12:00:50,017 INFO daemonizing the supervisord process
2016-02-03 12:00:50,018 INFO supervisord started with pid 2294
2016-02-03 12:01:42,531 WARN received SIGTERM indicating exit request
2016-02-03 12:01:47,581 CRIT Supervisor running as root (no user in config file)
2016-02-03 12:01:47,581 WARN Included extra file "/etc/supervisor/conf.d/kubelet.conf" during parsing
2016-02-03 12:01:47,581 WARN Included extra file "/etc/supervisor/conf.d/docker.conf" during parsing
2016-02-03 12:01:47,601 INFO RPC interface 'supervisor' initialized
2016-02-03 12:01:47,601 WARN cElementTree not installed, using slower XML parser for XML-RPC
2016-02-03 12:01:47,601 CRIT Server 'unix_http_server' running without any HTTP authentication checking
2016-02-03 12:01:47,601 INFO daemonizing the supervisord process
2016-02-03 12:01:47,602 INFO supervisord started with pid 3398
2016-02-03 12:01:48,604 INFO spawned: 'kubelet' with pid 3409
2016-02-03 12:01:48,606 INFO spawned: 'docker' with pid 3410
2016-02-03 12:01:49,662 INFO success: kubelet entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2016-02-03 12:01:49,662 INFO success: docker entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2016-02-03 12:06:10,884 INFO exited: docker (exit status 2; expected)
2016-02-03 12:06:11,886 INFO spawned: 'docker' with pid 16186
2016-02-03 12:06:12,916 INFO success: docker entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2016-02-03 12:07:11,939 INFO exited: docker (exit status 2; expected)
2016-02-03 12:07:12,940 INFO spawned: 'docker' with pid 16636
2016-02-03 12:07:13,969 INFO success: docker entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2016-02-03 12:10:49,676 INFO exited: kubelet (exit status 2; expected)
2016-02-03 12:10:50,678 INFO spawned: 'kubelet' with pid 17154
2016-02-03 12:10:51,721 INFO success: kubelet entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2016-02-03 12:11:53,573 INFO exited: docker (exit status 2; expected)
2016-02-03 12:11:54,576 INFO spawned: 'docker' with pid 17289
2016-02-03 12:11:55,601 INFO success: docker entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2016-02-03 12:13:14,606 INFO exited: docker (exit status 2; expected)
2016-02-03 12:13:15,608 INFO spawned: 'docker' with pid 17416
2016-02-03 12:13:16,638 INFO success: docker entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2016-02-03 12:14:35,644 INFO exited: docker (exit status 2; expected)
2016-02-03 12:14:36,646 INFO spawned: 'docker' with pid 17519
2016-02-03 12:14:37,674 INFO success: docker entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2016-02-03 12:15:56,679 INFO exited: docker (exit status 2; expected)
2016-02-03 12:15:57,682 INFO spawned: 'docker' with pid 17639
2016-02-03 12:15:58,707 INFO success: docker entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2016-02-03 12:16:31,153 INFO exited: kubelet (exit status 2; expected)
2016-02-03 12:16:32,155 INFO spawned: 'kubelet' with pid 17717
2016-02-03 12:16:33,197 INFO success: kubelet entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2016-02-03 12:17:17,712 INFO exited: docker (exit status 2; expected)
2016-02-03 12:17:18,714 INFO spawned: 'docker' with pid 17783
2016-02-03 12:17:19,740 INFO success: docker entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2016-02-03 12:17:32,202 INFO exited: kubelet (exit status 2; expected)
2016-02-03 12:17:33,205 INFO spawned: 'kubelet' with pid 17826
2016-02-03 12:17:34,246 INFO success: kubelet entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2016-02-03 12:18:33,253 INFO exited: kubelet (exit status 2; expected)
2016-02-03 12:18:34,254 INFO spawned: 'kubelet' with pid 17888
This issue is likely caused by a kernel bug (xref: https://github.com/docker/docker/issues/5618, https://bugzilla.kernel.org/show_bug.cgi?id=81211). There is not much we can do to work around it; only a node reboot fixes it.
And again: http://kubekins.dls.corp.google.com/view/Critical%20Builds/job/kubernetes-e2e-gke/441/
This is the main reason for the blocked merge queue today.
@yujuhong - given that we can't do anything about it and it's causing major pain, do you think we can demote this test to flaky?
The last few failures don't seem to involve the docker hang problem. However, the logging level in the GCE cluster is too low to identify the root cause. I filed #20661 to increase the log level.
Saw this again in the presubmit run for #18736.
Thanks, the GCE logs are way more informative.
I0207 22:50:55.782281 3402 server.go:569] Event(api.ObjectReference{Kind:"Pod", Namespace:"e2e-tests-pods-m8q2q", Name:"liveness-exec", UID:"2cf756cf-cded-11e5-8930-42010af00002", APIVersion:"v1", ResourceVersion:"574", FieldPath:"spec.containers{liveness}"}): type: 'Warning' reason: 'Unhealthy' Liveness probe failed: cat: can't open '/tmp/health': No such file or directory
I0207 22:50:56.554463 3402 manager.go:1611] pod "liveness-exec_e2e-tests-pods-m8q2q(2cf756cf-cded-11e5-8930-42010af00002)" container "liveness" is unhealthy, it will be killed and re-created.
I0207 22:50:56.554564 3402 manager.go:1247] Killing container "c6720ce35ac7eee98fb5b7307d01df339138df0092df3901c25f4c2985982589 liveness e2e-tests-pods-m8q2q/liveness-exec" with 30 second grace period
I0207 22:51:26.727865 3402 manager.go:1771] Creating container &{Name:liveness Image:gcr.io/google_containers/busybox Command:[/bin/sh -c echo ok >/tmp/health; sleep 10; rm -rf /tmp/health; sleep 600] Args:[] WorkingDir: Ports:[] Env:[] Resources:{Limits:map[] Requests:map[]} VolumeMounts:[{Name:default-token-957s5 ReadOnly:true MountPath:/var/run/secrets/kubernetes.io/serviceaccount}] LivenessProbe:0xc208694e00 ReadinessProbe:<nil> Lifecycle:<nil> TerminationMessagePath:/dev/termination-log ImagePullPolicy:Always SecurityContext:<nil> Stdin:false StdinOnce:false TTY:false} in pod liveness-exec_e2e-tests-pods-m8q2q(2cf756cf-cded-11e5-8930-42010af00002)
I0207 22:51:26.728028 3402 server.go:569] Event(api.ObjectReference{Kind:"Pod", Namespace:"e2e-tests-pods-m8q2q", Name:"liveness-exec", UID:"2cf756cf-cded-11e5-8930-42010af00002", APIVersion:"v1", ResourceVersion:"574", FieldPath:"spec.containers{liveness}"}): type: 'Normal' reason: 'Killing' Killing container with docker id c6720ce35ac7: pod "liveness-exec_e2e-tests-pods-m8q2q(2cf756cf-cded-11e5-8930-42010af00002)" container "liveness" is unhealthy, it will be killed and re-created.
I0207 22:51:26.731311 3402 server.go:569] Event(api.ObjectReference{Kind:"Pod", Namespace:"e2e-tests-pods-m8q2q", Name:"liveness-exec", UID:"2cf756cf-cded-11e5-8930-42010af00002", APIVersion:"v1", ResourceVersion:"574", FieldPath:"spec.containers{liveness}"}): type: 'Normal' reason: 'Pulling' pulling image "gcr.io/google_containers/busybox"
I0207 22:51:27.718931 3402 kubelet.go:2339] SyncLoop (PLEG): "liveness-exec_e2e-tests-pods-m8q2q(2cf756cf-cded-11e5-8930-42010af00002)", event: &pleg.PodLifecycleEvent{ID:"2cf756cf-cded-11e5-8930-42010af00002", Type:"ContainerDied", Data:"c6720ce35ac7eee98fb5b7307d01df339138df0092df3901c25f4c2985982589"}
------ Test timed out at ~22:52:35. Test deleted pod ---------------
I0207 22:53:12.545027 3402 server.go:569] Event(api.ObjectReference{Kind:"Pod", Namespace:"e2e-tests-pods-m8q2q", Name:"liveness-exec", UID:"2cf756cf-cded-11e5-8930-42010af00002", APIVersion:"v1", ResourceVersion:"574", FieldPath:"spec.containers{liveness}"}): type: 'Normal' reason: 'Pulled' Successfully pulled image "gcr.io/google_containers/busybox"
I0207 22:53:12.545051 3402 server.go:569] Event(api.ObjectReference{Kind:"Pod", Namespace:"e2e-tests-pods-m8q2q", Name:"liveness-exec", UID:"2cf756cf-cded-11e5-8930-42010af00002", APIVersion:"v1", ResourceVersion:"574", FieldPath:""}): type: 'Warning' reason: 'FailedSync' Error syncing pod, skipping: failed to "StartContainer" for "liveness" with RunContainerError: "GenerateRunContainerOptions: impossible: cannot find the mounted volumes for pod \"liveness-exec_e2e-tests-pods-m8q2q(2cf756cf-cded-11e5-8930-42010af00002)\""
E0207 22:53:12.547662 3402 event.go:193] Server rejected event '&api.Event{TypeMeta:unversioned.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:api.ObjectMeta{Name:"liveness-exec.1430ca97637a2e4d", GenerateName:"", Namespace:"e2e-tests-pods-m8q2q", SelfLink:"", UID:"", ResourceVersion:"963", Generation:0, CreationTimestamp:unversioned.Time{Time:time.Time{sec:0, nsec:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*unversioned.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil)}, InvolvedObject:api.ObjectReference{Kind:"Pod", Namespace:"e2e-tests-pods-m8q2q", Name:"liveness-exec", UID:"2cf756cf-cded-11e5-8930-42010af00002", APIVersion:"v1", ResourceVersion:"574", FieldPath:"spec.containers{liveness}"}, Reason:"Pulled", Message:"Successfully pulled image \"gcr.io/google_containers/busybox\"", Source:api.EventSource{Component:"kubelet", Host:"e2e-gce-master-0-minion-ymkr"}, FirstTimestamp:unversioned.Time{Time:time.Time{sec:63590482231, nsec:0, loc:(*time.Location)(0x2d7a040)}}, LastTimestamp:unversioned.Time{Time:time.Time{sec:63590482392, nsec:544889837, loc:(*time.Location)(0x2d7a040)}}, Count:2, Type:"Normal"}': 'events "liveness-exec.1430ca97637a2e4d" not found' (will not retry!)
E0207 22:53:12.549987 3402 event.go:193] Server rejected event '&api.Event{TypeMeta:unversioned.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:api.ObjectMeta{Name:"liveness-exec.1430cabcf47a6fbb", GenerateName:"", Namespace:"e2e-tests-pods-m8q2q", SelfLink:"", UID:"", ResourceVersion:"", Generation:0, CreationTimestamp:unversioned.Time{Time:time.Time{sec:0, nsec:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*unversioned.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil)}, InvolvedObject:api.ObjectReference{Kind:"Pod", Namespace:"e2e-tests-pods-m8q2q", Name:"liveness-exec", UID:"2cf756cf-cded-11e5-8930-42010af00002", APIVersion:"v1", ResourceVersion:"574", FieldPath:""}, Reason:"FailedSync", Message:"Error syncing pod, skipping: failed to \"StartContainer\" for \"liveness\" with RunContainerError: \"GenerateRunContainerOptions: impossible: cannot find the mounted volumes for pod \\\"liveness-exec_e2e-tests-pods-m8q2q(2cf756cf-cded-11e5-8930-42010af00002)\\\"\"\n", Source:api.EventSource{Component:"kubelet", Host:"e2e-gce-master-0-minion-ymkr"}, FirstTimestamp:unversioned.Time{Time:time.Time{sec:63590482392, nsec:544997307, loc:(*time.Location)(0x2d7a040)}}, LastTimestamp:unversioned.Time{Time:time.Time{sec:63590482392, nsec:544997307, loc:(*time.Location)(0x2d7a040)}}, Count:1, Type:"Warning"}': 'namespaces "e2e-tests-pods-m8q2q" not found' (will not retry!)
In short, the kubelet detected the unhealthy container (via the liveness probe failure), killed it, and attempted to create a new container. However, it spent ~100s pulling the image, which did not complete before the test timed out.
From the pod spec, the image pull policy is `ImagePullPolicy: Always`. We should change it to `IfNotPresent` to avoid pulling the image on every container restart. This may apply to other e2e tests as well.
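Concretely, the proposed change only touches the pull policy on the container. A hedged illustration using current k8s.io/api/core/v1 types (not the actual test code):

```go
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
)

func main() {
	// Same container as in the liveness test, with only the pull policy
	// changed so restarts reuse the locally cached busybox image instead
	// of re-pulling it (the ~100s pull is what blew the test timeout).
	c := v1.Container{
		Name:            "liveness",
		Image:           "gcr.io/google_containers/busybox",
		ImagePullPolicy: v1.PullIfNotPresent, // was v1.PullAlways
	}
	fmt.Println(c.Image, c.ImagePullPolicy)
}
```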
Filed #20836 to address the issue in all e2e tests.
Forked the original `docker ps` hang / kernel bug off to #20876.
http://kubekins.dls.corp.google.com/view/Critical%20Builds/job/kubernetes-e2e-gke/186/
There are a couple more failures in this suite - all of them look like some problem with a node.
cc @kubernetes/goog-gke @kubernetes/goog-node