kubernetes / kubernetes

Production-Grade Container Scheduling and Management
https://kubernetes.io
Apache License 2.0

e2e flake: Pods should be restarted with a docker exec "cat /tmp/health" liveness probe [Conformance] #20418

Closed: gmarek closed this issue 8 years ago

gmarek commented 8 years ago

http://kubekins.dls.corp.google.com/view/Critical%20Builds/job/kubernetes-e2e-gke/186/

/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/pods.go:581 Feb 1 07:21:12.848: pod e2e-tests-pods-smhu6/liveness-exec - expected number of restarts: %!t(int=1), found restarts: %!t(int=0)

There are a couple more failures in this suite - all of them look like some problem with a node.

cc @kubernetes/goog-gke @kubernetes/goog-node

roberthbailey commented 8 years ago

I've investigated the first failure (Pods should be restarted with a docker exec "cat /tmp/health" liveness probe [Conformance]) and the logs are pretty useless:

  1. I just found (though apparently this was somewhat common knowledge) that any timestamps printed when you use By(...) in tests show the time when the test ended, so you can't see when each step ran (see the sketch at the end of this comment).
  2. The timestamps for all of the events are for when the events are captured, which doesn't help determine the temporal relationship between events and log statements.

I've sent #20440 to help with the first issue (for this test).
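
For reference, a minimal sketch of recording per-step wall-clock times, assuming only Ginkgo's By and the standard library; timedBy is a hypothetical helper, not the actual #20440 change:

```go
package e2e

import (
	"fmt"
	"time"

	. "github.com/onsi/ginkgo"
)

// timedBy is a hypothetical wrapper around Ginkgo's By that prints the
// wall-clock start time and duration of each step, so a failed step can be
// lined up against node-level logs instead of only the test's end time.
func timedBy(step string, body func()) {
	By(step)
	start := time.Now()
	body()
	fmt.Printf("STEP %q started at %v, took %v\n", step, start, time.Since(start))
}
```

A test would then call timedBy("creating pod with liveness probe", func() { ... }) in place of a bare By(...).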

roberthbailey commented 8 years ago

Most of the remaining tests failed because the test pod didn't start within 5 minutes. For the last few, there are errors saying that the kubelet isn't ready:

INFO: Condition Ready of node gke-jkns-gke-e2e-ci-6ab6c4e9-node-g4b1 is false instead of true. Reason: KubeletNotReady, message: ConfigureCBR0 requested, but PodCIDR not set. Will not configure CBR0 right now

yujuhong commented 8 years ago

For build 206, docker seems to be in a bad state and got restarted repeatedly (kubernetes-e2e-gke/206/artifacts/104.154.19.60:22-supervisord.log).

For builds 186 and 243, the kubelet sync loop seems to be stuck and not responsive at all. This could be caused by a docker ps hang or by a kubelet-internal issue. I am leaning towards the former, since we've encountered that in the soak cluster pretty often. We should change docker-checker.sh to use docker ps (instead of docker version) to surface this problem. The script performs the health check every 10s, so the performance impact should be limited. I can also add some more logging to help us better diagnose the problem (although the logging level of a GKE test cluster is set to only 2). /cc @dchen1107
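
For illustration only, a hedged Go sketch of that health-check idea (the real docker-checker.sh is a shell script; none of the names below come from it):

```go
package main

import (
	"context"
	"fmt"
	"os/exec"
	"time"
)

// dockerHealthy runs `docker ps` under a deadline. Unlike `docker version`,
// which only needs the daemon's API endpoint to answer, `docker ps` walks the
// container list, so a hung daemon surfaces as a timeout here.
func dockerHealthy(timeout time.Duration) bool {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()
	return exec.CommandContext(ctx, "docker", "ps", "-q").Run() == nil
}

func main() {
	// Check every 10s, mirroring the interval mentioned above.
	for range time.Tick(10 * time.Second) {
		if !dockerHealthy(9 * time.Second) {
			fmt.Println("docker ps health check failed; supervisor should restart docker")
		}
	}
}
```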

roberthbailey commented 8 years ago

@yujuhong - Did you send a PR to change the docker healthcheck?

yujuhong commented 8 years ago

Not yet. Will do it soon.

roberthbailey commented 8 years ago

My PR has been merged, assigning to @yujuhong to follow up after her PR goes in.

yujuhong commented 8 years ago

ref: #9896. We should add timeouts to all blocking (docker) calls. This feature is not yet provided by go-dockerclient.
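
A minimal sketch of the caller-side workaround, assuming nothing about go-dockerclient's API (callWithTimeout is a hypothetical helper):

```go
package dockertools

import (
	"fmt"
	"time"
)

// callWithTimeout runs a blocking call (e.g. a docker list or inspect
// operation) in a goroutine and gives up after a deadline. A call that never
// returns still leaks its goroutine and the hung request behind it, which is
// why per-call timeouts inside the client library would be the proper fix.
func callWithTimeout(call func() error, timeout time.Duration) error {
	done := make(chan error, 1)
	go func() { done <- call() }()
	select {
	case err := <-done:
		return err
	case <-time.After(timeout):
		return fmt.Errorf("docker operation did not return within %v", timeout)
	}
}
```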

ikehz commented 8 years ago

Another example http://kubekins.dls.corp.google.com:8081/job/kubernetes-pull-build-test-e2e-gce/26539/consoleFull

yujuhong commented 8 years ago

Build kubernetes-e2e-gke/318/ exhibited the same symptom: http://kubekins.dls.corp.google.com/view/Critical%20Builds/job/kubernetes-e2e-gke/318/consoleFull

The supervisord log shows that docker failed its health check long before the kubelet did: https://pantheon.corp.google.com/m/cloudstorage/b/kubernetes-jenkins/o/logs/kubernetes-e2e-gke/318/artifacts/104.197.223.136%3A22-supervisord.log

2016-02-03 12:00:49,920 CRIT Supervisor running as root (no user in config file)
2016-02-03 12:00:50,015 INFO RPC interface 'supervisor' initialized
2016-02-03 12:00:50,015 WARN cElementTree not installed, using slower XML parser for XML-RPC
2016-02-03 12:00:50,016 CRIT Server 'unix_http_server' running without any HTTP authentication checking
2016-02-03 12:00:50,017 INFO daemonizing the supervisord process
2016-02-03 12:00:50,018 INFO supervisord started with pid 2294
2016-02-03 12:01:42,531 WARN received SIGTERM indicating exit request
2016-02-03 12:01:47,581 CRIT Supervisor running as root (no user in config file)
2016-02-03 12:01:47,581 WARN Included extra file "/etc/supervisor/conf.d/kubelet.conf" during parsing
2016-02-03 12:01:47,581 WARN Included extra file "/etc/supervisor/conf.d/docker.conf" during parsing
2016-02-03 12:01:47,601 INFO RPC interface 'supervisor' initialized
2016-02-03 12:01:47,601 WARN cElementTree not installed, using slower XML parser for XML-RPC
2016-02-03 12:01:47,601 CRIT Server 'unix_http_server' running without any HTTP authentication checking
2016-02-03 12:01:47,601 INFO daemonizing the supervisord process
2016-02-03 12:01:47,602 INFO supervisord started with pid 3398
2016-02-03 12:01:48,604 INFO spawned: 'kubelet' with pid 3409
2016-02-03 12:01:48,606 INFO spawned: 'docker' with pid 3410
2016-02-03 12:01:49,662 INFO success: kubelet entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2016-02-03 12:01:49,662 INFO success: docker entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2016-02-03 12:06:10,884 INFO exited: docker (exit status 2; expected)
2016-02-03 12:06:11,886 INFO spawned: 'docker' with pid 16186
2016-02-03 12:06:12,916 INFO success: docker entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2016-02-03 12:07:11,939 INFO exited: docker (exit status 2; expected)
2016-02-03 12:07:12,940 INFO spawned: 'docker' with pid 16636
2016-02-03 12:07:13,969 INFO success: docker entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2016-02-03 12:10:49,676 INFO exited: kubelet (exit status 2; expected)
2016-02-03 12:10:50,678 INFO spawned: 'kubelet' with pid 17154
2016-02-03 12:10:51,721 INFO success: kubelet entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2016-02-03 12:11:53,573 INFO exited: docker (exit status 2; expected)
2016-02-03 12:11:54,576 INFO spawned: 'docker' with pid 17289
2016-02-03 12:11:55,601 INFO success: docker entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2016-02-03 12:13:14,606 INFO exited: docker (exit status 2; expected)
2016-02-03 12:13:15,608 INFO spawned: 'docker' with pid 17416
2016-02-03 12:13:16,638 INFO success: docker entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2016-02-03 12:14:35,644 INFO exited: docker (exit status 2; expected)
2016-02-03 12:14:36,646 INFO spawned: 'docker' with pid 17519
2016-02-03 12:14:37,674 INFO success: docker entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2016-02-03 12:15:56,679 INFO exited: docker (exit status 2; expected)
2016-02-03 12:15:57,682 INFO spawned: 'docker' with pid 17639
2016-02-03 12:15:58,707 INFO success: docker entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2016-02-03 12:16:31,153 INFO exited: kubelet (exit status 2; expected)
2016-02-03 12:16:32,155 INFO spawned: 'kubelet' with pid 17717
2016-02-03 12:16:33,197 INFO success: kubelet entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2016-02-03 12:17:17,712 INFO exited: docker (exit status 2; expected)
2016-02-03 12:17:18,714 INFO spawned: 'docker' with pid 17783
2016-02-03 12:17:19,740 INFO success: docker entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2016-02-03 12:17:32,202 INFO exited: kubelet (exit status 2; expected)
2016-02-03 12:17:33,205 INFO spawned: 'kubelet' with pid 17826
2016-02-03 12:17:34,246 INFO success: kubelet entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2016-02-03 12:18:33,253 INFO exited: kubelet (exit status 2; expected)
2016-02-03 12:18:34,254 INFO spawned: 'kubelet' with pid 17888

This issue is likely caused by a kernel bug. xref: https://github.com/docker/docker/issues/5618, https://bugzilla.kernel.org/show_bug.cgi?id=81211. There is not much we can do to work around this; only a node reboot can fix it.

wojtek-t commented 8 years ago

This happened again: http://kubekins.dls.corp.google.com:8080/view/Critical%20Builds/job/kubernetes-e2e-gke/431/consoleFull

gmarek commented 8 years ago

And again: http://kubekins.dls.corp.google.com/view/Critical%20Builds/job/kubernetes-e2e-gke/441/

This is the main reason for the blocked merge queue today.

gmarek commented 8 years ago

@yujuhong - given that we can't do anything about it and it's causing major pain, do you think we can demote this test to flaky?

yujuhong commented 8 years ago

@yujuhong - given that we can't do anything about it and it's causing major pain, do you think we can demote this test to flaky?

The last few failures don't seem to have the docker hang problem. However, the logging level in the GCE cluster is too low to identify the root cause. I filed #20661 to increase the log level.

lavalamp commented 8 years ago

Another occurrence: https://cloud.google.com/console/storage/kubernetes-jenkins/logs/kubernetes-e2e-gke/541/

roberthbailey commented 8 years ago

Saw this again in the presubmit run for #18736.

roberthbailey commented 8 years ago

Logs at https://storage.cloud.google.com/kubernetes-jenkins/pr-logs/pull/18736/kubernetes-pull-build-test-e2e-gce/27348

yujuhong commented 8 years ago

Logs at https://storage.cloud.google.com/kubernetes-jenkins/pr-logs/pull/18736/kubernetes-pull-build-test-e2e-gce/27348

Thanks, the GCE logs are way more informative.

I0207 22:50:55.782281    3402 server.go:569] Event(api.ObjectReference{Kind:"Pod", Namespace:"e2e-tests-pods-m8q2q", Name:"liveness-exec", UID:"2cf756cf-cded-11e5-8930-42010af00002", APIVersion:"v1", ResourceVersion:"574", FieldPath:"spec.containers{liveness}"}): type: 'Warning' reason: 'Unhealthy' Liveness probe failed: cat: can't open '/tmp/health': No such file or directory
I0207 22:50:56.554463    3402 manager.go:1611] pod "liveness-exec_e2e-tests-pods-m8q2q(2cf756cf-cded-11e5-8930-42010af00002)" container "liveness" is unhealthy, it will be killed and re-created.
I0207 22:50:56.554564    3402 manager.go:1247] Killing container "c6720ce35ac7eee98fb5b7307d01df339138df0092df3901c25f4c2985982589 liveness e2e-tests-pods-m8q2q/liveness-exec" with 30 second grace period
I0207 22:51:26.727865    3402 manager.go:1771] Creating container &{Name:liveness Image:gcr.io/google_containers/busybox Command:[/bin/sh -c echo ok >/tmp/health; sleep 10; rm -rf /tmp/health; sleep 600] Args:[] WorkingDir: Ports:[] Env:[] Resources:{Limits:map[] Requests:map[]} VolumeMounts:[{Name:default-token-957s5 ReadOnly:true MountPath:/var/run/secrets/kubernetes.io/serviceaccount}] LivenessProbe:0xc208694e00 ReadinessProbe:<nil> Lifecycle:<nil> TerminationMessagePath:/dev/termination-log ImagePullPolicy:Always SecurityContext:<nil> Stdin:false StdinOnce:false TTY:false} in pod liveness-exec_e2e-tests-pods-m8q2q(2cf756cf-cded-11e5-8930-42010af00002)
I0207 22:51:26.728028    3402 server.go:569] Event(api.ObjectReference{Kind:"Pod", Namespace:"e2e-tests-pods-m8q2q", Name:"liveness-exec", UID:"2cf756cf-cded-11e5-8930-42010af00002", APIVersion:"v1", ResourceVersion:"574", FieldPath:"spec.containers{liveness}"}): type: 'Normal' reason: 'Killing' Killing container with docker id c6720ce35ac7: pod "liveness-exec_e2e-tests-pods-m8q2q(2cf756cf-cded-11e5-8930-42010af00002)" container "liveness" is unhealthy, it will be killed and re-created.
I0207 22:51:26.731311    3402 server.go:569] Event(api.ObjectReference{Kind:"Pod", Namespace:"e2e-tests-pods-m8q2q", Name:"liveness-exec", UID:"2cf756cf-cded-11e5-8930-42010af00002", APIVersion:"v1", ResourceVersion:"574", FieldPath:"spec.containers{liveness}"}): type: 'Normal' reason: 'Pulling' pulling image "gcr.io/google_containers/busybox"
I0207 22:51:27.718931    3402 kubelet.go:2339] SyncLoop (PLEG): "liveness-exec_e2e-tests-pods-m8q2q(2cf756cf-cded-11e5-8930-42010af00002)", event: &pleg.PodLifecycleEvent{ID:"2cf756cf-cded-11e5-8930-42010af00002", Type:"ContainerDied", Data:"c6720ce35ac7eee98fb5b7307d01df339138df0092df3901c25f4c2985982589"}
------ Test timed out at ~22:52:35. Test deleted pod --------------- 
I0207 22:53:12.545027    3402 server.go:569] Event(api.ObjectReference{Kind:"Pod", Namespace:"e2e-tests-pods-m8q2q", Name:"liveness-exec", UID:"2cf756cf-cded-11e5-8930-42010af00002", APIVersion:"v1", ResourceVersion:"574", FieldPath:"spec.containers{liveness}"}): type: 'Normal' reason: 'Pulled' Successfully pulled image "gcr.io/google_containers/busybox"
I0207 22:53:12.545051    3402 server.go:569] Event(api.ObjectReference{Kind:"Pod", Namespace:"e2e-tests-pods-m8q2q", Name:"liveness-exec", UID:"2cf756cf-cded-11e5-8930-42010af00002", APIVersion:"v1", ResourceVersion:"574", FieldPath:""}): type: 'Warning' reason: 'FailedSync' Error syncing pod, skipping: failed to "StartContainer" for "liveness" with RunContainerError: "GenerateRunContainerOptions: impossible: cannot find the mounted volumes for pod \"liveness-exec_e2e-tests-pods-m8q2q(2cf756cf-cded-11e5-8930-42010af00002)\""
E0207 22:53:12.547662    3402 event.go:193] Server rejected event '&api.Event{TypeMeta:unversioned.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:api.ObjectMeta{Name:"liveness-exec.1430ca97637a2e4d", GenerateName:"", Namespace:"e2e-tests-pods-m8q2q", SelfLink:"", UID:"", ResourceVersion:"963", Generation:0, CreationTimestamp:unversioned.Time{Time:time.Time{sec:0, nsec:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*unversioned.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil)}, InvolvedObject:api.ObjectReference{Kind:"Pod", Namespace:"e2e-tests-pods-m8q2q", Name:"liveness-exec", UID:"2cf756cf-cded-11e5-8930-42010af00002", APIVersion:"v1", ResourceVersion:"574", FieldPath:"spec.containers{liveness}"}, Reason:"Pulled", Message:"Successfully pulled image \"gcr.io/google_containers/busybox\"", Source:api.EventSource{Component:"kubelet", Host:"e2e-gce-master-0-minion-ymkr"}, FirstTimestamp:unversioned.Time{Time:time.Time{sec:63590482231, nsec:0, loc:(*time.Location)(0x2d7a040)}}, LastTimestamp:unversioned.Time{Time:time.Time{sec:63590482392, nsec:544889837, loc:(*time.Location)(0x2d7a040)}}, Count:2, Type:"Normal"}': 'events "liveness-exec.1430ca97637a2e4d" not found' (will not retry!)
E0207 22:53:12.549987    3402 event.go:193] Server rejected event '&api.Event{TypeMeta:unversioned.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:api.ObjectMeta{Name:"liveness-exec.1430cabcf47a6fbb", GenerateName:"", Namespace:"e2e-tests-pods-m8q2q", SelfLink:"", UID:"", ResourceVersion:"", Generation:0, CreationTimestamp:unversioned.Time{Time:time.Time{sec:0, nsec:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*unversioned.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil)}, InvolvedObject:api.ObjectReference{Kind:"Pod", Namespace:"e2e-tests-pods-m8q2q", Name:"liveness-exec", UID:"2cf756cf-cded-11e5-8930-42010af00002", APIVersion:"v1", ResourceVersion:"574", FieldPath:""}, Reason:"FailedSync", Message:"Error syncing pod, skipping: failed to \"StartContainer\" for \"liveness\" with RunContainerError: \"GenerateRunContainerOptions: impossible: cannot find the mounted volumes for pod \\\"liveness-exec_e2e-tests-pods-m8q2q(2cf756cf-cded-11e5-8930-42010af00002)\\\"\"\n", Source:api.EventSource{Component:"kubelet", Host:"e2e-gce-master-0-minion-ymkr"}, FirstTimestamp:unversioned.Time{Time:time.Time{sec:63590482392, nsec:544997307, loc:(*time.Location)(0x2d7a040)}}, LastTimestamp:unversioned.Time{Time:time.Time{sec:63590482392, nsec:544997307, loc:(*time.Location)(0x2d7a040)}}, Count:1, Type:"Warning"}': 'namespaces "e2e-tests-pods-m8q2q" not found' (will not retry!)

In short, the kubelet detected the unhealthy container (via the liveness probe failure), killed the container, and attempted to create a new container. However, it spent ~100s pulling the image, which did not complete before the test timed out.

From the pod spec, the image pull policy is ImagePullPolicy: Always. We should change it to IfNotPresent to avoid pulling the image every time. This may apply to other e2e tests as well.
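
A sketch of that change against the container spec shown in the log above (the probe fields here are illustrative; only the ImagePullPolicy line is the point):

```go
package e2e

import "k8s.io/kubernetes/pkg/api"

// livenessExecContainer sketches the suggested tweak: keep the same busybox
// liveness-exec container, but pull the image only if it is not already on
// the node, so a slow registry pull cannot eat into the window in which the
// test expects to observe a restart.
var livenessExecContainer = api.Container{
	Name:    "liveness",
	Image:   "gcr.io/google_containers/busybox",
	Command: []string{"/bin/sh", "-c", "echo ok >/tmp/health; sleep 10; rm -rf /tmp/health; sleep 600"},
	LivenessProbe: &api.Probe{
		Handler:             api.Handler{Exec: &api.ExecAction{Command: []string{"cat", "/tmp/health"}}},
		InitialDelaySeconds: 15, // illustrative; the actual test value may differ
	},
	// Previously ImagePullPolicy: api.PullAlways, which re-pulled the image on
	// every container (re)start.
	ImagePullPolicy: api.PullIfNotPresent,
}
```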

yujuhong commented 8 years ago

Filed #20836 to address the issue in all e2e tests.

yujuhong commented 8 years ago

Forked the original docker ps hang/kernel bug to #20876.