kubernetes / kubernetes

Production-Grade Container Scheduling and Management
https://kubernetes.io
Apache License 2.0

E2E Flakiness: [k8s.io] Restart [Disruptive] should restart all nodes and ensure all nodes and pods recover #46651

Closed: dchen1107 closed this issue 7 years ago

dchen1107 commented 7 years ago

The e2e test [k8s.io] Restart [Disruptive] should restart all nodes and ensure all nodes and pods recover has been failing a lot: https://k8s-gubernator.appspot.com/builds/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-serial. For example: https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-serial/1142/

go run hack/e2e.go -v -test --test_args='--ginkgo.focus=Restart\s\[Disruptive\]\sshould\srestart\sall\snodes\sand\sensure\sall\snodes\sand\spods\srecover$'
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/restart.go:125
May 30 00:07:42.719: At least one pod wasn't running and ready after the restart.
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/restart.go:123 (from junit_01.xml)

We should triage this failure before cutting beta.0 tomorrow.

cc/ @kubernetes/kubernetes-release-managers @kubernetes/sig-node-bugs

dchen1107 commented 7 years ago

There are many other failures in https://k8s-gubernator.appspot.com/builds/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-serial, but this one failed most frequently. Other failures might also share the same root cause as this one.

cc/ @caesarxuchao, the branch manager of 1.7, and @ericchiang, the test lead of 1.7.

yujuhong commented 7 years ago

There are multiple causes of the failures. I am looking through them one by one.

https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-serial/1142/ The test before the restart test is a PD test (should be able to detach from a node which was deleted) that deletes a GCE VM instance. The PD test does not wait for the node to come back up and verify that it becomes ready. This causes the next test (which happens to be the node restart test) to fail because the new node couldn't finish registering itself within 20s.

/cc @kubernetes/sig-storage-misc

yujuhong commented 7 years ago

https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-serial/1143/#k8sio-restart-disruptive-should-restart-all-nodes-and-ensure-all-nodes-and-pods-recover

After the node restart, the kube-dns pod (kube-dns-2836479273-r292n) could not become ready. The kubelet.log in question shows that the containers of the kube-dns pod (kubedns and dnsmasq) crashed repeatedly.

docker.log also shows that the containers died without external stop/kill commands.

time="2017-05-30T07:11:09Z" level=debug msg="containerd: process exited" id=9d92b38d0eb0b963b42adc1d67332e6966b6e3816746db194a44c4c5f9ce705f pid=init status=255 systemPid=5687 

I didn't see anything abnormal in the kernel log and all other pods were ready. There is no archived kube-dns log to help debugging. @kubernetes/sig-network-misc may have more knowledge on what might've caused this.

Edit: similar failures: https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-serial/1132/ https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-serial/1134/ https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-serial/1138/ https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-serial/1139/ https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-serial/1140/

dchen1107 commented 7 years ago

re: https://github.com/kubernetes/kubernetes/issues/46651#issuecomment-305042721

xref: https://github.com/kubernetes/kubernetes/issues/45991 cc/@MrHohn was looking into not-ready kube-dns for last alpha release.

saad-ali commented 7 years ago

There are multiple causes of the failures. I am looking through them one by one.

https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-serial/1142/ The test before the restart test is a PD test (should be able to detach from a node which was deleted) that deletes a GCE VM instance. The PD test does not wait for the node to come back up and verify that it becomes ready. This causes the next test (which happens to be the node restart test) to fail because the new node couldn't finish registering itself within 20s.

/cc @kubernetes/sig-storage-misc

@verult has volunteered to help fix the should be able to detach from a node which was deleted test so that it will do proper cleanup.

MrHohn commented 7 years ago

I didn't see anything abnormal in the kernel log and all other pods were ready. There is no archived kube-dns log to help debugging.

Indeed it is hard to tell what happened without kube-dns logs. I'm trying to reproduce on my cluster.

saad-ali commented 7 years ago

CC @kubernetes/sig-storage-test-failures

MrHohn commented 7 years ago

Reproduced on my cluster, related logs:

I0531 00:45:38.688241       1 dns.go:174] Waiting for services and endpoints to be initialized from apiserver...
E0531 00:45:38.688626       1 reflector.go:199] k8s.io/dns/vendor/k8s.io/client-go/tools/cache/reflector.go:94: Failed to list *v1.Endpoints: Get https://10.0.0.1:443/api/v1/endpoints?resourceVersion=0: dial tcp 10.0.0.1:443: i/o timeout
E0531 00:45:38.688911       1 reflector.go:199] k8s.io/dns/vendor/k8s.io/client-go/tools/cache/reflector.go:94: Failed to list *v1.Service: Get https://10.0.0.1:443/api/v1/services?resourceVersion=0: dial tcp 10.0.0.1:443: i/o timeout
I0531 00:45:39.188344       1 dns.go:174] Waiting for services and endpoints to be initialized from apiserver...
I0531 00:45:39.688270       1 dns.go:174] Waiting for services and endpoints to be initialized from apiserver...
I0531 00:45:40.188362       1 dns.go:174] Waiting for services and endpoints to be initialized from apiserver...
...
F0531 00:46:08.688268       1 dns.go:168] Timeout waiting for initialization

It seems one of the kube-dns pods failed to contact the apiserver because the underlying network was broken after the node restart; looking into it.

cc @bowei

yujuhong commented 7 years ago

Indeed it is hard to tell what happened without kube-dns logs. I'm trying to reproduce on my cluster.

FYI, I uploaded a PR to dump the logs of the unready pods: #46665

MrHohn commented 7 years ago

I believe there are some issues with kubenet. When I exec'd into the broken dns pod, wget to the apiserver on port 443 (with both the cluster IP and the public IP) didn't work (traffic didn't reach the master node). I checked the kubelet logs and saw the entry below spread all over the place:

...
W0531 00:39:36.185756    2516 docker_sandbox.go:285] NetworkPlugin kubenet failed on the status hook for pod "kube-dns-2836479273-b6txb_kube-system": Cannot find the network namespace, skipping pod network status for container {"docker" "7dfd8c51851961628dbf0a2bf1a923311e89b943b72ca1707f29c1b9bf02f61d"}
...
# Lots of similar entries.
...
W0531 03:13:28.485903    2516 docker_sandbox.go:285] NetworkPlugin kubenet failed on the status hook for pod "kube-dns-2836479273-b6txb_kube-system": Cannot find the network namespace, skipping pod network status for container {"docker" "7dfd8c51851961628dbf0a2bf1a923311e89b943b72ca1707f29c1b9bf02f61d"}

That dns pod kept crashlooping for a couple of hours until I created another pod on the same node, which for some reason caused the dns pod to recover.

kube-dns-2836479273-4dj49                             3/3       Running     5          3h
kube-dns-2836479273-b6txb                             3/3       Running     70         3h

I've never touched any of the kubenet code before; it would be great to have @freehan or @dnardo take a look...

freehan commented 7 years ago

Okay, I think I've figured out what is going on. It is a bug introduced by https://github.com/kubernetes/kubernetes/pull/43879

Here is what I think happened: The node restarts. All containers stop. The API server still lists those pods as Running. The node comes back up, and kubelet goes through PLEG.updateCache -> kubemanager.GetPodStatus -> CRI PodSandboxStatus -> GetPodNetworkStatus.

PLEG requires GetPodStatus to succeed in order to update the pod status; otherwise the pod stays stuck in the same status. After the change in #43879, GetNetNS returns an error for a terminated sandbox container, and dockershim selectively swallows that error in some cases, but not in a case like this one.
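
For illustration, a hedged sketch of the tolerance being described (getNetNS and ipFromNetNS below are illustrative stand-ins, not the actual dockershim code):

// A hedged sketch: for a sandbox that has already exited (e.g. after a node
// reboot), a missing network namespace is expected and must not fail the
// status call, otherwise PLEG can never refresh the pod's cached status.
package dockershimsketch

import "fmt"

// Illustrative stand-ins for the real dockershim helpers.
func getNetNS(sandboxID string) (string, error)    { return "", fmt.Errorf("netns not found") }
func ipFromNetNS(netnsPath string) (string, error) { return "", nil }

func podIPForSandbox(sandboxID string, exited bool) (string, error) {
	netnsPath, err := getNetNS(sandboxID)
	if err != nil {
		if exited {
			// Terminated sandbox: report "no IP" rather than an error.
			return "", nil
		}
		return "", fmt.Errorf("cannot find network namespace for sandbox %q: %v", sandboxID, err)
	}
	return ipFromNetNS(netnsPath)
}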

To unblock the release, reverting #43879 will help.

cc @dcbw

verult commented 7 years ago

@yujuhong It seems there is already logic in place in the PD test to check that all nodes are back up. Running the restart test alone also gives me the error in some runs. Maybe 20s is just not long enough to wait for the nodes to become ready?

yujuhong commented 7 years ago

@yujuhong It seems there is already logic in place in the PD test to check that all nodes are back up. Running the restart test alone also gives me the error in some runs. Maybe 20s is just not long enough to wait for the nodes to become ready?

Can the PD test wait until the nodes become ready again?
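
A minimal sketch of such a wait, using client-go (modern signatures) and the apimachinery wait helpers; the helper name, the poll interval, and the skipping of unschedulable nodes (mirroring restart.go) are assumptions, not the actual e2e framework API:

package e2esketch

import (
	"context"
	"time"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

// waitForAllNodesReady polls until every schedulable node reports Ready,
// or the timeout expires.
func waitForAllNodesReady(c kubernetes.Interface, timeout time.Duration) error {
	return wait.PollImmediate(5*time.Second, timeout, func() (bool, error) {
		nodes, err := c.CoreV1().Nodes().List(context.TODO(), metav1.ListOptions{})
		if err != nil {
			return false, nil // tolerate transient API errors and keep polling
		}
		for _, n := range nodes.Items {
			if n.Spec.Unschedulable {
				continue // restart.go excludes unschedulable nodes; mirror that
			}
			ready := false
			for _, cond := range n.Status.Conditions {
				if cond.Type == v1.NodeReady && cond.Status == v1.ConditionTrue {
					ready = true
					break
				}
			}
			if !ready {
				return false, nil
			}
		}
		return true, nil
	})
}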

verult commented 7 years ago

There's also logic inside AfterEach() in framework.go that makes sure the nodes are ready, although it includes unschedulable nodes, whereas the check in restart.go excludes them: https://github.com/kubernetes/kubernetes/blob/master/test/e2e/framework/util.go#L4087

Looking into why this is the case.

dcbw commented 7 years ago

@freehan fun, looking into this...

dcbw commented 7 years ago

@freehan so, the log messages from https://github.com/kubernetes/kubernetes/issues/46651#issuecomment-305077524 would only be printed before #43879 was merged. So for that reason, this flake was occurring before that change as well, at least if we think @MrHohn's problem is the same.

MrHohn commented 7 years ago

so, the log messages from #46651 (comment) would only be printed before #43879 was merged.

Just checked, and that's true. I reproduced the issue at commit e0a6cde6f43e5c628395f8f9b5589c5d6298ec8e, which does not include #43879.

caesarxuchao commented 7 years ago

Does that mean reverting #43879 is not going to fix the issue? Do we need to wait for another fix?

freehan commented 7 years ago

To clarify, the symptom I observed in https://github.com/kubernetes/kubernetes/issues/46651#issuecomment-305339470 was on HEAD at the time, which includes #43879. I believe #43879 makes things worse.

For https://github.com/kubernetes/kubernetes/issues/46651#issuecomment-305077524, @MrHohn was running a cluster slightly older than HEAD, which did not include #43879.

So reverting #43879 solved half of the problem... There may be another bug.

MrHohn commented 7 years ago

We finally figured out the cause of the kube-dns crashlooping case, thanks to @dnardo and @freehan!

It turns out that after resetting the instance, there is a chance that the iptables rule that masquerades outbound traffic is gone, specifically this rule:

-A POSTROUTING ! -d 10.0.0.0/8 -m comment --comment "kubenet: SNAT for outbound traffic from cluster" -m addrtype ! --dst-type LOCAL -j MASQUERADE

Without this masquerade rule, packets sent from the kube-dns pod to the apiserver have the pod IP as the source IP instead of the node IP. As a consequence, the packets get dropped by the GCE network before reaching the master instance.

Meanwhile, if we create a pod on that same node, it triggers kubenet to generate the masquerade rule again, which indirectly lets kube-dns recover.

I just reproduced this on my cluster to confirm.

MrHohn commented 7 years ago

@thockin If we have https://github.com/kubernetes-incubator/ip-masq-agent running by default, would it fix exactly this case?

dchen1107 commented 7 years ago

Re: https://github.com/kubernetes/kubernetes/issues/46651#issuecomment-305666083

Wow, we found two release-blocker issues here and one test-related problem. Thanks @MrHohn, @freehan, and @dnardo for the network-related production issues, and @verult for the storage-related test issue.

Before we introduce another agent to resolve this issue, can we

1. Root-cause why that iptables rule was dropped
2. Instead of crashing kube-dns, can we log some reason to help debugging in the future
3. Add the regression test(s) to validate the expected iptables rules in various scenarios, including the reset / reboot case

Thanks!

dcbw commented 7 years ago

@freehan see https://github.com/kubernetes/kubernetes/pull/46823 for an update to the GetNetNS issue.

freehan commented 7 years ago

@dchen1107

Before we introduce another agent to resolve this issue

The ip-masq-agent is scheduled to be added in 1.7 anyway. It is currently enabled in the default config, but not in the e2e test config.

Root-cause why that iptables rule was dropped

iptables rules are not persistent across reboots, and the SNAT rule for outbound traffic is added by kubenet, which is triggered on each pod setup and teardown.

Instead of crashing kube-dns, can we log some reason to help debugging in the future

kube-dns does not have much visibility into iptables rules. It can only report the symptom.

Add the regression test(s) to validate the expected iptables rules in various scenarios, including the reset / reboot case

Yes. There should be a network connectivity test for the after-reboot case.
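
A minimal sketch of such a check, run from inside a pod on the rebooted node (the 10.0.0.1:443 kubernetes service VIP is taken from the logs above; this is not an actual e2e test):

package main

import (
	"fmt"
	"net"
	"os"
	"time"
)

// With the SNAT rule missing, this dial times out the same way the kube-dns
// list calls did.
func main() {
	const addr = "10.0.0.1:443"
	conn, err := net.DialTimeout("tcp", addr, 10*time.Second)
	if err != nil {
		fmt.Fprintf(os.Stderr, "no connectivity to %s after reboot: %v\n", addr, err)
		os.Exit(1)
	}
	conn.Close()
	fmt.Printf("connectivity to %s OK\n", addr)
}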

dchen1107 commented 7 years ago

The ip-masq-agent is scheduled to be added in 1.7 anyway. It is currently enabled in the default config, but not in the e2e test config.

This is weird to me. Why enable it by default in 1.7, but not test it in e2e even now? Are we talking about adding a new agent to every node by default? For OSS Kubernetes, or only GKE? Do we understand the impact on users yet? What is the extra resource overhead? Does the agent run as a native daemon or as a DaemonSet? If that agent fails, who should recover it?

kube-dns does not have much visibility into iptables rules. It can only report the symptom.

Reporting the symptom is all I was asking for, not just simply crashlooping.

bowei commented 7 years ago

The issue with kube-dns is that it is failing the health check, which tests the health of the entire DNS pipeline. This means that when there is a network disruption, the DNS pod will be killed and restarted. I don't see that we have much of a choice there: either we let the health check kill and restart the DNS pod when it is not working (has no network connectivity to the API server), or we let it stay alive but not be able to serve anything useful. Thoughts?

freehan commented 7 years ago

Okay. After a debugging session with @MrHohn and @dnardo, we finally figured out why the SNAT rule disappeared. The outbound SNAT rule is ensured during kubelet initialization and on every pod setup/teardown.

After the reboot, kubelet restarts and tries to recreate all the pods, during which the SNAT rule gets added. Meanwhile, salt reruns the bootstrap scripts after the reboot, and the bootstrap script flushes the nat table. Since kubelet and the bootstrap scripts run in parallel, the flush may take place after kubelet has finished syncing all the pods and adding the SNAT rule, leaving the pods on that node unable to reach anything outside the cluster.

ip-masq-agent can mitigate this by reasserting the SNAT rule periodically, but the fundamental problem is that kubelet should run only after the bootstrap script completes. For COS, this can be done by adding a dependency in the systemd config. Not sure how to achieve it on Debian.

Regarding the ip-masq-agent, I think @mtaufen has more context.
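
For illustration, a sketch of the periodic reassertion idea (this is not the actual ip-masq-agent code; the 30s interval and the exact iptables invocation are assumptions, mirroring the rule quoted earlier in this thread):

package main

import (
	"log"
	"os/exec"
	"time"
)

func main() {
	rule := []string{
		"!", "-d", "10.0.0.0/8",
		"-m", "comment", "--comment", "kubenet: SNAT for outbound traffic from cluster",
		"-m", "addrtype", "!", "--dst-type", "LOCAL",
		"-j", "MASQUERADE",
	}
	for range time.Tick(30 * time.Second) {
		// iptables -C exits non-zero when the rule is absent.
		check := append([]string{"-t", "nat", "-C", "POSTROUTING"}, rule...)
		if exec.Command("iptables", check...).Run() == nil {
			continue
		}
		log.Print("SNAT rule missing, re-adding it")
		add := append([]string{"-t", "nat", "-A", "POSTROUTING"}, rule...)
		if err := exec.Command("iptables", add...).Run(); err != nil {
			log.Printf("failed to re-add SNAT rule: %v", err)
		}
	}
}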

dnardo commented 7 years ago

I still think there is a race here that ip-masq-agent may not fix. Consider a node that is rebooted and begins bringing up pods; if that somehow happens before the init script that flushes iptables, then we will be left without the SNAT entries for those pods. While I think ip-masq-agent helps in this case, I think we should make the init scripts more consistent: kubelet should not start until the rest of the init scripts have run. For Debian, I think we just need to make sure the kubelet init starts last by renaming the init scripts in /etc/rc2.d.

mtaufen commented 7 years ago

It's impossible to know all the historical decisions behind why the startup scripts are the way they are, but to me this sounds logical: Startup scripts should prep the node, and only then should the Kubelet and other K8s components start running.

Anything else smells accidental to me.

The ip-masq-agent might fight back against the startup scripts, but this is really not a solution to the fundamental problem, which is incorrect dependency ordering. I wouldn't be surprised if we find additional issues caused by the incorrect dependency ordering.

dchen1107 commented 7 years ago

I am closing this one since it hasn't failed for a couple of days after the kube-dns fix.

caesarxuchao commented 7 years ago

The flake is occurring again in recent serial tests: https://k8s-testgrid.appspot.com/release-1.7-blocking#gce-serial-release-1-7&width=20 https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-serial-release-1-7/94

caesarxuchao commented 7 years ago

It's still the dns pod not being ready. The cause might be different, but the symptom is the same as what Zihong observed earlier in https://github.com/kubernetes/kubernetes/issues/46651#issuecomment-305050774: https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-serial-release-1-7/93?log#log

I0626 05:24:23.585] I0626 05:23:38.919030       1 dns.go:174] Waiting for services and endpoints to be initialized from apiserver...
I0626 05:24:23.586] E0626 05:23:38.919808       1 reflector.go:199] k8s.io/dns/vendor/k8s.io/client-go/tools/cache/reflector.go:94: Failed to list *v1.Service: Get https://10.0.0.1:443/api/v1/services?resourceVersion=0: dial tcp 10.0.0.1:443: i/o timeout
I0626 05:24:23.586] E0626 05:23:38.919818       1 reflector.go:199] k8s.io/dns/vendor/k8s.io/client-go/tools/cache/reflector.go:94: Failed to list *v1.Endpoints: Get https://10.0.0.1:443/api/v1/endpoints?resourceVersion=0: dial tcp 10.0.0.1:443: i/o timeout

freehan commented 7 years ago

I thought the decision was to remove the old kubelet from the CVM image. That should solve a bunch of problems, including this one.

freehan commented 7 years ago

@zmerlynn Can you comment on the status?

dchen1107 commented 7 years ago

@zmerlynn is building a CVM image with the old kubelet removed, and will update the test once that is done. On the other hand, none of the serial tests on the GCI image show this flakiness.

Closing the issue.

ericchiang commented 7 years ago

FYI, this test has continued to be flaky on gce-serial. We need to investigate to determine whether it's the same root cause, but I would appreciate it if someone with more background could look at this: https://k8s-testgrid.appspot.com/release-master-blocking#gce-serial