Closed: dchen1107 closed this issue 7 years ago.
There are many other failures in https://k8s-gubernator.appspot.com/builds/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-serial, but this one failed most frequently. Also other failures might share the same root cause as this.
cc/ @caesarxuchao, the branch manager of 1.7, and @ericchiang, the test lead of 1.7.
There are multiple causes of the failures. I am looking through them one by one.
https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-serial/1142/
The test before the restart test is a PD test (should be able to detach from a node which was deleted) that deletes a GCE VM instance. The PD test does not wait for the node to come up and verify that it becomes ready. This causes the next test (which happens to be the node restart test) to fail because the new node couldn't finish registering itself in 20s.
/cc @kubernetes/sig-storage-misc
After the node restarts, the kube-dns pod (kube-dns-2836479273-r292n) could not become ready. Looking at the kubelet.log in question, the containers of the kube-dns pod (kubedns and dnsmasq) crashed repeatedly.
docker.log also shows that the containers died without external stop/kill commands.
time="2017-05-30T07:11:09Z" level=debug msg="containerd: process exited" id=9d92b38d0eb0b963b42adc1d67332e6966b6e3816746db194a44c4c5f9ce705f pid=init status=255 systemPid=5687
I didn't see anything abnormal in the kernel log and all other pods were ready. There is no archived kube-dns log to help debugging. @kubernetes/sig-network-misc may have more knowledge on what might've caused this.
Edit: similar failures: https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-serial/1132/ https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-serial/1134/ https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-serial/1138/ https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-serial/1139/ https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-serial/1140/
re: https://github.com/kubernetes/kubernetes/issues/46651#issuecomment-305042721
xref: https://github.com/kubernetes/kubernetes/issues/45991 cc/ @MrHohn, who was looking into not-ready kube-dns for the last alpha release.
@verult has volunteered to help fix the should be able to detach from a node which was deleted test so that it will do proper cleanup.
I didn't see anything abnormal in the kernel log and all other pods were ready. There is no archived kube-dns log to help debugging.
Indeed it is hard to tell what happened without kube-dns logs. I'm trying to reproduce on my cluster.
CC @kubernetes/sig-storage-test-failures
Reproduced on my cluster, related logs:
I0531 00:45:38.688241 1 dns.go:174] Waiting for services and endpoints to be initialized from apiserver...
E0531 00:45:38.688626 1 reflector.go:199] k8s.io/dns/vendor/k8s.io/client-go/tools/cache/reflector.go:94: Failed to list *v1.Endpoints: Get https://10.0.0.1:443/api/v1/endpoints?resourceVersion=0: dial tcp 10.0.0.1:443: i/o timeout
E0531 00:45:38.688911 1 reflector.go:199] k8s.io/dns/vendor/k8s.io/client-go/tools/cache/reflector.go:94: Failed to list *v1.Service: Get https://10.0.0.1:443/api/v1/services?resourceVersion=0: dial tcp 10.0.0.1:443: i/o timeout
I0531 00:45:39.188344 1 dns.go:174] Waiting for services and endpoints to be initialized from apiserver...
I0531 00:45:39.688270 1 dns.go:174] Waiting for services and endpoints to be initialized from apiserver...
I0531 00:45:40.188362 1 dns.go:174] Waiting for services and endpoints to be initialized from apiserver...
...
F0531 00:46:08.688268 1 dns.go:168] Timeout waiting for initialization
Seems like one of the kube-dns pods failed to contact the apiserver because the underlying network was broken after the node restart; looking into it.
cc @bowei
Indeed it is hard to tell what happened without kube-dns logs. I'm trying to reproduce on my cluster.
FYI, I uploaded a PR to dump the logs of the unready pods: #46665
I believe there are some issues with kubenet. When I exec'd into the broken dns pod, wget to the apiserver on :443 (with both the cluster IP and the public IP) didn't work (traffic didn't reach the master node). Checked the kubelet logs and saw the entry below spread all over the place:
...
W0531 00:39:36.185756 2516 docker_sandbox.go:285] NetworkPlugin kubenet failed on the status hook for pod "kube-dns-2836479273-b6txb_kube-system": Cannot find the network namespace, skipping pod network status for container {"docker" "7dfd8c51851961628dbf0a2bf1a923311e89b943b72ca1707f29c1b9bf02f61d"}
...
# Lots of similar entries.
...
W0531 03:13:28.485903 2516 docker_sandbox.go:285] NetworkPlugin kubenet failed on the status hook for pod "kube-dns-2836479273-b6txb_kube-system": Cannot find the network namespace, skipping pod network status for container {"docker" "7dfd8c51851961628dbf0a2bf1a923311e89b943b72ca1707f29c1b9bf02f61d"}
That dns pod kept crashlooping for a couple of hours until I created another pod on the same node, which for some reason triggered the dns pod to recover.
kube-dns-2836479273-4dj49 3/3 Running 5 3h
kube-dns-2836479273-b6txb 3/3 Running 70 3h
I've never touched any kubenet stuff before; it would be great to have @freehan or @dnardo take a look...
Okay, I think I figured out what is going on. It is a bug introduced by https://github.com/kubernetes/kubernetes/pull/43879
Here is what I think happened:
Node restarts. All containers stopped. API server still has those pods as Running.
Node comes up. kubelet PLEG.updateCache -> kubemanager.GetPodStatus -> CRI PodSandboxStatus -> GetPodNetworkStatus.
PLEG requires GetPodStatus to succeed in order to update pod status. Otherwise, the pod will be stuck in the same status. After the change in #43879, GetNetNS returns an error for a terminated sandbox container, and dockershim selectively swallows that error in some cases, but not in a case like this.
To unblock the release, reverting #43879 will help.
cc @dcbw
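A drastically simplified model of that failure mode, using illustrative names rather than the real kubelet types: when the network-status lookup errors out for a dead sandbox, GetPodStatus fails and the cached pod status is never refreshed, so the pod looks stuck.

```go
package main

import (
	"errors"
	"fmt"
)

// Toy model of the PLEG cache-update path described above.
// Names are hypothetical stand-ins, not actual kubelet code.

type podStatus struct{ phase string }

var statusCache = map[string]podStatus{}

// getPodNetworkStatus stands in for the CRI PodSandboxStatus ->
// GetPodNetworkStatus chain; after #43879 it errors for terminated sandboxes.
func getPodNetworkStatus(sandboxTerminated bool) error {
	if sandboxTerminated {
		return errors.New("cannot find the network namespace")
	}
	return nil
}

// updateCache gives up on any error, so the stale entry survives.
func updateCache(pod string, sandboxTerminated bool) error {
	if err := getPodNetworkStatus(sandboxTerminated); err != nil {
		return err // PLEG cannot refresh; cache keeps the pre-reboot status
	}
	statusCache[pod] = podStatus{phase: "Running"}
	return nil
}

func main() {
	statusCache["kube-dns"] = podStatus{phase: "Running"} // stale pre-reboot state
	err := updateCache("kube-dns", true)                  // sandbox died in the reboot
	fmt.Println(err != nil, statusCache["kube-dns"].phase)
}
```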
@yujuhong Seems like there is already logic in place in the PD test to check all nodes are back up. Running the restart test alone also gives me the error in certain runs. 20s might just not be long enough to wait for nodes to be ready?
Can the PD test wait until the nodes become ready again?
There's also logic inside AfterEach() in framework.go making sure the nodes are ready, although it includes unschedulable nodes, whereas the check in restart.go excludes them. https://github.com/kubernetes/kubernetes/blob/master/test/e2e/framework/util.go#L4087
Looking into why this is the case.
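The discrepancy can be shown with a toy readiness count (the node type and readyCount function are hypothetical stand-ins for the real framework code): the framework.go check counts unschedulable nodes as readiness candidates, while restart.go skips them, so the two checks can disagree over the same node list.

```go
package main

import "fmt"

// node is a minimal stand-in for a Kubernetes Node's relevant fields.
type node struct {
	ready         bool
	unschedulable bool
}

// readyCount counts ready nodes, optionally excluding unschedulable ones,
// modeling the difference between the two checks described above.
func readyCount(nodes []node, excludeUnschedulable bool) int {
	n := 0
	for _, nd := range nodes {
		if excludeUnschedulable && nd.unschedulable {
			continue
		}
		if nd.ready {
			n++
		}
	}
	return n
}

func main() {
	nodes := []node{{ready: true}, {ready: true, unschedulable: true}}
	fmt.Println(readyCount(nodes, false)) // framework.go-style check
	fmt.Println(readyCount(nodes, true))  // restart.go-style check
}
```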
@freehan fun, looking into this...
@freehan so, the log messages from https://github.com/kubernetes/kubernetes/issues/46651#issuecomment-305077524 will only be printed before #43879 was merged. So for that reason, this flake occurred before, at least if we think @MrHohn's problem is the same.
so, the log messages from #46651 (comment) will only be printed before #43879 was merged.
Just checked, that's true. I reproduced the issue with commit e0a6cde6f43e5c628395f8f9b5589c5d6298ec8e, #43879 is not included.
Does it mean reverting #43879 is not going to fix the issue? Do we need to wait for another fix?
To clarify, the symptom I observed in https://github.com/kubernetes/kubernetes/issues/46651#issuecomment-305339470 was on HEAD at the time, which includes #43879. I believe #43879 makes things worse.
For https://github.com/kubernetes/kubernetes/issues/46651#issuecomment-305077524, @MrHohn was running a cluster slightly older than HEAD, which did not include #43879.
So reverting #43879 solved half of the problem... There may be another bug.
Finally we figured out the cause for the kube-dns crashlooping case, thanks @dnardo and @freehan!
It turned out that after resetting the instance, there is a chance that the iptables rule that masquerades outbound traffic is gone, specifically this rule:
-A POSTROUTING ! -d 10.0.0.0/8 -m comment --comment "kubenet: SNAT for outbound traffic from cluster" -m addrtype ! --dst-type LOCAL -j MASQUERADE
Without this masquerade rule, packets sent from the kube-dns pod to the apiserver would have the pod IP as the source IP instead of the node IP. The consequence is that packets get dropped by the GCE network before reaching the master instance.
Meanwhile, if we create a pod on that same node, it triggers kubenet to generate that masquerade rule again, which indirectly makes kube-dns recover.
I just reproduced on my cluster to confirm.
@thockin If we have https://github.com/kubernetes-incubator/ip-masq-agent running by default, will it fix exactly this case?
Re: https://github.com/kubernetes/kubernetes/issues/46651#issuecomment-305666083
Wow, we found two release-blocker issues here and one test-related problem. Thanks @MrHohn, @freehan and @dnardo for the network-related production issues, and @verult for the storage-related test issue.
Before we introduce another agent to resolve this issue, can we:
- Root cause why that iptable rule was dropped
- Instead of crashing kube-dns, can we log some reason to help debugging in the future
- Add the regression test(s) to validate the expected iptable rules in various scenario including reset / reboot case
Thanks!
@freehan see https://github.com/kubernetes/kubernetes/pull/46823 for an update to the GetNetNS issue.
@dchen1107
Before we introduce another agent to resolve this issue
The ip-masq-agent is scheduled to be added in 1.7 anyway. It is currently not enabled in e2e test config but in the default config.
Root cause why that iptable rule was dropped
iptables rules are not persistent across reboots. And the SNAT rule for outbound traffic is added by kubenet, which is triggered on each pod setup and teardown.
Instead of crashing kube-dns, can we log some reason to help debugging in the future
kube-dns does not have much visibility into iptables rules. It can only report the symptom.
Add the regression test(s) to validate the expected iptable rules in various scenario including reset / reboot case
Yes. There should be a network connectivity test after the reboot case.
The ip-masq-agent is scheduled to be added in 1.7 anyway. It is currently not enabled in e2e test config but in the default config.
This is weird to me. Why enable it in 1.7 by default, but not test it in e2e even now? Are we talking about adding a new agent to every node by default? For OSS K8s? Or only GKE? Do we understand the impact on users yet? What is the extra resource overhead? Is the agent running as a native daemon or a DaemonSet? If that agent fails, who should recover it?
kube-dns does not have much visibility into iptables rules. It can only report the symptom.
Reporting the symptom is all I was asking for, not just simply crashlooping.
The issue with kube-dns is that it is failing the health check, which tests the health of the entire DNS pipeline. This means that when there is a network disruption, the DNS pod will be killed and restarted. I don't see that we have much of a choice there -- either we kill and restart the DNS pod even though it is not at fault (it has no network connectivity to the API server), or we have it stay alive but not be able to serve anything useful. Thoughts?
Okay. After a debugging session with @MrHohn and @dnardo, we finally figured out why the SNAT rule disappeared. The outbound SNAT rule is ensured during kubelet initialization and on every pod setup/teardown.
After a reboot, kubelet restarts and tries to recreate all the pods, during which the SNAT rule gets added. Meanwhile, salt reruns the bootstrap scripts after the reboot, and the bootstrap script flushes the nat table here. Since kubelet and the bootstrap scripts run in parallel, the flush may take place after kubelet finishes syncing all the pods and adding the SNAT rule, leaving the pods on the node unable to reach outside the cluster.
ip-masq-agent can mitigate this by reasserting the SNAT rule periodically. But the fundamental problem is that kubelet should run only after the bootstrap script completes. For COS, this can be done by adding a dependency in the systemd config. Not sure how to achieve it on Debian.
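For the systemd case, the ordering constraint could be expressed as a drop-in like the sketch below. This is a hedged example: the bootstrap unit name (`kube-node-configuration.service`) is hypothetical and varies by image.

```ini
# /etc/systemd/system/kubelet.service.d/10-after-bootstrap.conf
# Hypothetical drop-in: start kubelet only after the node bootstrap
# (which flushes and re-populates the nat table) has completed.
[Unit]
After=kube-node-configuration.service
Requires=kube-node-configuration.service
```

With `After=` plus `Requires=`, systemd both orders kubelet behind the bootstrap unit and refuses to start it if bootstrap fails, which removes the race rather than papering over it.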
Regarding the ip-masq-agent
I think @mtaufen has more context.
I still think there is a race here that ip-masq-agent may not fix. Consider a node that reboots and begins bringing up pods; if that happens somehow before the init script that flushes iptables, then we will be left without the NAT entries for that pod. While I think that ip-masq-agent helps in this case, I think we should make the init scripts more consistent. Kubelet should not start until the rest of the init scripts have run. For Debian, I think we just need to make sure the kubelet init starts at the end by renaming the init scripts in /etc/rc2.d.
It's impossible to know all the historical decisions behind why the startup scripts are the way they are, but to me this sounds logical: Startup scripts should prep the node, and only then should the Kubelet and other K8s components start running.
Anything else smells accidental to me.
The ip-masq-agent might fight back against the startup scripts, but this is really not a solution to the fundamental problem, which is incorrect dependency ordering. I wouldn't be surprised if we find additional issues caused by the incorrect dependency ordering.
-- Michael Taufen MTV SWE
I am closing this one since it hasn't failed for a couple of days after the kube-dns fix.
It's still the dns pod not being ready. The cause might be different, but the symptom is the same as what Zihong observed earlier in https://github.com/kubernetes/kubernetes/issues/46651#issuecomment-305050774
https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-serial-release-1-7/93?log#log
I0626 05:24:23.585] I0626 05:23:38.919030 1 dns.go:174] Waiting for services and endpoints to be initialized from apiserver...
I0626 05:24:23.586] E0626 05:23:38.919808 1 reflector.go:199] k8s.io/dns/vendor/k8s.io/client-go/tools/cache/reflector.go:94: Failed to list *v1.Service: Get https://10.0.0.1:443/api/v1/services?resourceVersion=0: dial tcp 10.0.0.1:443: i/o timeout
I0626 05:24:23.586] E0626 05:23:38.919818 1 reflector.go:199] k8s.io/dns/vendor/k8s.io/client-go/tools/cache/reflector.go:94: Failed to list *v1.Endpoints: Get https://10.0.0.1:443/api/v1/endpoints?resourceVersion=0: dial tcp 10.0.0.1:443: i/o timeout
I thought the decision was to remove the old kubelet from the CVM image. That should solve a bunch of problems, including this one.
@zmerlynn Can you comment on the status?
@zmerlynn is building a CVM image with the old kubelet removed, and will update the test once that is done. On the other hand, the serial tests on the GCI image don't have this flakiness.
Closing the issue.
FYI, this test has continued to be flaky on gce-serial. We need to investigate to determine whether it's the same root cause, but I would appreciate it if someone with more background could look at this: https://k8s-testgrid.appspot.com/release-master-blocking#gce-serial
e2e test -- [k8s.io] Restart [Disruptive] should restart all nodes and ensure all nodes and pods recover failed a lot: https://k8s-gubernator.appspot.com/builds/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-serial. For example: https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-serial/1142/
We should triage this failure before cutting beta.0 tomorrow.
cc/ @kubernetes/kubernetes-release-managers @kubernetes/sig-node-bugs