kubernetes / kubernetes

Production-Grade Container Scheduling and Management
https://kubernetes.io
Apache License 2.0

IPv6: E2E tests failing with DNS failing liveness/readiness probes #63922

Closed: pmichali closed this issue 6 years ago

pmichali commented 6 years ago

/kind bug /area ipv6 /sig network

What happened: Running a subset of E2E tests focused on IPv6. During most runs, kube-dns (the same happens with coredns) restarts: sometimes only a few times, sometimes 20+ times, and occasionally without recovering.

On some runs, the restarts appear to cause test failures, as the pre-test step fails because the kube-dns pod is not up.

What you expected to happen: Tests to run without kube-dns restarting, so that the tests which currently fail intermittently (because kube-dns is not up) would pass.

How to reproduce it (as minimally and precisely as possible): Use a cluster created with kubeadm-dind-cluster (DinD) in IPv6-only mode (IPv6 for all addresses). This uses kubeadm to create the cluster in three containers on a host (a bare-metal host in my case). Run this subset of tests:

go run hack/e2e.go -- --provider=local --v 4 --test --test_args="--ginkgo.focus=Networking|Services --ginkgo.skip=IPv4|DNS|Networking-Performance|Federation|functioning\sNodePort|preserve\ssource\spod --num-nodes=2"

Anything else we need to know?: It seems that, when we run the tests on a bare-metal cluster of three nodes, not using DinD, the kube-dns pod does not restart.
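
A quick way to watch the restart count climb during a run (a sketch, assuming the default kube-dns labels):

kubectl get pods -n kube-system -l k8s-app=kube-dns -o wide -w

The RESTARTS column for the kube-dns pod increments each time kubelet kills it.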

Environment:

pmichali commented 6 years ago

In the kubelet logs I had seen some errors. One was conntrack failing to delete NAT connections at test teardown; it turned out conntrack 1.4.5 is needed. I updated the repo with a source build of conntrack 1.4.5 and those errors are gone now, but this has no effect on the issue. Issue #63208 tracks that occurrence.

I also found CNI failures on pod deletion, where the iptables command was being used with IPv6 addresses. After updating the repo to CNI 0.7.1 I no longer see these failures (it correctly uses ip6tables), but this also has no effect on the DNS restarts.

I see in the logs that kube-dns stops responding to both the readiness (port 8081) and liveness (port 10054) probes and is then killed and restarted. Sometimes this happens while tests are running, and the test fails its pre-test check with a timeout because kube-dns is down.

I have looked at all of the dnsmasq, sidecar, and kube-dns logs and see no errors in them. A coworker was SSHed into the kube-dns pod when the failure occurred, so it seems like the pod itself is working, but communication on these ports is somehow failing.

pmichali commented 6 years ago

I see in the kube-dns sidecar log a mention of Prometheus and port 10054. Can anyone elaborate on that relationship?

I0515 12:44:53.359539 1 server.go:45] Starting server (options {DnsMasqPort:53 DnsMasqAddr:127.0.0.1 DnsMasqPollIntervalMs:5000 Probes:[{Label:kubedns Server:[::1]:10053 Name:kubernetes.default.svc.cluster.local. Interval:5s Type:33} {Label:dnsmasq Server:[::1]:53 Name:kubernetes.default.svc.cluster.local. Interval:5s Type:33}] PrometheusAddr:0.0.0.0 PrometheusPort:10054 PrometheusPath:/metrics PrometheusNamespace:kubedns})

I'm curious about that IP address (and that it is a v4 address). I'd like to better understand how the probing works, and whether there is a way I can manually probe the pod from the command line. Can anyone advise?
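
For reference, the probe endpoints can be hit by hand with something like this (a sketch; the pod IP is illustrative, and the paths assume the standard kube-dns manifest):

# Readiness probe served on port 8081 of the kubedns container
curl -g "http://[fd00:40::4:0:0:105]:8081/readiness"
# Liveness/health endpoint served on port 10054
curl -g "http://[fd00:40::4:0:0:105]:10054/healthcheck/kubedns"

(curl -g disables URL globbing so the bracketed IPv6 literal is not mangled.)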

pmichali commented 6 years ago

I have a lead on what is failing here, just need to understand what causes it and why, so that a fix can be crafted.

On one run, I found that kube-dns was stuck in a crash loop and that pings to it were failing. In particular, watching the bridge for the pod network, I saw the ping request reach the kube-dns pod and a response returned by the pod, but that response never left the minion host for the master node, where I was pinging from.

Later, kube-dns came up and pings worked. During the investigation I checked the ip6tables rules on the minion node: in the working case there were rules for the kube-dns pod, and during the failure those rules were missing.
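
A diff like the one below can be produced by snapshotting the NAT rules in both states and comparing (a sketch; file names are illustrative):

ip6tables -t nat -S > nat-failing.txt    # taken while kube-dns is failing
ip6tables -t nat -S > nat-working.txt    # taken after it recovers
diff nat-failing.txt nat-working.txt

The ">" lines are the rules present only in the working dump: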

347a352,355
> -A KUBE-SEP-JMLTSOAZKUFK2V22 -s fd00:40::4:0:0:105/128 -m comment --comment "kube-system/kube-dns:dns-tcp" -j KUBE-MARK-MASQ
> -A KUBE-SEP-JMLTSOAZKUFK2V22 -p tcp -m comment --comment "kube-system/kube-dns:dns-tcp" -m tcp -j DNAT --to-destination [fd00:40::4:0:0:105]:53
> -A KUBE-SEP-MHWQADGXDPS3HPY7 -s fd00:40::4:0:0:105/128 -m comment --comment "kube-system/kube-dns:dns" -j KUBE-MARK-MASQ
> -A KUBE-SEP-MHWQADGXDPS3HPY7 -p udp -m comment --comment "kube-system/kube-dns:dns" -m udp -j DNAT --to-destination [fd00:40::4:0:0:105]:53
350a359,360
> -A KUBE-SERVICES -d fd00:30::a/128 -p udp -m comment --comment "kube-system/kube-dns:dns cluster IP" -m udp --dport 53 -j KUBE-SVC-TCOU7JCQXEZGVUNU
> -A KUBE-SERVICES -d fd00:30::a/128 -p tcp -m comment --comment "kube-system/kube-dns:dns-tcp cluster IP" -m tcp --dport 53 -j KUBE-SVC-ERIFXISQEP7F7OF4
351a362
> -A KUBE-SVC-ERIFXISQEP7F7OF4 -m comment --comment "kube-system/kube-dns:dns-tcp" -j KUBE-SEP-JMLTSOAZKUFK2V22
353a365
> -A KUBE-SVC-TCOU7JCQXEZGVUNU -m comment --comment "kube-system/kube-dns:dns" -j KUBE-SEP-MHWQADGXDPS3HPY7

The questions are:

I'll keep digging, but this seems to be the cause of the restarts. If anyone has any thoughts, please let me know.

BTW: I finally figured out my confusion on port 10054. It looks like this port is used both for liveness tests (/healthcheck/kubedns), and for metrics (/metrics) by Prometheus. Unrelated to the problem at hand.

pmichali commented 6 years ago

It looks like these IPTables rules are for the DNS port (53), and are related to the service. Likely a side effect of the kube-dns pod being restarted by kubelet and not a cause of the connectivity problem.

So far, the one known fact is that, when we see the kube-dns restarts, we cannot ping the kube-dns pod from the master node. I see ping packets reach the pod, and I see a ping response on the pod network bridge, but I do not see the response packet on the minion node's main interface.

I don't know why the packets are not going back out the main interface. The IPv6 routes look fine (and are unchanged), and we do see packets going between the minion's management interface and the master's management interface.
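
To narrow down where the reply disappears, captures on both sides can help (a sketch; br0 is the bridge name from the CNI config below, and eth0 is an assumption for the node's main interface):

# On the minion: watch ICMPv6 echo traffic on the pod bridge
tcpdump -ni br0 icmp6
# ...and on the main interface, to see whether the echo reply ever leaves the node
tcpdump -ni eth0 icmp6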

For CNI, I'm using 0.7.1, and the bridge plugin which has these settings:

{
    "cniVersion": "0.3.0",
    "name": "bmbridge",
    "type": "bridge",
    "bridge": "br0",
    "isDefaultGateway": true,
    "ipMasq": true,
    "hairpinMode": true,
    "ipam": {
        "type": "host-local",
        "ranges": [
          [
            {
              "subnet": "fd00:40:0:0:4::/80",
              "gateway": "fd00:40:0:0:4::1"
            }
          ]
        ]
    }
}

One unusual thing is that, over multiple test runs, the ip6tables NAT rules show many duplicate entries (here I grepped for two of them, but there are hundreds and the number keeps growing):

-N CNI-00e69fa9d35f761e59695015
-A POSTROUTING -s fd00:40:0:0:4::/80 -m comment --comment "name: \"bmbridge\" id: \"0f0834bfdf1cd266e5fe43b885b70f4bd6e6af31c280cc099718045a7d8dae11\"" -j CNI-00e69fa9d35f761e59695015
-A CNI-00e69fa9d35f761e59695015 -d fd00:40:0:0:4::/80 -m comment --comment "name: \"bmbridge\" id: \"0f0834bfdf1cd266e5fe43b885b70f4bd6e6af31c280cc099718045a7d8dae11\"" -j ACCEPT
-A CNI-00e69fa9d35f761e59695015 ! -d ff00::/8 -m comment --comment "name: \"bmbridge\" id: \"0f0834bfdf1cd266e5fe43b885b70f4bd6e6af31c280cc099718045a7d8dae11\"" -j MASQUERADE

-N CNI-0184214f5ed59eb78be90c7d
-A POSTROUTING -s fd00:40:0:0:4::/80 -m comment --comment "name: \"bmbridge\" id: \"ae86be65aeda290d526f2bf59cd3ce4c5e94b8a28204f6485800352df06b971a\"" -j CNI-0184214f5ed59eb78be90c7d
-A CNI-0184214f5ed59eb78be90c7d -d fd00:40:0:0:4::/80 -m comment --comment "name: \"bmbridge\" id: \"ae86be65aeda290d526f2bf59cd3ce4c5e94b8a28204f6485800352df06b971a\"" -j ACCEPT
-A CNI-0184214f5ed59eb78be90c7d ! -d ff00::/8 -m comment --comment "name: \"bmbridge\" id: \"ae86be65aeda290d526f2bf59cd3ce4c5e94b8a28204f6485800352df06b971a\"" -j MASQUERADE

Not sure if this is a red herring, but it seems like things are not getting cleaned up correctly (by the tests?).
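
A quick way to watch the leftover chains accumulate across runs (a sketch):

# Count CNI-owned chains and bmbridge POSTROUTING rules in the IPv6 NAT table
ip6tables -t nat -S | grep -c '^-N CNI-'
ip6tables -t nat -S POSTROUTING | grep -c bmbridge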

pmichali commented 6 years ago

I found that, if kube-dns is running on the master node (of a three-node cluster) instead of on a minion, then no restarts are seen during E2E test runs. I'm trying to understand what breaks communication between the kube-dns pod on the minion and the master node during the E2E tests.

pmichali commented 6 years ago

I may have found a solution for the problem... I changed the MTU on the management/pod interfaces, and in the CNI config, to 9000 (and set the ToR interfaces to 9216). With that, I have been able to run the E2E tests twice without kube-dns restarting (or transient test failures). I'm going to run the tests a few more times to see if this is consistent.

My guess is that the packets are large, and the retransmissions we are seeing cause the probes to time out.
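
For reference, the change amounts to something like this on each node (a sketch; eth0 is an assumption for the interface carrying pod/management traffic, and the bridge plugin's mtu option covers the pod side):

# Raise the node interface MTU
ip link set dev eth0 mtu 9000
# And add the corresponding key to the bridge plugin config shown earlier:
#     "mtu": 9000,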

There still is the concern over the duplicate CNI entries, as it seems like, when kube-dns is killed and restarted, old entries are not removed.

pmichali commented 6 years ago

I did six runs of the E2E tests (32 tests each) with the MTU set to 9000 and do not see DNS restarts. Monitoring packets through a test run, about 1.5% were of size 4128, 1% of size ~1000, 16% of size ~500, and 69% of size 32.

pmichali commented 6 years ago

Note: if kube-dns is running on the master node, where the tests were invoked from, we do not see any timeouts. When kube-dns is running on a minion, restarts were seen frequently, sometimes causing test failures (because the pod was not up).

pmichali commented 6 years ago

More triaging, but not getting closer to a root cause yet...

On a bare-metal, three-node IPv6 Kubernetes cluster, I can set the MTU on each node's interface that is used for the pod and management network. When I do that, I do NOT see any kube-dns restarts during the E2E test runs.

On k-d-c (kubeadm-dind-cluster, i.e. Docker-in-Docker), which is going to be used for CI of the IPv6 E2E tests, I tried the same MTU change. I got five E2E test runs without kube-dns restarting, but on the sixth run I started seeing kube-dns restarts (and tests intermittently failing because kube-dns wasn't ready).

pmichali commented 6 years ago

With help from a co-worker, I think we found the root cause of why kube-dns is restarting. As mentioned, the TCP readiness probes suddenly start retransmitting, until the pod is finally restarted by kubelet for not responding. We found that, intermittently, the MAC address of the bridge interface for the pod network changes.

The theory is that, as test pods are created and destroyed during the tests, a pod is eventually created with a lower MAC than the one the bridge had from the start (originally the MAC of the kube-dns pod, which is the first pod created). The bridge then changes its MAC to match that lower address, and changing the MAC underneath the TCP communication between the kube-dns pod and kubelet on that node causes the failures.

We have seen that OpenStack had the same issue; they worked around it by explicitly setting the MAC on the bridge (so it would not later be changed by a dynamically added interface), using the MAC of the physical interface enslaved to their bridge.

We don't have such an enslaved interface for our bridge, but I can create a dummy veth pair and use its MAC to program the bridge. I'll try that next week. For now, though, hard-coding the MAC on the bridge has prevented this problem on three test suite runs so far...
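
The workaround being tried looks roughly like this (a sketch; interface names are illustrative):

# Create a dummy veth pair purely to obtain a stable MAC...
ip link add veth-macsrc type veth peer name veth-macpeer
MAC=$(cat /sys/class/net/veth-macsrc/address)
# ...and pin it on the pod bridge so newly enslaved pod interfaces can no longer change it
ip link set dev br0 address "$MAC"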

pmichali commented 6 years ago

Talked to @squeed and it sounds like the latest CNI bridge plugin should assign a MAC to the bridge, and local testing shows that to be the case. The latest plugin (0.7.1+) should force a MAC, rather than using the dynamic MAC assignment. The plugin will also "adopt" a bridge that was created prior to CNI plugin startup.

However, it doesn't appear that the CNI plugin deals with a corner case of an existing bridge that has no MAC address (all zeros). In that case, it seems that the MAC gets dynamically assigned, based on enslaved I/Fs.

I tried creating the bridge AFTER the CNI config was set up, but the bridge still had an all-zeros MAC. I created a second bridge, and that one did get a MAC assigned. I'm not sure why the first bridge did not (not enough time for the bridge plugin to start up?).

If the CNI plugin could handle this corner case, it would be very helpful. In the meantime, I'll try to manually set a MAC on the bridge by using an interface from a dummy veth pair.
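
To check which case a given bridge is in (a sketch, assuming the bridge is br0):

# An all-zeros address here means the bridge will adopt the MAC of whatever interface gets enslaved
cat /sys/class/net/br0/address
ip -d link show br0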

squeed commented 6 years ago

That was a fun problem to debug. Am I correct in recalling that brctl addbr created a bridge with a zero-mac, but ip link add did not?

pmichali commented 6 years ago

Unfortunately not. With both brctl addbr and ip link add I see a zero MAC. So, if the bridge already exists and has a zero MAC, it uses dynamic MAC assignment, which causes the issues.

There is still a problem, however, even with the 0.7.1 bridge plugin. I created the config and then immediately created a bridge, and it had a zero MAC. I created a second bridge, and it had a MAC. I tried this with both commands.

Is there some delay, before the config is actually used?

Is there a way to tell, when the updated CNI plugin is ready?

pmichali commented 6 years ago

I ran the E2E tests three times, using CNI 0.7.1 and creating the bridge right after the CNI plugin config file is created. Each time, the bridge started with a zero MAC but then changed during the test run, picking up pod MACs. Amazingly, I didn't see any kube-dns restarts, but on the fourth E2E run kube-dns restarts occurred, so I must have been lucky on the first three.

pmichali commented 6 years ago

@squeed Here is a summary and some questions...

Using CNI 0.7.1, if I have the k-d-c startup script create the CNI config file and then immediately add a bridge, the bridge does not get a MAC and later dynamically takes its MAC from enslaved interfaces. If I create a second bridge (much later, but before any pods are created), that bridge does get a MAC address.

Is there some delay, from when the CNI config is created, to when it will be used during bridge creates?

If so, is there a way we can tell (some condition) on when CNI is ready?

If I manually set a MAC on the bridge (using an interface from a veth pair), the bridge keeps that MAC and no problem occurs in the E2E tests.

With CNI 0.6.2 and 0.7.1, in IPv4 mode, the bridge created by k-d-c before the CNI config file is written has a MAC address. The k-d-c script uses brctl addbr to add the bridge. I also tried creating the bridge after the CNI config file is created (using ip link add) and it had a MAC as well.

Any thoughts on why the IPv4 behavior is different (why it gets a MAC in either case)?

Lastly, when I tried to use CNI 0.7.1 in k-d-c, which bakes that into the image created for nodes, I had to use this repo (which is version specific):

https://github.com/containernetworking/plugins/releases/download/v0.7.1/cni-plugins-"${ARCH}"-v0.7.1.tgz

instead of this repo, which is normally used:

https://storage.googleapis.com/kubernetes-release/network-plugins/cni-plugins-"${ARCH}"-v0.6.2.tgz

Is this version not available yet in Kubernetes? In other words, do I need to use the latter location for the time being?

pmichali commented 6 years ago

I'm making changes for k-d-c under https://github.com/kubernetes-sigs/kubeadm-dind-cluster/pull/135. Please review there.

If folks have responses to the questions above, please let me know so I can adjust my changes as needed.

pmichali commented 6 years ago

Fixed under https://github.com/kubernetes-sigs/kubeadm-dind-cluster/pull/144