hippware / wocky

Server side (Erlang/Elixir) software

Intermittent connectivity issues with K8s API servers #1164

Closed: toland closed this issue 6 years ago

toland commented 6 years ago

We have an intermittent connectivity issue with the K8s API servers. This shows up as long latencies and connection failures.

toland commented 6 years ago

Note that the internal API load balancer uses cross-zone load balancing (CZLB). That might have something to do with it.

toland commented 6 years ago

I have disabled CZLB on the internal Tectonic LBs. I will keep an eye on the dashboard and see if the connectivity issues persist.
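For the record, toggling that setting on a classic ELB is a one-liner along these lines (the load balancer name here is a placeholder):

# disable cross-zone load balancing on a classic ELB
aws elb modify-load-balancer-attributes \
    --load-balancer-name tectonic-api-internal \
    --load-balancer-attributes "{\"CrossZoneLoadBalancing\":{\"Enabled\":false}}"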

toland commented 6 years ago

Once Bernard set up PD, we got a much clearer view of these failures. They were still happening regularly with CZLB turned off, so that was not the issue. I have re-enabled CZLB on the internal Kube LBs.

One interesting note is that the incidents in PD have all resolved themselves within 5, 10, or 15 minutes. That sounds suspiciously like a timeout issue, since 5 minutes is a common timeout value. The ELB idle timeout is 6 minutes, and its connection draining period is 5 minutes.

Of course, it could also be a polling interval. The Prometheus Alertmanager has a config setting, resolve_timeout: 5m, which might also explain the timing. In fact, that probably does explain it.

I have worked on a few theories, but nothing has proven conclusive. So far, I have ruled out more than I have ruled in.

First is the CZLB issue. This appears to have been ruled out.

It could be a problem with the EC2 instances themselves. Looking at the metrics for the EC2 instances, I can see that there are periodic status check failures for the master nodes. I can't find a reason for these, and whatever causes them goes away on its own. While it is possible this is a contributing factor (or another symptom of the same problem), it can't explain all of the failures we have seen.
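For anyone following along, those failures show up as the StatusCheckFailed metric in CloudWatch; a rough way to pull them for one master (the instance ID and time window are placeholders):

aws cloudwatch get-metric-statistics \
    --namespace AWS/EC2 --metric-name StatusCheckFailed \
    --dimensions Name=InstanceId,Value=i-0123456789abcdef0 \
    --start-time 2017-12-06T00:00:00Z --end-time 2017-12-07T00:00:00Z \
    --period 300 --statistics Maximum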

Unfortunately, metrics collection was not turned on for the master ASG, so I couldn't tell if there was anything strange going on there. I have enabled detailed metrics for all of the ASGs going forward.
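For reference, turning on group metrics for an ASG amounts to something like this (the group name is a placeholder; 1Minute is the only supported granularity):

aws autoscaling enable-metrics-collection \
    --auto-scaling-group-name tectonic-masters \
    --granularity "1Minute"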

It could be another issue with the load balancers. The ELB metrics show corresponding spikes in the number of unhealthy nodes. There are also spikes in request count and request latency around the same times, which would lead me to suspect an application-level problem. But that doesn't make sense when you consider that there are also EC2 instance status check failures.
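Those numbers live in the AWS/ELB namespace as UnHealthyHostCount, RequestCount, and Latency; pulling them from the CLI looks roughly like this (LB name and window are placeholders again):

aws cloudwatch get-metric-statistics \
    --namespace AWS/ELB --metric-name UnHealthyHostCount \
    --dimensions Name=LoadBalancerName,Value=tectonic-api-internal \
    --start-time 2017-12-06T00:00:00Z --end-time 2017-12-07T00:00:00Z \
    --period 60 --statistics Maximum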

Another possibility is that the API server pods are misbehaving. However, I can't find any evidence of this. Looking at the master nodes in Tectonic, they appear to be healthy and very lightly loaded. The pods themselves don't have any resource usage limits, and also appear to be healthy. Looking at their logs, I see two types of events of interest:

W1206 18:18:43.130915       5 controller.go:386] Resetting endpoints for master service "kubernetes" to &{{ } {kubernetes  default /api/v1/namespaces/default/endpoints/kubernetes ac13caaf-7945-11e7-9d64-064a84c8a13c 72692125 0 2017-08-04 18:49:40.477200484 +0000 UTC <nil> <nil> map[component:apiserver provider:kubernetes] map[] [] nil [] } [{[{10.1.82.189  <nil> <nil>}] [] [{https 443 TCP}]}]}

This is a known issue that is resolved in Kubernetes 1.9. I haven't been able to find a workaround in the meantime. I don't think it is causing any issues, but I don't know that for sure.
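The flapping itself is easy to observe if anyone wants to see it in action; the affected object is the kubernetes Endpoints resource in the default namespace, and watching it should show the address being rewritten as each apiserver wins the race:

kubectl --namespace=default get endpoints kubernetes --output=yaml --watch

The second type of event is this pair: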

E1206 17:52:36.089538       5 watcher.go:210] watch chan error: etcdserver: mvcc: required revision has been compacted
W1206 17:52:36.089840       5 reflector.go:323] k8s.io/kubernetes/pkg/client/informers/informers_generated/internalversion/factory.go:72: watch of *extensions.ThirdPartyResource ended with: etcdserver: mvcc: required revision has been compacted

Note that this event generates both an error and a warning. I don't think that this would cause the issues we are seeing, but more research is called for.
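As a starting point for that research, a rough way to see how often the compaction errors occur on a given apiserver (the pod name below is a placeholder):

kubectl --namespace=kube-system logs kube-apiserver-ip-10-0-1-100.ec2.internal \
    | grep -c "required revision has been compacted"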

It could be related to the T2 instance type. Both etcd and the Kubernetes/Tectonic teams recommend m4.large instances. Looking at the CloudWatch metrics, I can't see any evidence of throttling, but this stuff is complicated and I am still learning the intricacies of EC2. I already have plans to upgrade all of the instances to m4.large, but I don't have a strong feeling that it will resolve these issues.
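The specific thing to watch for on T2s is the CPU credit balance bottoming out; something like this would show it (instance ID and window are placeholders):

aws cloudwatch get-metric-statistics \
    --namespace AWS/EC2 --metric-name CPUCreditBalance \
    --dimensions Name=InstanceId,Value=i-0123456789abcdef0 \
    --start-time 2017-12-01T00:00:00Z --end-time 2017-12-07T00:00:00Z \
    --period 3600 --statistics Minimum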

Finally, it is possible that there is a gremlin in the system. We could have a noisy neighbor problem or there might be a dodgy switch port. Network or CPU congestion might cause something like what we are seeing now. A noisy neighbor might explain the instance status check failures. This just doesn't feel right to me, but I can't rule it out.

There is nothing in Loggly that appears to be related to these issues, though we do have a needle-in-haystack problem there.

I have decided to enable logging on the load balancers and see if that proves enlightening. Other than that, I am a bit stuck as to where to go next. Maybe a packet dump?
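For the load balancer logging, something along these lines should do it for a classic ELB (LB name and bucket are placeholders, and the bucket needs the standard ELB log-delivery policy). If it comes to a packet dump, tcpdump on one of the masters filtered to the API port is probably where I'd start:

aws elb modify-load-balancer-attributes \
    --load-balancer-name tectonic-api-internal \
    --load-balancer-attributes "{\"AccessLog\":{\"Enabled\":true,\"S3BucketName\":\"wocky-elb-access-logs\",\"EmitInterval\":5}}"

# and, if needed, a capture on a master node (adjust the port if the apiserver listens elsewhere)
sudo tcpdump -i eth0 -w apiserver.pcap port 443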

Version 1.8 of Tectonic will land soon. It includes Kubernetes 1.8 and a number of operational changes. There is some chance that the upgrade and/or moving the instances to m4.large will resolve the issue. But I don't feel great about the "upgrade everything and cross your fingers" approach. However, it might be the "right thing" when treating servers as cattle.

bengtan commented 6 years ago

Keep up the good work, guys.

bernardd commented 6 years ago

I've posted a question about it here: https://github.com/coreos/tectonic-forum/issues/236. I didn't include much of what Phil found above because I didn't want to get it bogged down in potentially irrelevant minutiae. I'm hoping they can at least point us in the right direction on where to look.

bengtan commented 6 years ago

Related?

coreos/prometheus-operator#724

This is unfortunately a problem with HA Kubernetes apiserver implementation itself. The way Prometheus discovers targets in Kubernetes is through Endpoints objects. When there are multiple highly available apiservers, they keep overwriting each other causing Prometheus to not be able to scrape the apiservers. This has been fixed upstream, however, will only land in the 1.9 release. For the time being I recommend to silence the alert in the Alertmanager.

bengtan commented 6 years ago

Maybe another related issue?

coreos/tectonic-forum#230

(I searched Google for 'K8SApiserverDown'.)

toland commented 6 years ago

@bengtan Interesting. I knew about the API servers overwriting the endpoint. I have been tracking that issue for some time, but I didn't realize that it would impact Prometheus in this way.

We definitely saw other evidence of connectivity issues within the cluster. At this point, I can't say whether they were caused by the same HA API server issue or were just coincidental. I think for now I will silence the alert in Alertmanager for one month and keep an eye on the cluster for other issues. We can revisit the status of the bug when the silence expires.
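For the silence itself, the Alertmanager UI works fine, but for the record it would look roughly like this with amtool (the Alertmanager URL is a placeholder, and 720h is roughly one month):

amtool silence add alertname="K8SApiserverDown" \
    --duration="720h" \
    --comment="HA apiserver endpoint flapping; fixed upstream in Kubernetes 1.9" \
    --author="toland" \
    --alertmanager.url="http://localhost:9093"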