kubernetes-retired / kube-aws

[EOL] A command-line tool to declaratively manage Kubernetes clusters on AWS
Apache License 2.0

ELB IP changes can bring the cluster down #598

Closed danielfm closed 5 years ago

danielfm commented 7 years ago

I ran into https://github.com/kubernetes/kubernetes/issues/41916 twice in the last 3 days in my production cluster, with almost 50% of worker nodes transitioning to NotReady state almost simultaneously on both days, causing a brief downtime in critical services due to Kubernetes' default (and aggressive) eviction policy for failing nodes.

I just contacted AWS support to validate the hypothesis of the ELB changing IPs at the time of both incidents, and the answer was yes.

My configuration (multi-node control plane with ELB) matches exactly the one in that issue, and probably most kube-aws users are subject to this.

Has anyone else run into this at some point?

danielfm commented 7 years ago

I just experienced another ELB issue that the script proposed by @mumoshu apparently isn't able to circumvent: the failure of one or more ELB nodes. When this happened, the DNS records for the affected ELB did not change (as reported by the systemd unit logs), which means the script cannot recover the affected kubelets.

Moving to NLBs (#937) might help solve these issues altogether.

redbaron commented 7 years ago

I guess nothing can help if part of the underlying infrastructure fails; the same thing can probably happen to an NLB.

danielfm commented 7 years ago

Probably, but then AWS won't be able to point their fingers back at us when something like this happens.

whereisaaron commented 6 years ago

AWS's NLB replacement for ELBs uses one IP per zone/subnet, and those IPs can be your EIPs. So with this new product you can get an LB with a set of fixed IPs that won't change.

http://docs.aws.amazon.com/elasticloadbalancing/latest/network/introduction.html

AWS's new ALB (for HTTPS) and NLB (for TCP) seem to be AWS's next-gen replacements for the old ELB, which AWS now calls 'Classic Load Balancers'. k8s and kube-aws should probably look to transition to the new products, which also appear to have some advantages, such as fixed IPs - as I see #937 and #945 are doing! 🎉

mumoshu commented 6 years ago

@whereisaaron Thanks for the suggestion! I agree with your point. Anyway, please let me also add that ALB was experimented with in #608 and judged not appropriate for a K8S API load balancer.

rodawg commented 6 years ago

Unfortunately NLBs don't support VPC Peering on AWS, so some users (including me) will need to use Classic ELBs in conjunction with NLBs to support kubectl commands.

stephbu commented 6 years ago

Yes, we see this today in production, and we experienced player impact yesterday from this exact issue. Working with AWS support, we reproduced the issue by forcing a scale-down on the API ELB for one of our integration clusters. All worker nodes went stale and workloads were evicted before the nodes recovered at the 15-minute mark after the scaling event.

We confirmed that the DNS was updated almost immediately. We're going with the Kubelet restart tied to DNS change for the time being, but IMHO this is not a good long-term fix.
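
For reference, the general shape of that workaround (watch the API endpoint's DNS answer and bounce the kubelet when it changes) might look something like the sketch below. This is not the script from this thread; the hostname, polling interval, and systemd restart command are assumptions.

```go
// Hypothetical sketch of the "restart kubelet when the API endpoint's DNS
// answer changes" workaround. Hostname, interval, and restart command are
// assumptions, not the actual script referenced in this thread.
package main

import (
	"log"
	"net"
	"os/exec"
	"reflect"
	"sort"
	"time"
)

// resolve returns the sorted A records for host, or nil on failure.
func resolve(host string) []string {
	ips, err := net.LookupHost(host)
	if err != nil {
		log.Printf("lookup %s failed: %v", host, err)
		return nil
	}
	sort.Strings(ips)
	return ips
}

func main() {
	const apiEndpoint = "kube-api.example.internal" // placeholder ELB DNS name
	last := resolve(apiEndpoint)

	for range time.Tick(30 * time.Second) {
		current := resolve(apiEndpoint)
		if current == nil || reflect.DeepEqual(current, last) {
			continue
		}
		log.Printf("API endpoint IPs changed %v -> %v; restarting kubelet", last, current)
		// Assumes a systemd-managed kubelet.
		if err := exec.Command("systemctl", "restart", "kubelet").Run(); err != nil {
			log.Printf("kubelet restart failed: %v", err)
		}
		last = current
	}
}
```

As noted earlier in the thread, this only helps when the DNS answer actually changes; it does nothing for a failing ELB node that keeps its records.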

javefang commented 6 years ago

Seen this today. Our setup uses Consul DNS for the kubelet to discover the apiserver, which means the apiserver DNS name is a set of A records pointing to the exact IP addresses of the apiservers, and those change every time an apiserver node is replaced.

In our case the workers came back eventually, but it took a long while. My feeling is that the kubelet is not really respecting DNS TTLs, as all our Consul DNS names have a TTL of 0. Can anyone confirm?

mumoshu commented 6 years ago

Thanks everyone. At this point, would the only possible, universal workaround be the one shared by @roybotnik? (At least mine won't work in @javefang's case, of course.)

mumoshu commented 6 years ago

I was under the impression that since some k8s version the kubelet has implemented a client-side timeout to mitigate this issue, but I can't remember the exact GitHub issue right now.

javefang commented 6 years ago

I noticed that after the master DNS record changed its underlying IP, all kubelet instances failed for exactly 15 minutes (our master DNS TTL is 0). When it fails we get the following error:

Nov 15 13:08:08 dev-kubeworker-gen-0 kubelet[11954]: E1115 13:08:08.638348   11954 kubelet_node_status.go:390] Error updating node status, will retry: error getting node "dev-kubeworker-gen-0": Get https://apiserver-gen.service.dev.winton.consul/api/v1/nodes/dev-kubeworker-gen-0: net/http: request canceled (Client.Timeout exceeded while awaiting headers)

It recovered on its own, without restarting, after 15 minutes sharp. It feels more like the kubelet (or the apiserver client it uses) is caching DNS. I'm trying to pinpoint the exact line of code that causes this behaviour, but anyone who knows the code base better might be able to confirm this.
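
The "caching" is most likely not DNS caching at all but connection reuse: Go's net/http transport pools TCP connections and only resolves DNS when it dials a new one, so a TTL of 0 has no effect on requests that ride an existing connection. A small, self-contained illustration (placeholder URL, not kubelet code):

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptrace"
)

func main() {
	client := &http.Client{Transport: &http.Transport{}} // keep-alives on by default

	trace := &httptrace.ClientTrace{
		DNSStart: func(httptrace.DNSStartInfo) { fmt.Println("resolving DNS") },
		GotConn: func(info httptrace.GotConnInfo) {
			fmt.Printf("connection reused: %v\n", info.Reused)
		},
	}

	for i := 0; i < 3; i++ {
		req, _ := http.NewRequest("GET", "https://example.com/", nil) // placeholder endpoint
		req = req.WithContext(httptrace.WithClientTrace(req.Context(), trace))
		resp, err := client.Do(req)
		if err != nil {
			fmt.Println("request error:", err)
			continue
		}
		io.Copy(io.Discard, resp.Body) // drain so the connection can be reused
		resp.Body.Close()
	}
	// Typically only the first request prints "resolving DNS"; the others reuse
	// the pooled connection, so record TTLs never come into play.
}
```

That would match the behaviour above: the client only goes back to DNS once the dead connection is finally torn down, and then everything recovers at once.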

javefang commented 6 years ago

I saw the following messages right before the worker came back. The last failure was at 13:15:18, then it reported some watch errors (10.106.102.105 was the previous master, which got destroyed) and re-resolved the DNS name before the cluster reported the worker as "Ready" again! Maybe this is related to the kubelet's watch on the apiserver not being dropped quickly enough when the apiserver endpoint becomes unavailable?

Nov 15 13:15:16 dev-kubeworker-gen-0 kubelet[11954]: I1115 13:15:16.994725   11954 qos_container_manager_linux.go:320] [ContainerManager]: Updated QoS cgroup configuration
Nov 15 13:15:18 dev-kubeworker-gen-0 kubelet[11954]: E1115 13:15:18.670478   11954 kubelet_node_status.go:390] Error updating node status, will retry: error getting node "dev-kubeworker-gen-0": Get https://apiserver-gen.service.dev.test.consul/api/v1/nodes/dev-kubeworker-gen-0: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
Nov 15 13:15:22 dev-kubeworker-gen-0 kubelet[11954]: E1115 13:15:22.640287   11954 streamwatcher.go:109] Unable to decode an event from the watch stream: read tcp 10.106.102.104:49178->10.106.102.105:443: read: no route to host
Nov 15 13:15:22 dev-kubeworker-gen-0 kubelet[11954]: E1115 13:15:22.640410   11954 kubelet_node_status.go:390] Error updating node status, will retry: error getting node "dev-kubeworker-gen-0": Get https://apiserver-gen.service.dev.test.consul/api/v1/nodes/dev-kubeworker-gen-0: read tcp 10.106.102.104:49178->10.106.102.105:443: read: no route to host
Nov 15 13:15:22 dev-kubeworker-gen-0 kubelet[11954]: E1115 13:15:22.640943   11954 streamwatcher.go:109] Unable to decode an event from the watch stream: read tcp 10.106.102.104:49178->10.106.102.105:443: read: no route to host
Nov 15 13:15:22 dev-kubeworker-gen-0 kubelet[11954]: E1115 13:15:22.641445   11954 streamwatcher.go:109] Unable to decode an event from the watch stream: read tcp 10.106.102.104:49178->10.106.102.105:443: read: no route to host
Nov 15 13:15:22 dev-kubeworker-gen-0 kubelet[11954]: W1115 13:15:22.663747   11954 reflector.go:334] k8s.io/kubernetes/pkg/kubelet/kubelet.go:413: watch of *v1.Service ended with: too old resource version: 1883 (16328)
Nov 15 13:15:22 dev-kubeworker-gen-0 kubelet[11954]: W1115 13:15:22.665010   11954 reflector.go:334] k8s.io/kubernetes/pkg/kubelet/kubelet.go:422: watch of *v1.Node ended with: too old resource version: 16145 (16328)
Nov 15 13:15:22 dev-kubeworker-gen-0 kubelet[11954]: W1115 13:15:22.665806   11954 reflector.go:334] k8s.io/kubernetes/pkg/kubelet/config/apiserver.go:47: watch of *v1.Pod ended with: too old resource version: 5602 (16345)

Found a possible line of code that explains the 15-minute behaviour:

https://github.com/kubernetes/kubernetes/blob/fc8bfe2d8929e11a898c4557f9323c482b5e8842/pkg/kubelet/kubeletconfig/watch.go#L44

whereisaaron commented 6 years ago

It seems like there is a problem. If the controller DNS entry has a 30-second TTL, the kubelet should be able to recover from an IP change within 30s plus the update period, so about 40s. @javefang, do you think the kubelet is using this long, up-to-15-minute back-off once the old IP goes stale? So not a DNS caching problem, but rather it just stops trying to update the controller for several minutes?

For AWS at least, an NLB using fixed EIP addresses would mostly prevent the IP address from ever changing, I think? Even if you recreate or move the LB, you can reapply the EIP so nothing changes. However, an extra wrinkle is that we would want worker nodes in multi-AZ clusters to use the EIP for the NLB endpoint in the same AZ. NLBs have one EIP per AZ, as I understand it?

We saw a similar issue a couple of times where the workers couldn't contact the controllers for ~2 minutes (no IP address change involved). Even though that is well under the 5-minute eviction time, everything got evicted anyway. Maybe the same back-off issue?

javefang commented 6 years ago

@whereisaaron yep, it is indeed taking 15 minutes for the kubelet to recover. I have reproduced it with the following setup:

To reproduce:

  1. Destroy the VM running apiserver 1
  2. Create a new VM to replace it (this will get a different IP)
  3. 30% of the worker nodes go into "NotReady" state, and kubelet prints the error message kubelet_node_status.go:390] Error updating node status, will retry: error getting node "dev-kubeworker-gen-0": Get https://apiserver-gen.service.dev.winton.consul/api/v1/nodes/dev-kubeworker-gen-0: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
  4. Repeat 1-3 for the other 2 apiservers
  5. Now all workers should be in "NotReady" state
  6. Wait 15 minutes; kubelet on the workers prints the unable to decode an event from the watch stream and read: no route to host messages before coming back to the "Ready" state

I'm just curious what mechanism in the kubelet can cause it to be broken for 15 minutes after any apiserver IP change. We are deploying this on-premises. Tomorrow I'll try to put the 3 apiservers behind a load balancer with a fixed IP to see if that fixes the issue.

javefang commented 6 years ago

UPDATE: putting all apiservers behind a load balancer (round-robin) with a static IP fixed it. Now all workers work fine even if I replace one of the master nodes, so using a fixed-IP load balancer will be my workaround for now. But do you think it's still worth investigating why the kubelet doesn't respect the apiserver's DNS TTL?

RyPeck commented 6 years ago

I believe the 15-minute break many of us are experiencing is described in https://github.com/kubernetes/kubernetes/issues/41916#issuecomment-312428731. Reading through the issues and pull requests, I don't see where a TCP timeout was implemented on the underlying connection; the timeout on the HTTP request definitely was.
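
In Go terms, the two timeouts live at different layers: a per-request timeout bounds each HTTP call, while detecting that the peer behind an already-established, mostly idle connection has silently gone away is left to TCP keepalives or the kernel's retransmission limit. A hedged sketch of where each knob sits (not the kubelet's actual client construction):

```go
package main

import (
	"net"
	"net/http"
	"time"
)

// newClient shows, roughly, which layer each timeout belongs to.
func newClient() *http.Client {
	dialer := &net.Dialer{
		Timeout:   10 * time.Second, // bounds the TCP connect itself
		KeepAlive: 30 * time.Second, // TCP keepalive probes on established connections
	}
	transport := &http.Transport{
		DialContext:           dialer.DialContext,
		TLSHandshakeTimeout:   10 * time.Second,
		ResponseHeaderTimeout: 10 * time.Second, // per-request: wait for response headers
	}
	// Deliberately no Client.Timeout: it would also cut off long-running
	// watch streams, which is the same trade-off the kubelet faces.
	return &http.Client{Transport: transport}
}

func main() {
	_ = newClient()
}
```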

frankconrad commented 6 years ago

All of these work around the real problem: the connections are kept open forever. If we limited the lifetime of a connection by time, or at least by the number of requests handled, the problem would not happen. We would also get better load distribution, because new connections let the load balancer make new distribution decisions.

liggitt commented 6 years ago

> the connections are kept open forever. If we limited the lifetime of a connection by time, the problem would not happen.

They don't live forever; they live for the operating system's TCP timeout limit (typically 15 minutes by default).

danielfm commented 6 years ago

I haven't seen this happening anymore in some of the latest versions of Kubernetes 1.8.x (and I suspect the same is true for newer versions as well), so maybe we can close this?

frankconrad commented 6 years ago

Yes, and those 15 minutes are too long for many cases, like this one. Connections to ELB/ALB nodes go dead when those nodes are terminated after having been deprecated (no longer visible in DNS) for 6 days. If we reconnected every hour (or every 10 minutes) we would not have this problem, and as a side effect we would get better load distribution, while still keeping all the benefits of keepalive. What has been done here works around the real problem: no dynamic cloud-based load balancer can handle long-lived connections well. The problem also needs to be fixed in the HTTP connection handling/pooling, since at the higher level there is no real control over connection reuse if you use the pooling feature.

liggitt commented 6 years ago

The fix merged into the last several releases of kubernetes was to drop/reestablish the apiserver connections from the kubelet if the heartbeat times out twice in a row. Reconnecting every 10 minutes or every hour would still let nodes go unavailable.
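
A much-simplified sketch of that pattern (not the actual kubelet code; among other things, the real fix also tears down in-flight connections, not just idle ones):

```go
package main

import (
	"context"
	"log"
	"net/http"
	"time"
)

// heartbeatLoop drops pooled connections after two consecutive heartbeat
// failures, so the next attempt dials (and re-resolves) the endpoint afresh.
func heartbeatLoop(client *http.Client, transport *http.Transport, url string) {
	timeouts := 0
	for range time.Tick(10 * time.Second) {
		ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
		req, _ := http.NewRequestWithContext(ctx, "GET", url, nil)
		resp, err := client.Do(req)
		cancel()
		if err == nil {
			resp.Body.Close()
			timeouts = 0
			continue
		}
		timeouts++
		log.Printf("heartbeat failed (%d in a row): %v", timeouts, err)
		if timeouts >= 2 {
			transport.CloseIdleConnections() // force a fresh dial on the next request
			timeouts = 0
		}
	}
}

func main() {
	transport := &http.Transport{}
	client := &http.Client{Transport: transport}
	heartbeatLoop(client, transport, "https://apiserver.example.internal/healthz") // placeholder URL
}
```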

frankconrad commented 6 years ago

From what I've seen in other Go projects, if you use pooling and send requests frequently enough that the keepalive idle timeout is never reached, you run into this issue. If you disable pooling and make only one request per connection, you don't have that issue, but you get higher latency and overhead, which is why keepalive makes sense.

By the way, the old Apache httpd had not only a keepalive idle timeout but also a keepalive max request count, which helped a lot with many of these problems.
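
Go's client-side net/http has no direct equivalent of Apache's keepalive max request count, but a connection's idle lifetime can at least be bounded; a minimal sketch with made-up numbers:

```go
package main

import (
	"net/http"
	"time"
)

func main() {
	transport := &http.Transport{
		IdleConnTimeout:     90 * time.Second, // close pooled connections once idle this long
		MaxIdleConnsPerHost: 2,                // keep the per-backend pool small
	}
	_ = &http.Client{Transport: transport}
	// Bounding idle lifetime nudges traffic onto fresh connections (and fresh
	// load-balancer decisions), but it does not by itself tear down a busy,
	// long-lived watch connection.
}
```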

fejta-bot commented 5 years ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle stale

fejta-bot commented 5 years ago

Stale issues rot after 30d of inactivity. Mark the issue as fresh with /remove-lifecycle rotten. Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle rotten

fejta-bot commented 5 years ago

Rotten issues close after 30d of inactivity. Reopen the issue with /reopen. Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /close

k8s-ci-robot commented 5 years ago

@fejta-bot: Closing this issue.

In response to [this](https://github.com/kubernetes-incubator/kube-aws/issues/598#issuecomment-505131166):

> Rotten issues close after 30d of inactivity.
> Reopen the issue with `/reopen`.
> Mark the issue as fresh with `/remove-lifecycle rotten`.
>
> Send feedback to sig-testing, kubernetes/test-infra and/or [fejta](https://github.com/fejta).
> /close

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.