kubernetes / kubernetes

Production-Grade Container Scheduling and Management
https://kubernetes.io
Apache License 2.0

GCE Internal Load Balancer loses nodes #69362

Open astorath opened 6 years ago

astorath commented 6 years ago

Is this a BUG REPORT or FEATURE REQUEST?:

Uncomment only one, leave it on its own line:

/kind bug /kind feature

What happened: We are using GKE with an internal load balancer (cloud.google.com/load-balancer-type: "internal") and preemptible instances. Sometimes, when instances are recreated, the new instance is not added to the Google load balancer. We have an nginx ingress behind this balancer, so the cluster loses its ingress in that case.

What you expected to happen:

How to reproduce it (as minimally and precisely as possible):

  1. Create GKE cluster with --preemptible flag
  2. Add nginx ingress with cloud.google.com/load-balancer-type: "internal" (see the example manifest after these steps)
  3. ...
  4. Wait?
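
For reference, a Service manifest matching step 2 might look roughly like the following. This is only a sketch: the annotation comes from this report, while the name, selector, and ports are placeholders.

```yaml
# Hypothetical Service fronting the nginx ingress controller.
# Only the load-balancer-type annotation is taken from this issue;
# everything else is a placeholder.
apiVersion: v1
kind: Service
metadata:
  name: nginx-ingress-controller
  annotations:
    cloud.google.com/load-balancer-type: "internal"
spec:
  type: LoadBalancer
  selector:
    app: nginx-ingress
  ports:
    - name: http
      port: 80
      targetPort: 80
    - name: https
      port: 443
      targetPort: 443
```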

Anything else we need to know?: My assumption is that this happens when the instance running the ingress controller goes down, but I'm not sure. I have not found any traces in the GKE/GCE logs with errors of any kind to debug this.

Google uses its own instance group to manage the internal load balancer. When this happens, the instance group is missing an instance (e.g. it shows 2/2 instead of 3/3 instances available). When this happens several times in a row, the instance group loses all of its instances (0/0 available).

Environment:

/sig gcp /sig node /sig cloud-provider

fejta-bot commented 5 years ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle stale

fejta-bot commented 5 years ago

Stale issues rot after 30d of inactivity. Mark the issue as fresh with /remove-lifecycle rotten. Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle rotten

fejta-bot commented 5 years ago

Rotten issues close after 30d of inactivity. Reopen the issue with /reopen. Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /close

k8s-ci-robot commented 5 years ago

@fejta-bot: Closing this issue.

In response to [this](https://github.com/kubernetes/kubernetes/issues/69362#issuecomment-468902568):

> Rotten issues close after 30d of inactivity.
> Reopen the issue with `/reopen`.
> Mark the issue as fresh with `/remove-lifecycle rotten`.
>
> Send feedback to sig-testing, kubernetes/test-infra and/or [fejta](https://github.com/fejta).
> /close

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.

dmccaffery commented 5 years ago

We are also experiencing this issue. If the node pool running the ingress controllers (or, in our case, Istio gateways) is preemptible, the nodes are removed from the generated instance groups, and the replacement nodes are not added to the IGs when they do come back online. We experience this in our development clusters, where all nodes are preemptible.

dmccaffery commented 5 years ago

/reopen (not sure if this will work)

astorath commented 5 years ago

/remove-lifecycle rotten /reopen

k8s-ci-robot commented 5 years ago

@astorath: Reopened this issue.

In response to [this](https://github.com/kubernetes/kubernetes/issues/69362#issuecomment-478906035):

> /remove-lifecycle rotten
> /reopen

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.

thockin commented 5 years ago

/lifecycle frozen /remove-lifecycle stale /remove-lifecycle rotten

freehan commented 5 years ago

Here is my theory:

  1. An ILB is set up via a LoadBalancer-type Service. The service controller creates the ILB, which targets an unmanaged instance group that includes all the cluster nodes.
  2. One VM gets preempted. I think GCE will remove the VM from its instance groups.
  3. If the following conditions apply, the node object of the VM is preserved: A. the VM comes back very quickly (in ~40 seconds, before the node is marked NotReady and before the service controller's periodic node sync); B. the VM is assigned the same Pod CIDR (if it is different, the node object is deleted).
  4. If there is no GCE Ingress in the cluster, only the service controller manages the instance group. When the service controller runs its periodic node sync, it hits this (https://github.com/kubernetes/kubernetes/blob/release-1.15/pkg/controller/service/service_controller.go#L636): it does not see a node difference, so there is no service to update, even though the VM has been deleted from the instance group (see the sketch after this list).
  5. If all the cluster nodes get removed this way, or the service has externalTrafficPolicy=Local, the service is disrupted.
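
To make step 4 concrete, here is a rough, self-contained sketch (not the actual service controller code; all names are placeholders) of how a name-only node comparison misses a preempted VM that kept its node name and Pod CIDR:

```go
package main

import (
	"fmt"
	"sort"
)

// Node is a simplified stand-in for the node objects the service
// controller compares during its periodic sync.
type Node struct {
	Name   string
	BootID string // changes when a VM is preempted and recreated
}

// namesEqual mimics a name-only comparison: a preempted VM that comes
// back with the same node name (and the same Pod CIDR, so the Node
// object survives) looks identical to the old one.
func namesEqual(old, cur []Node) bool {
	if len(old) != len(cur) {
		return false
	}
	a := make([]string, len(old))
	b := make([]string, len(cur))
	for i := range old {
		a[i] = old[i].Name
		b[i] = cur[i].Name
	}
	sort.Strings(a)
	sort.Strings(b)
	for i := range a {
		if a[i] != b[i] {
			return false
		}
	}
	return true
}

func main() {
	cached := []Node{{Name: "gke-pool-node-1", BootID: "boot-1"}}
	// The VM was preempted, dropped from the GCE instance group, and came
	// back quickly with the same name but a new boot ID.
	current := []Node{{Name: "gke-pool-node-1", BootID: "boot-2"}}

	if namesEqual(cached, current) {
		// Per the theory above, no difference is detected, so the load
		// balancer is never re-synced and the VM stays out of the group.
		fmt.Println("no node change detected; skipping load balancer sync")
	}
}
```

With a check like this, a VM that is silently dropped from the instance group never triggers a re-sync, which matches the behavior described in step 4.
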
bradhoekstra commented 5 years ago

I've had a preemptible GKE cluster (1.12.9-gke.7) running for almost a week and have not been able to reproduce this issue.

For those who've experienced this issue, does it still happen on newer versions of Kubernetes? On average, how many days before this issue is seen?

A possible solution is to check that the Node bootIDs are still the same in nodeSlicesEqualForLB, as preempted Nodes come back with a new bootID.
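
A minimal sketch of that suggestion, assuming the nodes are keyed on name plus bootID (this is an illustrative helper, not the real nodeSlicesEqualForLB signature):

```go
package lbsync

import (
	v1 "k8s.io/api/core/v1"
)

// nodeKey pairs the node name with its boot ID, so a preempted VM that
// returns under the same name still registers as a change. Illustrative
// helper only; not the upstream implementation.
func nodeKey(n *v1.Node) string {
	return n.Name + "/" + n.Status.NodeInfo.BootID
}

// nodesEqualForLB returns false when any node was rebooted or preempted,
// which would force a load-balancer node sync that re-adds the instance.
func nodesEqualForLB(old, cur []*v1.Node) bool {
	if len(old) != len(cur) {
		return false
	}
	seen := make(map[string]struct{}, len(old))
	for _, n := range old {
		seen[nodeKey(n)] = struct{}{}
	}
	for _, n := range cur {
		if _, ok := seen[nodeKey(n)]; !ok {
			return false
		}
	}
	return true
}
```

Because a preempted VM comes back with a new bootID, a comparison like this would report a difference, and the subsequent sync would re-add the instance to the load balancer's instance group.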

cuzzo333 commented 5 years ago

@bradhoekstra I had hoped this issue was quietly resolved, but one of our GKE clusters (v1.12.7-gke.25) experienced it last week. Previously we had seen this issue with much greater frequency, though it was still intermittent: some weeks we'd have multiple clusters with the issue, and other weeks it might happen just once or not at all. Sometimes I'd notice that a load balancer had lost all but one of its nodes, so while the issue was still occurring, it had not yet caused a noticeable outage, as ingress was still working over the one attached node.

astorath commented 5 years ago

I was just about to report that the issue is gone, but it's not... I've just seen a missing node (one node had disappeared from the group) in the cluster...

cuzzo333 commented 5 years ago

Alas, after several months of not seeing this issue and thinking it was finally resolved, we've just had two clusters experience the problem. Both clusters are on GKE version v1.12.10-gke.15.

dmccaffery commented 5 years ago

Holy old version Batman! Upgrade your dev cluster stacks, @cuzzo333. ;)

hypnoglow commented 4 years ago

Still experiencing the issue on GKE 1.14.9-gke.2

chrisob commented 4 years ago

Also hitting this on v1.14.10-gke.24.

thanatchakromsang commented 4 years ago

Still getting this error for a preemptible node pool on GKE 1.15.9-gke.24 with an internal load balancer.

cheftako commented 3 years ago

/cc @jpbetz /triage accepted

ajsfidelis commented 2 years ago

I believe I've found an unexpected behavior related to this issue in the function ensureInternalInstanceGroup:

It compares the contents of gceNodes and kubeNodes to decide which nodes to add to or remove from the instance group. The former is obtained from the GCE instance group, the latter from Kubernetes.

However, each node may be represented differently in each set: kubeNodes come with the FQDN (or whatever is populated in /etc/hostname), while gceNodes always contains only the instance name as shown in GCE. Thus, none of the nodes in one set is found in the other.

What happens next: all the current nodes (gceNodes) are removed from the instance group, then all members of the updated node list (kubeNodes) are added back to the same instance group.
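
A small standalone illustration of that mismatch (a simplified stand-in, not the actual ensureInternalInstanceGroup code; the node names are placeholders): when one side holds FQDNs and the other bare instance names, the set difference marks every node as both missing and stale, producing the remove-all/add-all churn described above.

```go
package main

import "fmt"

// diff returns the members of a that are missing from b.
func diff(a, b []string) []string {
	in := make(map[string]struct{}, len(b))
	for _, s := range b {
		in[s] = struct{}{}
	}
	var out []string
	for _, s := range a {
		if _, ok := in[s]; !ok {
			out = append(out, s)
		}
	}
	return out
}

func main() {
	// Node name as seen by Kubernetes (taken from /etc/hostname; here an
	// FQDN-style placeholder).
	kubeNodes := []string{"gke-pool-node-1.c.example-project.internal"}
	// The same node as listed in the GCE instance group (bare instance name).
	gceNodes := []string{"gke-pool-node-1"}

	toAdd := diff(kubeNodes, gceNodes)    // every Kubernetes node looks "missing"
	toRemove := diff(gceNodes, kubeNodes) // every GCE instance looks "stale"

	fmt.Println("add:", toAdd, "remove:", toRemove)
	// With matching names, both slices would be empty and the instance
	// group would be left untouched.
}
```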

On the Kubernetes clusters we are running, this happened for every service using an ILB, even though the instance group is the same for all of them.

This problem becomes very visible on clusters with a large number of ILB services, as every time a node is added to or removed from the cluster, all those services are updated, one at a time.

The function ensureInternalInstanceGroup remains unchanged so far. I could not confirm whether it is still in use in newer Kubernetes releases, or whether it is now called just once after a change in the number of nodes, as we are still running an older version.

Meanwhile, we are working around this issue by ensuring that /etc/hostname and the instance name have exactly the same value.

k8s-triage-robot commented 1 year ago

This issue has not been updated in over 1 year, and should be re-triaged.

You can:

  - Confirm that this issue is still relevant with `/triage accepted` (org members only)
  - Close this issue with `/close`

For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/

/remove-triage accepted

k8s-ci-robot commented 1 year ago

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.
thockin commented 1 year ago

@bowei for triage