Open astorath opened 6 years ago
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale
Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten
Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close
@fejta-bot: Closing this issue.
We are also experiencing this issue. If the node pool running the ingress controllers (or in our case, Istio gateways) is preemptible, the preempted nodes are removed from the generated instance groups, but the replacement nodes are not added back to the IGs when they come online. We see this in our development clusters, where all nodes are preemptible.
/reopen (not sure if this will work)
/remove-lifecycle rotten /reopen
@astorath: Reopened this issue.
/lifecycle frozen /remove-lifecycle stale /remove-lifecycle rotten
Here is my theory:
I've had a preemptible GKE cluster (1.12.9-gke.7) running for almost a week and have not been able to reproduce this issue.
For those who've experienced this issue, does it still happen on newer versions of Kubernetes? On average, how many days before this issue is seen?
A possible solution is to check that the Node bootIDs are still the same in nodeSlicesEqualForLB, as preempted Nodes come back with a new bootID.
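For what it's worth, here is a minimal sketch of what such a bootID-aware comparison could look like. This is not the upstream nodeSlicesEqualForLB; the function and type names are made up for illustration, and the only assumption is that a preempted node keeps its name but comes back with a new Status.NodeInfo.BootID:

```go
// Sketch only: not the upstream nodeSlicesEqualForLB. The idea is to treat a
// node whose bootID has changed (a preempted node that came back) as different,
// so the load balancer gets re-synced.
package sketch

import v1 "k8s.io/api/core/v1"

// nodeKey pairs a node name with its bootID; preemption keeps the name but
// changes the bootID.
type nodeKey struct {
	name   string
	bootID string
}

// nodeSlicesEqualForLBWithBootID reports whether two node slices are equal for
// load-balancer purposes, counting a changed bootID as a difference.
func nodeSlicesEqualForLBWithBootID(x, y []*v1.Node) bool {
	if len(x) != len(y) {
		return false
	}
	seen := make(map[nodeKey]struct{}, len(x))
	for _, n := range x {
		seen[nodeKey{n.Name, n.Status.NodeInfo.BootID}] = struct{}{}
	}
	for _, n := range y {
		if _, ok := seen[nodeKey{n.Name, n.Status.NodeInfo.BootID}]; !ok {
			return false
		}
	}
	return true
}
```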
@bradhoekstra I had hoped this issue was quietly resolved, but one of our GKE clusters (v1.12.7-gke.25) experienced it last week. Previously we had seen this issue with much greater frequency, though it was still intermittent: some weeks we'd have multiple clusters affected, other weeks it might happen just once or not at all. Sometimes I'd notice that a load balancer had lost all but one of its nodes, so while the issue was still occurring, it had not yet caused a noticeable outage, as ingress was still working over the one attached node.
I wanted to report that the issue is gone, but it's not... I just saw a missing node (one node had disappeared from the group) in the cluster...
Alas, after several months of not seeing this issue and thinking it was finally resolved, we've had two clusters just experience the problem. Both clusters are on GKE version v1.12.10-gke.15.
Holy old version Batman! Upgrade your dev cluster stacks, @cuzzo333. ;)
Still experiencing the issue on GKE 1.14.9-gke.2
Also hitting this on v1.14.10-gke.24.
Still getting these errors for a preemptible node pool on GKE 1.15.9-gke.24 with an internal load balancer.
/cc @jpbetz /triage accepted
I believe I've found an unexpected behavior related to this issue in the function ensureInternalInstanceGroup:
It compares the contents of gceNodes and kubeNodes to decide which nodes to add to or remove from the Instance Group. The former is obtained from the GCE Instance Group, the latter from Kubernetes.
However, each node may be represented differently in each set: kubeNodes comes with the FQDN (or whatever is populated in its /etc/hostname), while gceNodes always has only the instance name as shown in GCE. Thus, none of the nodes in one set is found in the other.
What happens next: all the current nodes (gceNodes) are removed from the Instance Group, then all members of the updated node list (kubeNodes) are added to the same Instance Group.
On the Kubernetes clusters we are running, it happened for every service using ILB, even though the InstanceGroup is the same for all of them.
This problem becomes very visible on clusters with a large number of ILB services: every time a node is added to or removed from the cluster, all of those services are updated, one at a time.
The function ensureInternalInstanceGroup remains unchanged so far. I could not confirm whether it is still in use in newer Kubernetes releases, or whether it is now called just once after a change in the number of nodes, as we are still running an older version.
Meanwhile, we are working around this issue by ensuring that /etc/hostname and the instance name have exactly the same value (see the sketch below).
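To make the mismatch concrete, here is an illustrative sketch of the set comparison with a short-name normalization that mirrors our /etc/hostname workaround. It is not the real ensureInternalInstanceGroup, and all names are hypothetical:

```go
// Illustrative sketch, not the real ensureInternalInstanceGroup. It shows why
// comparing Kubernetes node names (often FQDNs taken from /etc/hostname)
// against bare GCE instance names makes every node look like both a removal
// and an addition, and how stripping the domain suffix keeps the sets aligned.
package sketch

import "strings"

// shortName strips any domain suffix so "node-1.c.my-project.internal"
// compares equal to the GCE instance name "node-1".
func shortName(hostname string) string {
	if i := strings.IndexByte(hostname, '.'); i >= 0 {
		return hostname[:i]
	}
	return hostname
}

// diffInstanceGroup returns which instances would be added to and removed from
// the instance group, given the node names reported by Kubernetes and the
// instance names currently in the GCE instance group.
func diffInstanceGroup(kubeNodes, gceNodes []string) (toAdd, toRemove []string) {
	kube := make(map[string]struct{}, len(kubeNodes))
	for _, n := range kubeNodes {
		kube[shortName(n)] = struct{}{} // without shortName, an FQDN never matches a bare instance name
	}
	gce := make(map[string]struct{}, len(gceNodes))
	for _, n := range gceNodes {
		gce[n] = struct{}{}
		if _, ok := kube[n]; !ok {
			toRemove = append(toRemove, n) // not known to Kubernetes under this name: drop it
		}
	}
	for n := range kube {
		if _, ok := gce[n]; !ok {
			toAdd = append(toAdd, n) // missing from the instance group: add it
		}
	}
	return toAdd, toRemove
}
```

With shortName in place, diffInstanceGroup([]string{"node-1.c.my-project.internal"}, []string{"node-1"}) returns empty add/remove lists; without it, the same input removes and re-adds node-1, which is the churn described above.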
This issue has not been updated in over 1 year, and should be re-triaged.
You can:
/triage accepted (org members only)
/close
For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/
/remove-triage accepted
This issue is currently awaiting triage.
If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.
The triage/accepted label can be added by org members by writing /triage accepted in a comment.
@bowei for triage
Is this a BUG REPORT or FEATURE REQUEST?:
What happened: We are using GKE with an internal load balancer (cloud.google.com/load-balancer-type: "internal") and preemptible instances. Sometimes, when instances are recreated, the new instance is not added to the Google load balancer. We have an nginx ingress behind this balancer, so the cluster loses its ingress in that case.
What you expected to happen:
How to reproduce it (as minimally and precisely as possible):
--preemptible flag
cloud.google.com/load-balancer-type: "internal"
Anything else we need to know?: My assumption is that this happens when the instance with the ingress controller goes down, but I'm not sure. I have not found any traces in GKE/GCE logs with errors of any kind to debug this.
Google uses its own instance group to manage the internal load balancer. When this happens, the instance group is missing an instance (e.g. 2/2 instead of 3/3 instances are available). When this happens several times in a row, the instance group loses all its instances (0/0 available).
Environment:
Kubernetes version (use kubectl version):
Kernel (e.g. uname -a): ?
/sig gcp /sig node /sig cloud-provider