kubernetes-retired / federation

[EOL] Cluster Federation
Apache License 2.0

federation: Creating a Type LoadBalancer federated service and federated ingress can lead to resource leak #214

Closed: irfanurrehman closed this issue 6 years ago

irfanurrehman commented 6 years ago

Issue by nikhiljindal Monday Sep 11, 2017 at 23:34 GMT Originally opened as https://github.com/kubernetes/kubernetes/issues/52315


Steps to repro: Create a federated service of type LoadBalancer and a federated ingress, then delete them.

Expected: All GCP resources (health checks, firewall rules, instance groups, backend service, etc.) are deleted when the service and ingress are deleted.

Actual: The GCP health check and firewall rules are sometimes leaked.

Explanation: In Kubernetes release 1.7, the service controller was updated to create health checks and firewall rules whose names are derived from the providerID if it is set, and from the clusterID otherwise. The providerID is set by the federated ingress controller, so if it is set after the service controller has already created its GCP resources, the service controller starts looking for a different name on subsequent syncs. This race condition between the two controllers causes the service controller to leak the health check and firewall rule it originally created.
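
For illustration, a minimal Go sketch of the naming race described above (this is not the actual service-controller code; the function, name prefix, and IDs are made up):

```go
// Hypothetical sketch of the described naming scheme: the resource name is
// derived from providerID when it is set, otherwise from clusterID.
package main

import "fmt"

// resourceSuffix mirrors the fallback behavior described above (illustrative
// only; real resource names include other components).
func resourceSuffix(providerID, clusterID string) string {
	if providerID != "" {
		return providerID
	}
	return clusterID
}

func main() {
	clusterID := "cluster-abc" // made-up IDs for illustration

	// 1. The service controller creates the health check and firewall rule
	//    before the federated ingress controller has set a providerID.
	oldName := "k8s-healthcheck-" + resourceSuffix("", clusterID)

	// 2. The ingress controller later sets a providerID; on the next sync the
	//    service controller derives a different name and never cleans up the
	//    resources created under the old one.
	newName := "k8s-healthcheck-" + resourceSuffix("fed-provider-xyz", clusterID)

	fmt.Println(oldName, "!=", newName, "-> the originals are leaked on delete")
}
```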

Possible fixes:

irfanurrehman commented 6 years ago

Comment by nikhiljindal Monday Sep 11, 2017 at 23:44 GMT


cc @marun (current on call). This is one of the reasons why some of our e2e tests are failing. We have been leaking these resources for a long time; we have now run out of quota, and hence the tests started failing.

As a short-term fix, @madhusudancs is going to update his "clean leaked resources" script to also clean up health checks and firewall rules so that our tests don't run out of quota, but we need a proper fix for this.
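
For context, a minimal sketch of what such a cleanup pass could look like, assuming leaked resources can be identified by a known name prefix (the "k8s-" prefix and the project ID here are assumptions) and using the google.golang.org/api/compute/v1 client; this is not the actual test-infra script:

```go
package main

import (
	"context"
	"log"
	"strings"

	compute "google.golang.org/api/compute/v1"
)

func main() {
	ctx := context.Background()
	project := "my-gcp-project" // placeholder project ID

	svc, err := compute.NewService(ctx)
	if err != nil {
		log.Fatalf("creating compute client: %v", err)
	}

	// Delete leaked legacy HTTP health checks whose names carry the prefix.
	hcs, err := svc.HttpHealthChecks.List(project).Do()
	if err != nil {
		log.Fatalf("listing health checks: %v", err)
	}
	for _, hc := range hcs.Items {
		if strings.HasPrefix(hc.Name, "k8s-") {
			if _, err := svc.HttpHealthChecks.Delete(project, hc.Name).Do(); err != nil {
				log.Printf("deleting health check %s: %v", hc.Name, err)
			}
		}
	}

	// Delete leaked firewall rules with the same prefix.
	fws, err := svc.Firewalls.List(project).Do()
	if err != nil {
		log.Fatalf("listing firewall rules: %v", err)
	}
	for _, fw := range fws.Items {
		if strings.HasPrefix(fw.Name, "k8s-") {
			if _, err := svc.Firewalls.Delete(project, fw.Name).Do(); err != nil {
				log.Printf("deleting firewall rule %s: %v", fw.Name, err)
			}
		}
	}
}
```

A real cleanup job would also have to skip resources that are still referenced by live clusters before deleting anything.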

cc @kubernetes/sig-federation-bugs

irfanurrehman commented 6 years ago

Comment by madhusudancs Wednesday Sep 13, 2017 at 22:07 GMT


PR https://github.com/kubernetes/test-infra/pull/4545 is the short-term fix.

irfanurrehman commented 6 years ago

Comment by nikhiljindal Tuesday Oct 10, 2017 at 17:16 GMT


@kubernetes/sig-multicluster-feature-requests /sig multicluster

irfanurrehman commented 6 years ago

Comment by walteraa Tuesday Oct 10, 2017 at 17:27 GMT


@nikhiljindal The same is happening to DNS entries. I think when a federated service is deleted, all entries referring to this service should be deleted as well. What do you think?

irfanurrehman commented 6 years ago

Comment by nikhiljindal Tuesday Oct 10, 2017 at 18:49 GMT


@walteraa Please feel free to file an issue for that. Please include steps to repro if you are able to repro it consistently.

irfanurrehman commented 6 years ago

Comment by fejta-bot Thursday Jan 11, 2018 at 17:36 GMT


Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

Prevent issues from auto-closing with an /lifecycle frozen comment.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or @fejta. /lifecycle stale

irfanurrehman commented 6 years ago

Comment by kinghrothgar Thursday Jan 11, 2018 at 20:38 GMT


I believe we may be running into this. When we spin up a federated ingress-gce, one of the 4 regions randomly ends up with an extra unused backend:

(screenshot: 2018-01-11_15-19-45)

One region also always gets set to CPU Utilization instead of Rate, seemingly at random:

(screenshot: 2018-01-11_15-19-32)

Each time it is a different region. We are running 1.8.5-gke.0.

irfanurrehman commented 6 years ago

Comment by cmluciano Friday Jan 12, 2018 at 18:00 GMT


@kubernetes/sig-gcp-bugs

irfanurrehman commented 6 years ago

Comment by nikhiljindal Friday Jan 12, 2018 at 20:00 GMT


fwiw, you can now try out kubemci, a command line tool to set up multi-cluster ingresses: https://github.com/GoogleCloudPlatform/k8s-multicluster-ingress.

irfanurrehman commented 6 years ago

cc @nikhiljindal

fejta-bot commented 6 years ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle stale

fejta-bot commented 6 years ago

Stale issues rot after 30d of inactivity. Mark the issue as fresh with /remove-lifecycle rotten. Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle rotten /remove-lifecycle stale

fejta-bot commented 6 years ago

Rotten issues close after 30d of inactivity. Reopen the issue with /reopen. Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /close