kubernetes-retired / federation

[EOL] Cluster Federation
Apache License 2.0

federation: Creating a Type LoadBalancer federated service and federated ingress can lead to resource leak #214

Closed: irfanurrehman closed this issue 6 years ago

irfanurrehman commented 6 years ago

Issue by nikhiljindal Monday Sep 11, 2017 at 23:34 GMT Originally opened as https://github.com/kubernetes/kubernetes/issues/52315


Steps to repro: Create a federated service of type LoadBalancer and a federated ingress, then delete them.

Expected: All GCP resources (health checks, firewall rules, instance groups, backend service, etc.) are deleted when the service and ingress are deleted.

Actual: The GCP health check and firewall rules are sometimes leaked.

Explanation: In Kubernetes release 1.7, the service controller was updated to create health checks and firewall rules whose names are derived from the providerID if it is set, and from the clusterID otherwise. The providerID is set by the federated ingress controller, so if it is set after the service controller has already created its GCP resources, the service controller starts looking for a different name on subsequent syncs. This race condition between the two controllers causes the service controller to leak the health check and firewall rule it originally created.
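
For illustration, a minimal Go sketch of the naming race described above (this is not the actual service-controller code; the function, name prefix, and IDs are made up):

```go
// Hypothetical sketch of the described naming scheme: the resource name is
// derived from providerID when it is set, otherwise from clusterID.
package main

import "fmt"

// resourceSuffix mirrors the fallback behavior described above (illustrative
// only; real resource names include other components).
func resourceSuffix(providerID, clusterID string) string {
	if providerID != "" {
		return providerID
	}
	return clusterID
}

func main() {
	clusterID := "cluster-abc" // made-up IDs for illustration

	// 1. The service controller creates the health check and firewall rule
	//    before the federated ingress controller has set a providerID.
	oldName := "k8s-healthcheck-" + resourceSuffix("", clusterID)

	// 2. The ingress controller later sets a providerID; on the next sync the
	//    service controller derives a different name and never cleans up the
	//    resources created under the old one.
	newName := "k8s-healthcheck-" + resourceSuffix("fed-provider-xyz", clusterID)

	fmt.Println(oldName, "!=", newName, "-> the originals are leaked on delete")
}
```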

Possible fixes:

irfanurrehman commented 6 years ago

Comment by nikhiljindal Monday Sep 11, 2017 at 23:44 GMT


cc @marun (current on call). This is one of the reasons why some of our e2e tests are failing. We have been leaking these resources for a long time; we have now run out of quota, and hence the tests started failing.

As a short-term fix, @madhusudancs is going to update his "clean leaked resources" script to also clean up health checks and firewall rules so that our tests don't run out of quota, but we need a proper fix for this.
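
For context, a minimal sketch of what such a cleanup pass could look like, assuming leaked resources can be identified by a known name prefix (the "k8s-" prefix and the project ID here are assumptions) and using the google.golang.org/api/compute/v1 client; this is not the actual test-infra script:

```go
package main

import (
	"context"
	"log"
	"strings"

	compute "google.golang.org/api/compute/v1"
)

func main() {
	ctx := context.Background()
	project := "my-gcp-project" // placeholder project ID

	svc, err := compute.NewService(ctx)
	if err != nil {
		log.Fatalf("creating compute client: %v", err)
	}

	// Delete leaked legacy HTTP health checks whose names carry the prefix.
	hcs, err := svc.HttpHealthChecks.List(project).Do()
	if err != nil {
		log.Fatalf("listing health checks: %v", err)
	}
	for _, hc := range hcs.Items {
		if strings.HasPrefix(hc.Name, "k8s-") {
			if _, err := svc.HttpHealthChecks.Delete(project, hc.Name).Do(); err != nil {
				log.Printf("deleting health check %s: %v", hc.Name, err)
			}
		}
	}

	// Delete leaked firewall rules with the same prefix.
	fws, err := svc.Firewalls.List(project).Do()
	if err != nil {
		log.Fatalf("listing firewall rules: %v", err)
	}
	for _, fw := range fws.Items {
		if strings.HasPrefix(fw.Name, "k8s-") {
			if _, err := svc.Firewalls.Delete(project, fw.Name).Do(); err != nil {
				log.Printf("deleting firewall rule %s: %v", fw.Name, err)
			}
		}
	}
}
```

A real cleanup job would also have to skip resources that are still referenced by live clusters before deleting anything.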

cc @kubernetes/sig-federation-bugs

irfanurrehman commented 6 years ago

Comment by madhusudancs Wednesday Sep 13, 2017 at 22:07 GMT


PR https://github.com/kubernetes/test-infra/pull/4545 is the short-term fix.

irfanurrehman commented 6 years ago

Comment by nikhiljindal Tuesday Oct 10, 2017 at 17:16 GMT


@kubernetes/sig-multicluster-feature-requests /sig multicluster

irfanurrehman commented 6 years ago

Comment by walteraa Tuesday Oct 10, 2017 at 17:27 GMT


@nikhiljindal The same is happening to DNS entries. I think when a federated service is deleted, all entries referring to this service should be deleted as well. What do you think?

irfanurrehman commented 6 years ago

Comment by nikhiljindal Tuesday Oct 10, 2017 at 18:49 GMT


@walteraa Please feel free to file an issue for that. Please include steps to repro if you are able to repro it consistently.

irfanurrehman commented 6 years ago

Comment by fejta-bot Thursday Jan 11, 2018 at 17:36 GMT


Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

Prevent issues from auto-closing with an /lifecycle frozen comment.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or @fejta. /lifecycle stale

irfanurrehman commented 6 years ago

Comment by kinghrothgar Thursday Jan 11, 2018 at 20:38 GMT


I believe we may be running into this. When we spin up a federated ingress-gce, one of the 4 regions randomly ends up with an extra unused backend:

(screenshot: 2018-01-11_15-19-45)

One region also always gets set to CPU Utilization instead of Rate, seemingly at random:

(screenshot: 2018-01-11_15-19-32)

Each time it is a different region. We are running 1.8.5-gke.0.

irfanurrehman commented 6 years ago

Comment by cmluciano Friday Jan 12, 2018 at 18:00 GMT


@kubernetes/sig-gcp-bugs

irfanurrehman commented 6 years ago

Comment by nikhiljindal Friday Jan 12, 2018 at 20:00 GMT


fwiw, you can now try out kubemci, a command line tool to set up multi-cluster ingresses: https://github.com/GoogleCloudPlatform/k8s-multicluster-ingress.

irfanurrehman commented 6 years ago

cc @nikhiljindal

fejta-bot commented 6 years ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle stale

fejta-bot commented 6 years ago

Stale issues rot after 30d of inactivity. Mark the issue as fresh with /remove-lifecycle rotten. Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle rotten /remove-lifecycle stale

fejta-bot commented 6 years ago

Rotten issues close after 30d of inactivity. Reopen the issue with /reopen. Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /close