kubernetes-sigs / cluster-api

Home for Cluster API, a subproject of sig-cluster-lifecycle
https://cluster-api.sigs.k8s.io

Blocking network access to workload clusters causes reconcile queue to grow #8306

Closed: vignesh-goutham closed this issue 6 months ago

vignesh-goutham commented 1 year ago

What steps did you take and what happened?

I have a management cluster managing 10 workload clusters. These are tiny clusters with 1 control-plane node and 1 worker node each. I blocked network traffic from the management cluster to 4 of the 10 workload clusters. All the controllers take ~30 seconds to invalidate their caches and reflect the new CR statuses, which range from MHC failing, KCP unavailable, etc. I do see in the logs that controller-runtime times out trying to create a client + cache for the remote clusters that are blocked.

E0316 21:50:09.719081       1 controller.go:326] "Reconciler error" err="error creating client and cache for remote cluster: error creating dynamic rest mapper for remote cluster \"eksa-system/vgg-cloudstack-b\": Get \"https://10.80.180.51:6443/api?timeout=10s\": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)" controller="machine" controllerGroup="cluster.x-k8s.io" controllerKind="Machine" machine="eksa-system/vgg-cloudstack-b-nlwqk" namespace="eksa-system" name="vgg-cloudstack-b-nlwqk" reconcileID=bf9d8be2-13d8-4b55-a070-ed10249e0443

The clusters in this state are actually fine; all reconcile loops work as expected for the clusters that do have network connectivity to the management cluster.

Now, if I try to create another workload cluster, that's when things get interesting. I noticed all controllers take a long time to create their respective CRs and to update their status. For example, creating a new cluster takes about 5 minutes under normal conditions, but with this network block in place it takes upwards of 30 minutes. It takes about 5 minutes for a machine to go from pending to provisioned, even though a provider ID is assigned after 1 minute. I've observed this get worse as more clusters lose connectivity with the management cluster. The timeout for creating the client is 10 seconds, and this quickly adds up when multiple clusters have their connectivity blocked.
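As a rough back-of-the-envelope illustration (assuming each reconcile attempt against a blocked cluster holds a worker for the full 10-second client-creation timeout, as the log above suggests): with the default concurrency of 10, every in-flight item belonging to one of the 4 blocked clusters stalls one of the 10 workers for 10 seconds before it can fail and requeue. A handful of such items in flight at the same time leaves only a few workers for the healthy clusters, which lines up with cluster creation going from ~5 minutes to 30+ minutes.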

I have some metrics I pulled off the controllers in Grafana and attached here. Please note that the queue depth and unfinished work stay at 0 without any network blocks. Look how both of those spike up in the chart below.

Create during blocked nw

I noticed that there are concurrency flags for each controller, which all default to 10. I agree that setting this to some high number, depending on the environment, could help in this situation. That said, such a network block might occur in a prod environment due to incorrect network policies or other reasons, and it shouldn't cause issues for other cluster operations like create/upgrade/delete; it can also be hard to estimate the concurrency value required for an environment with a dynamic number of clusters.

I have 2 suggestions:

  1. Can we add an argument to expose and configure controller rate limiting?
  2. Can we expose arguments for the client configuration, like the timeout?

I'd prefer rate limiting over option 2, though; a rough sketch of what that could look like is below.
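For illustration, here is a minimal sketch of what option 1 could look like when wiring the Machine controller through controller-runtime. This is not CAPI's actual flag plumbing: the flag names, limiter values, and the `setupMachineController` helper are made up, and it targets the untyped workqueue API used around controller-runtime v0.14.

```go
package controllers

import (
	"flag"
	"time"

	"golang.org/x/time/rate"
	"k8s.io/client-go/util/workqueue"
	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/controller"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"
)

// Hypothetical flags -- today CAPI only exposes the concurrency knob.
var (
	machineConcurrency = flag.Int("machine-concurrency", 10, "number of Machines to reconcile in parallel")
	machineQueueQPS    = flag.Float64("machine-queue-qps", 10, "overall requeue rate for the Machine workqueue")
)

// setupMachineController wires a Machine controller with an explicit workqueue rate limiter.
func setupMachineController(mgr ctrl.Manager, r reconcile.Reconciler) error {
	// Combine a per-item exponential backoff with an overall token bucket: items from a few
	// unreachable clusters back off on their own, while the bucket caps how quickly anything
	// can be re-added to the queue.
	rateLimiter := workqueue.NewMaxOfRateLimiter(
		workqueue.NewItemExponentialFailureRateLimiter(5*time.Millisecond, 5*time.Minute),
		&workqueue.BucketRateLimiter{Limiter: rate.NewLimiter(rate.Limit(*machineQueueQPS), 100)},
	)

	return ctrl.NewControllerManagedBy(mgr).
		For(&clusterv1.Machine{}).
		WithOptions(controller.Options{
			MaxConcurrentReconciles: *machineConcurrency,
			RateLimiter:             rateLimiter,
		}).
		Complete(r)
}
```

The per-item exponential backoff lets objects on unreachable clusters back off on their own, while the bucket limiter caps how quickly anything can be re-added, so a burst of requeues from blocked clusters cannot monopolize the queue.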

I also tried pausing (spec.paused) all the clusters that had the network blocks. It took some time for the controllers to finish the jobs already queued up, but once the queue cleared, no operation spiked the queue depth again. Creating a cluster was as smooth as it was before the network block.

Post Spec pause + create cluster

Notice how the queue dropped to 0 and stayed there after the clusters were paused via spec.paused.
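For reference, pausing boils down to setting spec.paused on each Cluster object. Here is a minimal sketch of how that could be done with a controller-runtime client (the pauseClusters helper below is hypothetical; spec.paused is the real Cluster API field):

```go
package controllers

import (
	"context"

	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// pauseClusters is a hypothetical helper: it sets spec.paused on each given Cluster, which
// tells the CAPI controllers to skip reconciling those Clusters and their owned objects.
func pauseClusters(ctx context.Context, c client.Client, keys []client.ObjectKey) error {
	for _, key := range keys {
		cluster := &clusterv1.Cluster{}
		if err := c.Get(ctx, key, cluster); err != nil {
			return err
		}
		patch := client.MergeFrom(cluster.DeepCopy())
		cluster.Spec.Paused = true
		if err := c.Patch(ctx, cluster, patch); err != nil {
			return err
		}
	}
	return nil
}
```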

What did you expect to happen?

Expose a rate limiting option that could stop the queue from growing in situations like this.

Cluster API version

1.2.0

Kubernetes version

1.23

Anything else you would like to add?

No response

Label(s) to be applied

/kind bug

killianmuldoon commented 1 year ago

This is awesome work @vignesh-goutham! There have been a number of improvements to the ClusterCacheTracker since v1.2.0; the issues they fixed could account for this problem.

Would you be able to test this with a more up-to-date version of CAPI?

/triage accepted

vignesh-goutham commented 1 year ago

Thanks. Sure, I can test with CAPI 1.3.4. I'll post my findings once I have something.

sbueringer commented 1 year ago

@vignesh-goutham Thx for the extensive analysis. Specifically, this change (https://github.com/kubernetes-sigs/cluster-api/pull/7537), which was also backported to v1.2.x, might have already resolved this issue or at least improved the behavior.

I would recommend v1.3.5 instead of v1.3.4. There were 2 more ClusterCacheTracker fixes in that release (https://github.com/kubernetes-sigs/cluster-api/releases/tag/v1.3.5); they shouldn't impact this behavior, but they are definitely nice to have.

vignesh-goutham commented 1 year ago

Sorry for the late reply. I upgraded CAPI to 1.3.5, and it's so much better than 1.2.0. I ran some more tests with this setup to try to simulate a more stressed situation. The machine controller queue grew a bit (still not as bad as with 1.2.0) when I ran 2 workload cluster creates and 1 workload cluster upgrade in parallel. I did observe some delay, on the order of 2 extra minutes, versus what it took when running without blocking network access to 4 of the 11 workload clusters.

Screen Shot 2023-03-17 at 4 41 04 PM

Please ignore the dashboard name pointing to 1.3.4. The CAPI version was verified to be 1.3.5.

I let the clusters sit over the weekend and saw something very interesting. This setup has 1 management cluster and 13 workload clusters, with 4 workload clusters blocked from network access to the management cluster. The machine controller queue on occasion shot up to a depth of 45. I had verified that no new machines were rolled out and the environment was pretty stable as well. Any idea why this might have occurred? I still think implementing a controller rate limiter would be beneficial, especially since a situation like this has the potential to get worse with the many more clusters a production environment would run, say upwards of 50 workload clusters. Please take a look at the chart below.

Screen Shot 2023-03-20 at 10 24 00 AM

I'd love to hear any other suggestions to improve this condition as well. I can take a shot at implementing it.

sbueringer commented 1 year ago

Maybe it's just because all Machines are regularly reconciled, either because of the syncPeriod or because they are hitting a code path that requeues (e.g. there is one when we are not able to create a client to access the workload cluster: https://github.com/kubernetes-sigs/cluster-api/blob/c21b898de92f16a514835d5d05e4d2a3a98c11aa/internal/controllers/machine/machine_controller.go#L225-L229).
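For context, that requeue path boils down to a pattern roughly like the following (a paraphrased sketch, not the exact CAPI lines; the function name and the 1-minute delay are illustrative):

```go
package controllers

import (
	"context"
	"time"

	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
	"sigs.k8s.io/cluster-api/controllers/remote"
	"sigs.k8s.io/cluster-api/util"
	ctrl "sigs.k8s.io/controller-runtime"
)

// reconcileNodeSketch paraphrases the requeue-on-unreachable-cluster pattern: when a client
// for the workload cluster cannot be created, the object is requeued rather than erroring,
// so it keeps coming back into the workqueue until connectivity returns.
func reconcileNodeSketch(ctx context.Context, tracker *remote.ClusterCacheTracker, cluster *clusterv1.Cluster) (ctrl.Result, error) {
	remoteClient, err := tracker.GetClient(ctx, util.ObjectKey(cluster))
	if err != nil {
		// Workload cluster unreachable (e.g. blocked network): requeue after a delay.
		return ctrl.Result{RequeueAfter: 1 * time.Minute}, nil
	}
	_ = remoteClient // ...continue reconciling the Node via the remote client...
	return ctrl.Result{}, nil
}
```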

I wonder what logs you are seeing at the time of the spikes. I also wonder if there are a lot more spikes that are simply not visible in the monitoring, because the periodic metric scrapes miss spikes that happen in between scrapes.

fabriziopandini commented 6 months ago

/close

The versions discussed in this issue are now out of support. It would be great to repeat the same test with new releases (great work btw) and eventually report back in a new issue.

k8s-ci-robot commented 6 months ago

@fabriziopandini: Closing this issue.

In response to [this](https://github.com/kubernetes-sigs/cluster-api/issues/8306#issuecomment-2050380934):

> /close
> The versions discussed in this issue are now out of support. It would be great to repeat the same test with new releases (great work btw) and eventually report back in a new issue.

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.