Closed: lentzi90 closed this issue 1 year ago.
This is amazing work! There have been a lot of questions around scaling recently, so this is really useful, and definitely the best attempt so far at reproducible scale tests. I'm excited to see if I can get this running when I get time.
It would be interesting to be able to profile the CAPI controller code while this is running. BTW, what size machine was this running on, and was it resource contention or just slowness in the controller that caused the scaling issue?
Here are a couple of related issues for reference. https://github.com/kubernetes-sigs/cluster-api/issues/7308 https://github.com/kubernetes-sigs/cluster-api-provider-kubemark/issues/63
Thank you @killianmuldoon ! I'm following https://github.com/kubernetes-sigs/cluster-api-provider-kubemark/issues/63 with great interest! Should have thought to include it in the issue directly... Let me know if you have time to try it, and if there are any issues!
Most of these tests have been on a cloud VM with 32 GB memory and 8 core CPU. It didn't look like resource contention (except when I did the sharding, that maxed out the CPU).
To try to mitigate this, we attempted to set the expiration annotation on the KubeadmConfig, but unfortunately this caused some KCPs to start rollouts. It is unclear what is causing this.
If you don't set rolloutBefore.certificatesExpiryDays no rollouts should be triggered by the annotation
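For context, `rolloutBefore.certificatesExpiryDays` is a field on the KubeadmControlPlane spec. A sketch of explicitly opting one KCP into expiry-based rollouts (the namespace/name are taken from the examples in this thread; 21 days is just an illustrative value):

```shell
# Sketch: only when this field is set should certificate expiry factor
# into KCP's rollout decision. The value 21 is an example, not a default.
kubectl -n test-50 patch kcp test-50 --type=merge \
  -p '{"spec":{"rolloutBefore":{"certificatesExpiryDays":21}}}'
```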
First of all, great work. Nice to see that folks are starting to test CAPI at scale! :)
There is a long pause after the KCP is created before the Machines appear.
I assume with appear you mean that the Machine objects are not even created at this point?
It looks to me like you are running Cluster API in the default configuration. My first guess would be that increasing --kubeadmcontrolplane-concurrency
should improve the situation. The default is 10, which means KCP can only reconcile 10 KCP objects at the same time; all others have to wait.
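A sketch of one way to raise that flag (the deployment and namespace names match the `kubectl top pods` output later in this thread; the container index 0 is an assumption about the pod spec):

```shell
# Sketch: append a higher concurrency value to the KCP manager's args.
# Assumes the manager container is the first container in the pod spec.
kubectl -n capi-kubeadm-control-plane-system patch deployment \
  capi-kubeadm-control-plane-controller-manager --type=json -p='[
    {"op": "add",
     "path": "/spec/template/spec/containers/0/args/-",
     "value": "--kubeadmcontrolplane-concurrency=100"}
  ]'
```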
The next question would be what the 10 workers are actually doing. It might be that they are blocked in some way.
Thanks for the comment @sbueringer !
My first guess would be that increasing --kubeadmcontrolplane-concurrency should improve the situation.
This is a good point. I should probably have tried that before sharding... I will try it now and see if it helps. :slightly_smiling_face:
I assume with appear you mean that the Machine objects are not even created at this point?
Exactly! Here I managed to capture what it looks like. When the KubeadmControlPlane is 63 seconds old, there is no Machine and the SA and proxy secrets have just been created. After this the Machine (and Metal3Machine) appears.
$ kubectl -n test-50 get kcp,machine,m3m,secret,m3d
NAME                                                        CLUSTER   INITIALIZED   API SERVER AVAILABLE   REPLICAS   READY   UPDATED   UNAVAILABLE   AGE   VERSION
kubeadmcontrolplane.controlplane.cluster.x-k8s.io/test-50   test-50                                                                                   63s   v1.25.3

NAME                                   TYPE                      DATA   AGE
secret/test-50-apiserver-etcd-client   kubernetes.io/tls         2      65s
secret/test-50-ca                      kubernetes.io/tls         2      66s
secret/test-50-etcd                    kubernetes.io/tls         2      66s
secret/test-50-proxy                   cluster.x-k8s.io/secret   2      1s
secret/test-50-sa                      cluster.x-k8s.io/secret   2      1s
secret/worker-1-bmc-secret             Opaque                    2      66s
$ kubectl -n test-50 get kcp,machine,m3m,secret,m3d
NAME                                                        CLUSTER   INITIALIZED   API SERVER AVAILABLE   REPLICAS   READY   UPDATED   UNAVAILABLE   AGE   VERSION
kubeadmcontrolplane.controlplane.cluster.x-k8s.io/test-50   test-50                                                                                   67s   v1.25.3

NAME                                     CLUSTER   NODENAME   PROVIDERID   PHASE   AGE   VERSION
machine.cluster.x-k8s.io/test-50-fg498   test-50                                   0s    v1.25.3

NAME                                                                       AGE   PROVIDERID   READY   CLUSTER   PHASE
metal3machine.infrastructure.cluster.x-k8s.io/test-50-controlplane-x6rnj   0s                         test-50

NAME                                   TYPE                      DATA   AGE
secret/test-50-apiserver-etcd-client   kubernetes.io/tls         2      69s
secret/test-50-ca                      kubernetes.io/tls         2      70s
secret/test-50-etcd                    kubernetes.io/tls         2      70s
secret/test-50-kubeconfig              cluster.x-k8s.io/secret   1      3s
secret/test-50-proxy                   cluster.x-k8s.io/secret   2      5s
secret/test-50-sa                      cluster.x-k8s.io/secret   2      5s
secret/worker-1-bmc-secret             Opaque                    2      70s
To try to mitigate this, we attempted to set the expiration annotation on the KubeadmConfig, but unfortunately this caused some KCPs to start rollouts. It is unclear what is causing this.
If you don't set rolloutBefore.certificatesExpiryDays no rollouts should be triggered by the annotation
Thanks for confirming! I don't think it is really the annotation that triggers it. The controller just never makes it to the rollout part of the code until the annotation is set. So something is triggering rollout of some control planes, and this only becomes visible after the annotation is added.
I would totally understand if all of them rolled out, but that is never the case. Specifically, the first batch seems more "resistant" to rollouts. In later batches there are quite a lot of rollouts, but the first is usually fine. This makes me think there is some kind of race going on, but I haven't figured out what.
As to what could be causing them to roll out, the number of replicas is wrong because of how I fake things:
$ kubectl get kcp -A
NAMESPACE   NAME       CLUSTER    INITIALIZED   API SERVER AVAILABLE   REPLICAS   READY   UPDATED   UNAVAILABLE   AGE    VERSION
metal3      test       test       true          true                   1          241     1         -240          120m   v1.25.3
test-1      test-1     test-1     true          true                   1          241     1         -240          86m    v1.25.3
test-10     test-10    test-10    true          true                   1          241               -240          86m    v1.25.3
test-100    test-100   test-100   true          true                   1          241     0         -240          74m    v1.25.3
test-101    test-101   test-101   true          true                   1          241     1         -240          71m    v1.25.3
test-102    test-102   test-102   true          true                   1          241     1         -240          71m    v1.25.3
test-103    test-103   test-103   true          true                   1          241     1         -240          71m    v1.25.3
Now that I think about it, it probably has something to do with the UPDATED
count. I haven't been able to figure out why some have 1, some have 0 and some nothing at all.
It looks to me like you are running Cluster API in the default configuration. My first guess would be that increasing --kubeadmcontrolplane-concurrency should improve the situation. The default is 10, which means KCP can only reconcile 10 KCP objects at the same time. All others have to wait.
I have now tried with more concurrency (100 seems optimal for what I'm doing) and this helped cut the time from 135 minutes to 70 minutes for 300 clusters in batches of 10! :tada: It still becomes slower and slower the more clusters you add but this is still a good improvement.
I went on and scaled up to 500 clusters (from 300 to 500 took around 2 hours). Then I set it to scale to 800, but unfortunately it ran out of memory before it could reach it. I will try with a bigger VM and see if I can reach 1000 overnight.
By the way, if anyone has any clues about the KCP rollout and certificate expiration I'm all ears! I think if we can solve that we would get a completely "normal" reconciliation and that would make me much more confident with the results.
/triage accepted Great research work! The next step is to translate this into actionable improvements in KCP or other controllers. Metrics and the work on logs will help in doing so, but this is an awesome start
Another progress report!
I have found the cause of the KCP rollout issue. It was stupid really. Because all workload clusters shared the same API server, it would sometimes happen that CAPI detected that the cluster was already initialized before generating the kubeadm config. In this case it would naturally generate a join config instead of an init config and this triggered the rollout because the kubeadm config didn't match what was expected. :facepalm:
Well, then I decided to solve this "properly" by giving each workload cluster their own API server. First I did it with both etcd and API server for each and then to conserve resources I set up a multi-tenant etcd to back all the API servers. (Learning a lot here! :sweat_smile: ) This works great! CAPI is very happy with the workload clusters now that they don't have any extra Nodes from sharing the API server.
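One way to back many API servers with a single shared etcd is the kube-apiserver `--etcd-prefix` flag, which gives each cluster its own keyspace (whether the experiment used exactly this mechanism is an assumption; the host name `shared-etcd` is illustrative):

```shell
# Sketch: point each fake workload cluster's API server at the shared
# etcd, but isolate its data under a per-cluster prefix.
# The flag's default is /registry.
kube-apiserver \
  --etcd-servers=https://shared-etcd:2379 \
  --etcd-prefix=/registry-test-50
  # ...remaining API server flags (certs, service-account keys, etc.) elided
```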
However, there is a reason why I did it initially with only one API server: they are memory-hungry! Each API server (so each workload cluster, essentially) uses over 200 Mi of memory when it starts. To get to 1000 clusters with this would require more than 200 Gi of memory just for the API servers. Not nice. I have scaled it to 100 clusters on my laptop without issue but any more starts hitting the limits. My plan now is to try it in a larger VM and see if I can get to 300 to compare cluster creation time with what I had before.
If you have any ideas for how to lower the memory usage of these API servers, or maybe other ways to fake them, I would love to hear about it!
Here is a link to the latest experiment setup in case you are interested: https://github.com/Nordix/metal3-clusterapi-docs/tree/lentzi90/scaling-experiments-v2/metal3-scaling-experiments
@lentzi90 those are great insights!
And we definitely need to join forces given that this work is relevant for the entire community cc @sbueringer @killianmuldoon
If you have any ideas for how to lower the memory usage of these API servers, or maybe other ways to fake them, I would love to hear about it!
I don't think we can fit as much as we want on a single machine, and continuing down this path also introduces other issues, like noise from the fake workload clusters affecting the management cluster and the test results.
What we are doing in the kubemark provider is introducing the idea of "backing clusters": one or more external clusters that provide the computing power needed to run all the fake workload clusters (Kubernetes scale tests use a similar approach).
By moving fake clusters to external clusters you can potentially scale indefinitely, and you can always fall back to testing everything on one machine when you are working with less power-hungry test targets or for smoke tests.
Then to conserve resources I set up a multi-tenant etcd to back all the API servers. (Learning a lot here! 😅 )
That's a great idea, we should definitely embed this on https://github.com/kubernetes-sigs/cluster-api-provider-kubemark/issues/63 as soon as I get to implementing the control plane part, if you don't get there before me 😜
Thanks for the idea about "backing clusters"! With this I managed to get to 1000 clusters :tada: and found some things to improve along the way. I have pushed my notes (same link/branch as before). It is not quite as well documented this time, but I will follow up with issues and more investigations for the missing parts when I have the time. And I will make an attempt to describe what I did here also.
With the backing clusters working, I set out to see how far I could get and compare performance with previous runs. Performance was looking good at first! I got to 300 clusters without issues, but after 400 things slowed down. I had forgotten about concurrency. :facepalm: Fixed that and went on to 500, 600, nice and fast! Obviously the previous experiment setup had affected the performance in earlier attempts because this was much faster.
Then I hit the wall. From 600 to 700 clusters took more than 2 hours! At first I thought this could be a limit on the CAPM3 side (where we didn't have a concurrency flag). But why would it show up so suddenly? I checked the logs of all the controllers and noticed some "client-side throttling", like this:
I0302 15:50:57.343130 1 request.go:601] Waited for 4.870221494s due to client-side throttling, not priority and fairness, request: GET:https://10.96.0.1:443/api/v1/namespaces/test-697/secrets/test-697-sa
This sounded more probable to me and could very well show up suddenly. When I realized that the default sync period is 10 minutes = 600 seconds, and the issue appeared right when I went beyond 600 clusters (= 1 cluster/second), I was sure this must be it.
The default in client-go is 10 QPS, but I believe CAPI gets the default from controller-runtime, where it is 20 QPS.
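As a quick sanity check on that hunch, the arithmetic with the numbers from this thread (the ~5 API requests per reconcile is a rough assumption, not a measured figure):

```shell
# With a 600 s resync period, N clusters generate about N/600 reconciles
# per second from resync alone, before any new clusters are created.
clusters=600
resync_seconds=600
qps_limit=20   # controller-runtime default mentioned above

echo "resync reconciles/s: $((clusters / resync_seconds))"
# Assuming roughly 5 API requests per reconcile, the client rate limit
# supports about qps_limit/5 reconciles per second:
echo "sustainable reconciles/s at qps=$qps_limit: $((qps_limit / 5))"
```

So at 600 clusters the resync load alone starts eating a meaningful share of the budget, and anything on top of it gets throttled.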
I set up tilt and managed to change these values (set them to 200 QPS for all controllers for this experiment). This helped and I managed to scale faster again. However, I had to set up a much larger VM for the management cluster since the KCP controller basically ate all the CPU it could get with these higher rate limits. :open_mouth: :fire:
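For anyone reproducing this without Tilt: recent CAPI managers expose the rate limits as flags (`--kube-api-qps` / `--kube-api-burst`; availability in older releases, and container index 0, are assumptions). A sketch for the core manager:

```shell
# Sketch: raise the client-side rate limits on the core CAPI manager.
kubectl -n capi-system patch deployment capi-controller-manager \
  --type=json -p='[
    {"op": "add", "path": "/spec/template/spec/containers/0/args/-",
     "value": "--kube-api-qps=200"},
    {"op": "add", "path": "/spec/template/spec/containers/0/args/-",
     "value": "--kube-api-burst=300"}
  ]'
```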
With this setup I managed to get to 1000 clusters. Here are the timings: With higher rate limits and concurrency for CAPI. Management cluster running on 32C-64GB VM.
And some metrics (note that the KCP controller reached this level of CPU usage long before reaching 1000 clusters!):
❯ kubectl top pods -A
NAMESPACE                           NAME                                                             CPU(cores)   MEMORY(bytes)
baremetal-operator-system           baremetal-operator-controller-manager-64c5489695-n9bhp           35m          76Mi
capi-kubeadm-bootstrap-system       capi-kubeadm-bootstrap-controller-manager-c99b96648-rprvr        104m         71Mi
capi-kubeadm-control-plane-system   capi-kubeadm-control-plane-controller-manager-7c5fc49c58-4qm6r   16531m       1898Mi
capi-system                         capi-controller-manager-5cf7775bb4-68sr4                         2525m        1322Mi
capm3-system                        capm3-controller-manager-669989d4-w6st2                          454m         352Mi
capm3-system                        ipam-controller-manager-65fc446776-8pcfk                         2m           15Mi
Also worth noting that this is still with external etcd, so the KCP controller is affected by that and I would expect even higher usage with "internal" etcd, but who knows. :shrug:
Take away:
Those are again great insights, @lentzi90, thanks for sharing. I agree that KCP CPU and memory consumption is something to be investigated. I'm not sure about the correlation between the number of clusters and QPS, because my assumption is that a cluster will "stop" being reconciled as soon as it is provisioned, so it should not clog the reconcile queue (with the exception of the resync event every 10 minutes). But this is where the work on metrics becomes relevant for finding bottlenecks and also for explaining why provisioning time is degrading.
If this is ok for you, it would be great to set up some time to discuss the possible next steps of this work, and possibly how to upstream it. We can also discuss this at KubeCon if you are planning to make it; otherwise, I will be happy to set something up using the CAPI project Zoom, so we can record it and share with the other members of the community.
cc @richardcase who might be interested in the discussion as well
I'd be very happy to set up a time to discuss it further! Let's sync on slack. I'm also planning to attend KubeCon so that would be a great option! :slightly_smiling_face:
I'm not sure about the correlation between the number of clusters and qps, because my assumption is that a cluster will "stop" being reconciled as soon as it is provisioned, so it should not clog the reconcile queue (with the exception of resync event every 10 minutes).
With enough clusters that resync every 10 minutes adds up! With 600 clusters it becomes on average 1 cluster every second. Combining that with more clusters being created all the time, I think it can be a limiting factor. It is very suspicious to me that the slowdown happened right at 600 clusters, since it is a multiple of the resync interval (600 seconds). This is why I think there is a correlation. If I get time to check it, I would try a different interval, e.g. 300 seconds, and see if it then becomes slow at 300 clusters.
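That experiment should be cheap to run, since the resync interval is a manager flag (`--sync-period`, default 10m, on the CAPI managers; container index 0 is an assumption). A sketch:

```shell
# Sketch: halve the resync period. If the resync hypothesis is right,
# the slowdown knee should move from ~600 clusters to ~300.
kubectl -n capi-system patch deployment capi-controller-manager \
  --type=json -p='[
    {"op": "add", "path": "/spec/template/spec/containers/0/args/-",
     "value": "--sync-period=5m"}
  ]'
```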
Might be worth looking at whether the work queue length metric shows suspicious behavior (e.g. goes up non-linearly at some point)
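A concrete way to check that (`workqueue_depth` is the controller-runtime default workqueue metric; the metrics port and the `name` label value are assumptions about this particular deployment):

```shell
# Sketch: forward the KCP manager's metrics port and inspect queue depth.
kubectl -n capi-kubeadm-control-plane-system port-forward \
  deployment/capi-kubeadm-control-plane-controller-manager 8080:8080 &
curl -s localhost:8080/metrics | grep '^workqueue_depth'
# Look for a line like: workqueue_depth{name="kubeadmcontrolplane"} <N>
# and watch whether N grows faster than linearly as clusters are added.
```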
cc @richardcase who might be interested in the discussion as well
Thanks @fabriziopandini , very timely. This is gold @lentzi90 , great work and super helpful.
This is extremely useful/helpful to me, thank you for the hard work!
/reopen
I assume closing this issue with #8579 was not intended
@sbueringer: Reopened this issue.
For me it is ok to close it but if you want to keep it for more discussions that is also fine :slightly_smiling_face: I know there is work on-going with adding scalability e2e tests, maybe that can be tracked here? I marked the PR as fixing this since it solved the main blocker for my use case, but there is definitely more to do
Absolutely up to you :). If you want to continue and bring up other issues feel free to do it either here or in a separate issue. (feel free to close again)
We'll definitely track the scalability e2e test in a separate issue.
Thanks! I'll close this then. It will be easier to track specific issues that way. This has been a great discussion and exploration, thanks!
This is not really a bug, more like a performance issue. We hope to scale CAPI to thousands of workload clusters managed by a single management cluster. Before doing this with real hardware we are trying to check that the controllers can handle it. For a single workload cluster with 1000 Machines, this was no issue at all, but when trying to create hundreds of single node clusters things get very slow.
What steps did you take and what happened:
The experiment setup, including all scripts, can be found here. In short, it uses clusterctl as normal.
The simulation is not perfect, and perhaps this is impacting the performance. I have not been able to confirm or rule this out. What I have found is this:
Performance:
The bottleneck seems to be the kubeadm control plane provider. There is a long pause after the KCP is created before the Machines appear. To mitigate this, I tried sharding by running one kubeadm control plane controller for each namespace, and grouping the workload clusters into these namespaces (10 namespaces with 10 clusters each). The controllers basically ate the CPU and everything became slow. Maybe it is still the way to go, just with more CPU or fewer shards?
What did you expect to happen:
I was hoping to be able to reach 1000 workload clusters in a "reasonable" time and that creating new clusters would not take several minutes.
Anything else you would like to add:
I just want to highlight again that the simulation is not perfect. If you have ideas for how to improve it, or ways to check if it is impacting the performance, I would be very happy to hear about it.
Environment:
Kubernetes version (use kubectl version): v1.25.3 (kind cluster), v1.26.1 (kubectl)
OS (e.g. from /etc/os-release): Ubuntu 22.04
/kind bug