Closed: lentzi90 closed this issue 1 year ago.
This is amazing work! There have been a lot of questions around scaling recently, so this is really useful, and definitely the best attempt so far at reproducible scale tests. I'm excited to see if I can get this running when I get time.
It would be interesting to be able to profile the CAPI controller code while this is running. BTW, what size machine was this running on, and was it resource contention or just slowness in the controller that caused the scaling issue?
Here are a couple of related issues for reference. https://github.com/kubernetes-sigs/cluster-api/issues/7308 https://github.com/kubernetes-sigs/cluster-api-provider-kubemark/issues/63
Thank you @killianmuldoon ! I'm following https://github.com/kubernetes-sigs/cluster-api-provider-kubemark/issues/63 with great interest! Should have thought to include it in the issue directly... Let me know if you have time to try it, and if there are any issues!
Most of these tests have been on a cloud VM with 32 GB memory and 8 core CPU. It didn't look like resource contention (except when I did the sharding, that maxed out the CPU).
To try to mitigate this, we attempted to set the expiration annotation on the KubeadmConfig, but unfortunately this caused some KCPs to start rollouts. It is unclear what is causing this.
If you don't set rolloutBefore.certificatesExpiryDays no rollouts should be triggered by the annotation
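For context, `rolloutBefore.certificatesExpiryDays` is a field on the KubeadmControlPlane spec. A sketch of explicitly opting one KCP into expiry-based rollouts (the namespace/name are taken from the examples in this thread; 21 days is just an illustrative value):

```shell
# Sketch: only when this field is set should certificate expiry factor
# into KCP's rollout decision. The value 21 is an example, not a default.
kubectl -n test-50 patch kcp test-50 --type=merge \
  -p '{"spec":{"rolloutBefore":{"certificatesExpiryDays":21}}}'
```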
First of all, great work. Nice to see that folks are starting to test CAPI at scale! :)
There is a long pause after the KCP is created before the Machines appear.
I assume with appear you mean that the Machine objects are not even created at this point?
It looks to me like you are running Cluster API in the default configuration. My first guess would be that increasing --kubeadmcontrolplane-concurrency
should improve the situation. The default is 10, which means KCP can only reconcile 10 KCP objects at the same time; all others have to wait.
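A sketch of one way to raise that flag (the deployment and namespace names match the `kubectl top pods` output later in this thread; the container index 0 is an assumption about the pod spec):

```shell
# Sketch: append a higher concurrency value to the KCP manager's args.
# Assumes the manager container is the first container in the pod spec.
kubectl -n capi-kubeadm-control-plane-system patch deployment \
  capi-kubeadm-control-plane-controller-manager --type=json -p='[
    {"op": "add",
     "path": "/spec/template/spec/containers/0/args/-",
     "value": "--kubeadmcontrolplane-concurrency=100"}
  ]'
```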
The next question would be what the 10 workers are actually doing. It might be that they are blocked in some way.
Thanks for the comment @sbueringer !
My first guess would be that increasing --kubeadmcontrolplane-concurrency should improve the situation.
This is a good point. I should probably have tried that before sharding... I will try it now and see if it helps. :slightly_smiling_face:
I assume with appear you mean that the Machine objects are not even created at this point?
Exactly! Here I managed to capture what it looks like. When the KubeadmControlPlane is 63 seconds old, there is no Machine and the SA and proxy secrets have just been created. After this the Machine (and Metal3Machine) appears.
$ kubectl -n test-50 get kcp,machine,m3m,secret,m3d
NAME                                                        CLUSTER   INITIALIZED   API SERVER AVAILABLE   REPLICAS   READY   UPDATED   UNAVAILABLE   AGE   VERSION
kubeadmcontrolplane.controlplane.cluster.x-k8s.io/test-50   test-50                                                                                   63s   v1.25.3

NAME                                   TYPE                      DATA   AGE
secret/test-50-apiserver-etcd-client   kubernetes.io/tls         2      65s
secret/test-50-ca                      kubernetes.io/tls         2      66s
secret/test-50-etcd                    kubernetes.io/tls         2      66s
secret/test-50-proxy                   cluster.x-k8s.io/secret   2      1s
secret/test-50-sa                      cluster.x-k8s.io/secret   2      1s
secret/worker-1-bmc-secret             Opaque                    2      66s
$ kubectl -n test-50 get kcp,machine,m3m,secret,m3d
NAME                                                        CLUSTER   INITIALIZED   API SERVER AVAILABLE   REPLICAS   READY   UPDATED   UNAVAILABLE   AGE   VERSION
kubeadmcontrolplane.controlplane.cluster.x-k8s.io/test-50   test-50                                                                                   67s   v1.25.3

NAME                                     CLUSTER   NODENAME   PROVIDERID   PHASE   AGE   VERSION
machine.cluster.x-k8s.io/test-50-fg498   test-50                                   0s    v1.25.3

NAME                                                                       AGE   PROVIDERID   READY   CLUSTER   PHASE
metal3machine.infrastructure.cluster.x-k8s.io/test-50-controlplane-x6rnj   0s                         test-50

NAME                                   TYPE                      DATA   AGE
secret/test-50-apiserver-etcd-client   kubernetes.io/tls         2      69s
secret/test-50-ca                      kubernetes.io/tls         2      70s
secret/test-50-etcd                    kubernetes.io/tls         2      70s
secret/test-50-kubeconfig              cluster.x-k8s.io/secret   1      3s
secret/test-50-proxy                   cluster.x-k8s.io/secret   2      5s
secret/test-50-sa                      cluster.x-k8s.io/secret   2      5s
secret/worker-1-bmc-secret             Opaque                    2      70s
To try to mitigate this, we attempted to set the expiration annotation on the KubeadmConfig, but unfortunately this caused some KCPs to start rollouts. It is unclear what is causing this.
If you don't set rolloutBefore.certificatesExpiryDays no rollouts should be triggered by the annotation
Thanks for confirming! I don't think it is really the annotation that triggers it. The controller just never makes it to the rollout part of the code until the annotation is set. So something is triggering rollout of some control planes, and this only becomes visible after the annotation is added.
I would totally understand if all of them rolled out, but that is never the case. Specifically, the first batch seems more "resistant" to rollouts. In later batches there are quite a lot of rollouts, but the first is usually fine. This makes me think there is some kind of race going on, but I haven't figured out what.
As to what could be causing them to roll out, the number of replicas is wrong because of how I fake things:
$ kubectl get kcp -A
NAMESPACE   NAME       CLUSTER    INITIALIZED   API SERVER AVAILABLE   REPLICAS   READY   UPDATED   UNAVAILABLE   AGE    VERSION
metal3      test       test       true          true                   1          241     1         -240          120m   v1.25.3
test-1      test-1     test-1     true          true                   1          241     1         -240          86m    v1.25.3
test-10     test-10    test-10    true          true                   1          241               -240          86m    v1.25.3
test-100    test-100   test-100   true          true                   1          241     0         -240          74m    v1.25.3
test-101    test-101   test-101   true          true                   1          241     1         -240          71m    v1.25.3
test-102    test-102   test-102   true          true                   1          241     1         -240          71m    v1.25.3
test-103    test-103   test-103   true          true                   1          241     1         -240          71m    v1.25.3
Now that I think about it, it probably has something to do with the UPDATED
count. I haven't been able to figure out why some have 1, some have 0 and some nothing at all.
It looks to me like you are running Cluster API in the default configuration. My first guess would be that increasing --kubeadmcontrolplane-concurrency should improve the situation. The default is 10, which means KCP can only reconcile 10 KCP objects at the same time. All others have to wait.
I have now tried with more concurrency (100 seems optimal for what I'm doing) and this helped cut the time from 135 minutes to 70 minutes for 300 clusters in batches of 10! :tada: It still becomes slower and slower the more clusters you add but this is still a good improvement.
I went on and scaled up to 500 clusters (from 300 to 500 took around 2 hours). Then I set it to scale to 800, but unfortunately it ran out of memory before it could reach it. I will try with a bigger VM and see if I can reach 1000 overnight.
By the way, if anyone has any clues about the KCP rollout and certificate expiration I'm all ears! I think if we can solve that we would get a completely "normal" reconciliation and that would make me much more confident with the results.
/triage accepted Great research work! The next step is to translate this into actionable improvements in KCP or other controllers. Metrics and the work on logs will help in doing so, but this is an awesome start
Another progress report!
I have found the cause of the KCP rollout issue. It was stupid really. Because all workload clusters shared the same API server, it would sometimes happen that CAPI detected that the cluster was already initialized before generating the kubeadm config. In this case it would naturally generate a join config instead of an init config and this triggered the rollout because the kubeadm config didn't match what was expected. :facepalm:
Well, then I decided to solve this "properly" by giving each workload cluster their own API server. First I did it with both etcd and API server for each and then to conserve resources I set up a multi-tenant etcd to back all the API servers. (Learning a lot here! :sweat_smile: ) This works great! CAPI is very happy with the workload clusters now that they don't have any extra Nodes from sharing the API server.
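One way to back many API servers with a single shared etcd is the kube-apiserver `--etcd-prefix` flag, which gives each cluster its own keyspace (whether the experiment used exactly this mechanism is an assumption; the host name `shared-etcd` is illustrative):

```shell
# Sketch: point each fake workload cluster's API server at the shared
# etcd, but isolate its data under a per-cluster prefix.
# The flag's default is /registry.
kube-apiserver \
  --etcd-servers=https://shared-etcd:2379 \
  --etcd-prefix=/registry-test-50
  # ...remaining API server flags (certs, service-account keys, etc.) elided
```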
However, there is a reason why I did it initially with only one API server: they are memory-hungry! Each API server (so each workload cluster, essentially) uses over 200 Mi of memory when it starts. To get to 1000 clusters with this would require more than 200 Gi of memory just for the API servers. Not nice. I have scaled it to 100 clusters on my laptop without issue but any more starts hitting the limits. My plan now is to try it in a larger VM and see if I can get to 300 to compare cluster creation time with what I had before.
If you have any ideas for how to lower the memory usage of these API servers, or maybe other ways to fake them, I would love to hear about it!
Here is a link to the latest experiment setup in case you are interested: https://github.com/Nordix/metal3-clusterapi-docs/tree/lentzi90/scaling-experiments-v2/metal3-scaling-experiments
@lentzi90 those are great insights!
And we definitely need to join forces given that this work is relevant for the entire community cc @sbueringer @killianmuldoon
If you have any ideas for how to lower the memory usage of these API servers, or maybe other ways to fake them, I would love to hear about it!
I don't think we can fit as much as we want on a single machine, and continuing down this path also introduces other issues, like noise from the fake workload clusters affecting the management cluster and the test results.
What we are doing in the kubemark provider is introducing the idea of "backing clusters": one or more external clusters that provide the computing power needed to run all the fake workload clusters (Kubernetes scale tests use a similar approach).
By moving fake clusters to external clusters you can potentially scale indefinitely, and you can always fall back to testing everything on one machine when you are working with less power-hungry test targets or for smoke tests.
Then to conserve resources I set up a multi-tenant etcd to back all the API servers. (Learning a lot here! 😅 )
That's a great idea, we should definitely embed this on https://github.com/kubernetes-sigs/cluster-api-provider-kubemark/issues/63 as soon as I get to implementing the control plane part, if you don't get there before me 😜
Thanks for the idea about "backing clusters"! With this I managed to get to 1000 clusters :tada: and found some things to improve along the way. I have pushed my notes (same link/branch as before). It is not quite as well documented this time, but I will follow up with issues and more investigations for the missing parts when I have the time. And I will make an attempt to describe what I did here also.
With the backing clusters working, I set out to see how far I could get and compare performance with previous runs. Performance was looking good at first! I got to 300 clusters without issues, but after 400 things slowed down. I had forgotten about concurrency. :facepalm: Fixed that and went on to 500, 600, nice and fast! Obviously the previous experiment setup had affected the performance in earlier attempts because this was much faster.
Then I hit the wall. From 600 to 700 clusters took more than 2 hours! At first I thought this could be a limit on the CAPM3 side (where we didn't have a concurrency flag). But why would it show up so suddenly? I checked the logs of all the controllers and noticed some "client-side throttling", like this:
I0302 15:50:57.343130 1 request.go:601] Waited for 4.870221494s due to client-side throttling, not priority and fairness, request: GET:https://10.96.0.1:443/api/v1/namespaces/test-697/secrets/test-697-sa
This sounded more probable to me and could very well show up suddenly. When I realized that the default sync period is 10 minutes = 600 seconds, and the issue appeared right when I went beyond 600 clusters (= 1 cluster/second), I was sure this must be it.
The default in client-go is 10 QPS, but I believe CAPI gets the default from controller-runtime, where it is 20 QPS.
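As a quick sanity check on that hunch, the arithmetic with the numbers from this thread (the ~5 API requests per reconcile is a rough assumption, not a measured figure):

```shell
# With a 600 s resync period, N clusters generate about N/600 reconciles
# per second from resync alone, before any new clusters are created.
clusters=600
resync_seconds=600
qps_limit=20   # controller-runtime default mentioned above

echo "resync reconciles/s: $((clusters / resync_seconds))"
# Assuming roughly 5 API requests per reconcile, the client rate limit
# supports about qps_limit/5 reconciles per second:
echo "sustainable reconciles/s at qps=$qps_limit: $((qps_limit / 5))"
```

So at 600 clusters the resync load alone starts eating a meaningful share of the budget, and anything on top of it gets throttled.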
I set up tilt and managed to change these values (set them to 200 QPS for all controllers for this experiment). This helped and I managed to scale faster again. However, I had to set up a much larger VM for the management cluster since the KCP controller basically ate all the CPU it could get with these higher rate limits. :open_mouth: :fire:
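For anyone reproducing this without Tilt: recent CAPI managers expose the rate limits as flags (`--kube-api-qps` / `--kube-api-burst`; availability in older releases, and container index 0, are assumptions). A sketch for the core manager:

```shell
# Sketch: raise the client-side rate limits on the core CAPI manager.
kubectl -n capi-system patch deployment capi-controller-manager \
  --type=json -p='[
    {"op": "add", "path": "/spec/template/spec/containers/0/args/-",
     "value": "--kube-api-qps=200"},
    {"op": "add", "path": "/spec/template/spec/containers/0/args/-",
     "value": "--kube-api-burst=300"}
  ]'
```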
With this setup I managed to get to 1000 clusters. Here are the timings: With higher rate limits and concurrency for CAPI. Management cluster running on 32C-64GB VM.
And some metrics (note that the KCP controller reached this level of CPU usage long before reaching 1000 clusters!):
❯ kubectl top pods -A
NAMESPACE                           NAME                                                             CPU(cores)   MEMORY(bytes)
baremetal-operator-system           baremetal-operator-controller-manager-64c5489695-n9bhp           35m          76Mi
capi-kubeadm-bootstrap-system       capi-kubeadm-bootstrap-controller-manager-c99b96648-rprvr        104m         71Mi
capi-kubeadm-control-plane-system   capi-kubeadm-control-plane-controller-manager-7c5fc49c58-4qm6r   16531m       1898Mi
capi-system                         capi-controller-manager-5cf7775bb4-68sr4                         2525m        1322Mi
capm3-system                        capm3-controller-manager-669989d4-w6st2                          454m         352Mi
capm3-system                        ipam-controller-manager-65fc446776-8pcfk                         2m           15Mi
Also worth noting that this is still with external etcd, so the KCP controller is affected by that and I would expect even higher usage with "internal" etcd, but who knows. :shrug:
Take away:
Those are again great insights, @lentzi90, thanks for sharing. I agree that KCP CPU and memory consumption is something to be investigated. I'm not sure about the correlation between the number of clusters and QPS, because my assumption is that a cluster will "stop" being reconciled as soon as it is provisioned, so it should not clog the reconcile queue (with the exception of the resync event every 10 minutes). But this is where the work on metrics becomes relevant for finding bottlenecks and also for explaining why provisioning time is degrading.
If this is ok for you, it would be great to set up some time to discuss the possible next steps of this work, and possibly how to upstream it. We can also discuss this at KubeCon if you are planning to make it; otherwise, I will be happy to set something up using the CAPI project Zoom, so we can record it and share with the other members of the community.
cc @richardcase who might be interested in the discussion as well
I'd be very happy to set up a time to discuss it further! Let's sync on slack. I'm also planning to attend KubeCon so that would be a great option! :slightly_smiling_face:
I'm not sure about the correlation between the number of clusters and qps, because my assumption is that a cluster will "stop" being reconciled as soon as it is provisioned, so it should not clog the reconcile queue (with the exception of resync event every 10 minutes).
With enough clusters that resync every 10 minutes adds up! With 600 clusters it becomes on average 1 cluster every second. Combining that with more clusters being created all the time, I think it can be a limiting factor. It is very suspicious to me that the slowdown happened right at 600 clusters, since it is a multiple of the resync interval (600 seconds). This is why I think there is a correlation. If I get time to check it, I would try a different interval, e.g. 300 seconds, and see if it then becomes slow at 300 clusters.
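That experiment should be cheap to run, since the resync interval is a manager flag (`--sync-period`, default 10m, on the CAPI managers; container index 0 is an assumption). A sketch:

```shell
# Sketch: halve the resync period. If the resync hypothesis is right,
# the slowdown knee should move from ~600 clusters to ~300.
kubectl -n capi-system patch deployment capi-controller-manager \
  --type=json -p='[
    {"op": "add", "path": "/spec/template/spec/containers/0/args/-",
     "value": "--sync-period=5m"}
  ]'
```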
Might be worth looking at whether the work queue length metric shows suspicious behavior (e.g. goes up non-linearly at some point)
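A concrete way to check that (`workqueue_depth` is the controller-runtime default workqueue metric; the metrics port and the `name` label value are assumptions about this particular deployment):

```shell
# Sketch: forward the KCP manager's metrics port and inspect queue depth.
kubectl -n capi-kubeadm-control-plane-system port-forward \
  deployment/capi-kubeadm-control-plane-controller-manager 8080:8080 &
curl -s localhost:8080/metrics | grep '^workqueue_depth'
# Look for a line like: workqueue_depth{name="kubeadmcontrolplane"} <N>
# and watch whether N grows faster than linearly as clusters are added.
```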
cc @richardcase who might be interested in the discussion as well
Thanks @fabriziopandini , very timely. This is gold @lentzi90 , great work and super helpful.
This is extremely useful/helpful to me, thank you for the hard work!
/reopen
I assume closing this issue with #8579 was not intended
@sbueringer: Reopened this issue.
For me it is ok to close it but if you want to keep it for more discussions that is also fine :slightly_smiling_face: I know there is work on-going with adding scalability e2e tests, maybe that can be tracked here? I marked the PR as fixing this since it solved the main blocker for my use case, but there is definitely more to do
Absolutely up to you :). If you want to continue and bring up other issues feel free to do it either here or in a separate issue. (feel free to close again)
We'll definitely track the scalability e2e test in a separate issue.
Thanks! I'll close this then. It will be easier to track specific issues that way. This has been a great discussion and exploration, thanks!
This is not really a bug, more like a performance issue. We hope to scale CAPI to thousands of workload clusters managed by a single management cluster. Before doing this with real hardware we are trying to check that the controllers can handle it. For a single workload cluster with 1000 Machines, this was no issue at all, but when trying to create hundreds of single node clusters things get very slow.
What steps did you take and what happened:
The experiment setup, including all scripts, can be found here. In short, it uses clusterctl as normal.
The simulation is not perfect, and perhaps this is impacting the performance. I have not been able to confirm or rule this out. What I have found is this:
Performance:
The bottleneck seems to be the kubeadm control plane provider. There is a long pause after the KCP is created before the Machines appear. To mitigate this, I tried sharding by running one kubeadm control plane controller for each namespace, and grouping the workload clusters into these namespaces (10 namespaces with 10 clusters each). The controllers basically ate the CPU and everything became slow. Maybe it is still the way to go, just with more CPU or fewer shards?
What did you expect to happen:
I was hoping to be able to reach 1000 workload clusters in a "reasonable" time and that creating new clusters would not take several minutes.
Anything else you would like to add:
I just want to highlight again that the simulation is not perfect. If you have ideas for how to improve it, or ways to check if it is impacting the performance, I would be very happy to hear about it.
Environment:
Kubernetes version (use kubectl version): v1.25.3 (kind cluster), v1.26.1 (kubectl)
OS (e.g. from /etc/os-release): Ubuntu 22.04
/kind bug