kubernetes-sigs / cluster-api

Home for Cluster API, a subproject of sig-cluster-lifecycle
https://cluster-api.sigs.k8s.io
Apache License 2.0

KCP CPU hungry with many clusters #8602

Closed lentzi90 closed 1 year ago

lentzi90 commented 1 year ago

What steps did you take and what happened?

When scaling the number of workload clusters, the KCP controller stands out as [extremely CPU hungry](https://github.com/kubernetes-sigs/cluster-api/issues/8052#issuecomment-1456178857). I have been investigating this a bit more closely now and found some hints about why. This is what I did:

  1. Set up a CAPI development environment with Tilt, with observability and profiling enabled:

     ```yaml
     # tilt-settings.yaml
     default_registry: gcr.io/cluster-api-provider
     enable_providers:
     - docker
     - kubeadm-bootstrap
     - kubeadm-control-plane
     deploy_observability:
     - grafana
     - kube-state-metrics
     - prometheus
     debug:
       core:
         profiler_port: 40000
       kubeadm-control-plane:
         profiler_port: 40001
       kubeadm-bootstrap:
         profiler_port: 40002
     ```
  2. Observe metrics at idle with 0 clusters created and gather profiling data
  3. Create 10 clusters and let things settle
    1. Note that I didn't apply any CNI in the workload clusters, so the nodes never became fully Ready and the clusters were re-queued more often than the sync interval, causing somewhat higher load on the KCP controller. More clusters would cause the same load anyway, so I think this just helps the profiling.
  4. Compare metrics and gather profiling data again

I took profiling samples using this command:

```bash
go tool pprof -http=:8080 -seconds=10 http://localhost:40001/debug/pprof/profile
```

I also created a simple [dashboard](https://github.com/kubernetes-sigs/cluster-api/files/11396432/dashboard.txt) in Grafana to check CPU usage and workqueue metrics.

Already at 10 clusters the KCP controller has a much higher CPU usage than the other controllers, as seen here:

![Screenshot from 2023-05-04 13-14-06](https://user-images.githubusercontent.com/9117693/236178989-d069592d-72ad-4d87-aec3-b94d6dd91cbb.png)

I think the flame graph shows quite clearly what this is about. Here is the [profile.pb.gz](https://github.com/kubernetes-sigs/cluster-api/files/11396512/profile.pb.gz), in case you want to investigate further.

![image](https://user-images.githubusercontent.com/9117693/236179436-a49237ee-4f14-470a-897b-462e3b97e4b0.png)

We generate new client certificates all the time. This is quite a CPU-intensive operation, and for large numbers of KCPs it becomes crazy!
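
To make the cost concrete, here is a minimal, self-contained Go sketch of this kind of work (a fresh RSA key plus a CA-signed client certificate); the function name and parameters are illustrative, not the actual KCP code:

```go
package main

import (
	"crypto/rand"
	"crypto/rsa"
	"crypto/x509"
	"crypto/x509/pkix"
	"fmt"
	"math/big"
	"time"
)

// generateClientCert sketches the per-connection work: a fresh RSA key plus a
// client certificate signed by the cluster CA. Generating the 2048-bit key is
// the CPU-heavy part.
func generateClientCert(caCert *x509.Certificate, caKey *rsa.PrivateKey) ([]byte, *rsa.PrivateKey, error) {
	key, err := rsa.GenerateKey(rand.Reader, 2048)
	if err != nil {
		return nil, nil, err
	}
	tmpl := &x509.Certificate{
		SerialNumber: big.NewInt(time.Now().UnixNano()),
		Subject:      pkix.Name{CommonName: "cluster-api"},
		NotBefore:    time.Now().Add(-time.Minute),
		NotAfter:     time.Now().Add(10 * time.Minute),
		KeyUsage:     x509.KeyUsageDigitalSignature,
		ExtKeyUsage:  []x509.ExtKeyUsage{x509.ExtKeyUsageClientAuth},
	}
	der, err := x509.CreateCertificate(rand.Reader, tmpl, caCert, key.Public(), caKey)
	if err != nil {
		return nil, nil, err
	}
	return der, key, nil
}

func main() {
	// Self-signed throwaway "CA" so the sketch runs standalone.
	caKey, _ := rsa.GenerateKey(rand.Reader, 2048)
	caTmpl := &x509.Certificate{
		SerialNumber:          big.NewInt(1),
		Subject:               pkix.Name{CommonName: "kubernetes"},
		NotBefore:             time.Now().Add(-time.Minute),
		NotAfter:              time.Now().Add(time.Hour),
		IsCA:                  true,
		KeyUsage:              x509.KeyUsageCertSign,
		BasicConstraintsValid: true,
	}
	caDER, _ := x509.CreateCertificate(rand.Reader, caTmpl, caTmpl, caKey.Public(), caKey)
	caCert, _ := x509.ParseCertificate(caDER)

	certDER, _, err := generateClientCert(caCert, caKey)
	fmt.Printf("generated %d-byte client cert, err=%v\n", len(certDER), err)
}
```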

What did you expect to happen?

I'm hoping the KCP controller can be optimized to reduce the CPU requirements for larger numbers of clusters. This will probably require some kind of cache for re-using the client certificates used for connecting to the workload clusters. I also found a TODO comment along these lines in the code: https://github.com/kubernetes-sigs/cluster-api/blob/7edbaf04263c77177ae68f4d64115d207fadb7eb/controlplane/kubeadm/internal/cluster.go#L104

Cluster API version

Main branch (546f46464da13a98b87d469924b407fafe088df8)

Kubernetes version

$ kubectl version
WARNING: This version information is deprecated and will be replaced with the output from kubectl version --short.  Use --output=yaml|json to get the full version.
Client Version: version.Info{Major:"1", Minor:"27", GitVersion:"v1.27.1", GitCommit:"4c9411232e10168d7b050c49a1b59f6df9d7ea4b", GitTreeState:"clean", BuildDate:"2023-04-14T13:21:19Z", GoVersion:"go1.20.3", Compiler:"gc", Platform:"linux/amd64"}
Kustomize Version: v5.0.1
Server Version: version.Info{Major:"1", Minor:"26", GitVersion:"v1.26.3", GitCommit:"9e644106593f3f4aa98f8a84b23db5fa378900bd", GitTreeState:"clean", BuildDate:"2023-03-30T06:34:50Z", GoVersion:"go1.19.7", Compiler:"gc", Platform:"linux/amd64"}

Anything else you would like to add?

Related issues:

- #7308
- #8052

Label(s) to be applied

/kind bug
/label area/control-plane

k8s-ci-robot commented 1 year ago

@lentzi90: The label(s) /label area/control-plane cannot be applied. These labels are supported: api-review, tide/merge-method-merge, tide/merge-method-rebase, tide/merge-method-squash, team/katacoda, refactor. Is this label configured under labels -> additional_labels or labels -> restricted_labels in plugin.yaml?

In response to [this](https://github.com/kubernetes-sigs/cluster-api/issues/8602):

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.
killianmuldoon commented 1 year ago

/triage accepted

Great finding!

sbueringer commented 1 year ago

Thank you very much.

Sounds reasonable.

I think we can either extend the ClusterCacheTracker in some way to store the etcd client credentials, or keep a smaller copy of it in KCP just for the etcd certs.

I slightly lean towards the first option.

The tricky parts are the things we already solved in the ClusterCacheTracker.

It's not super trivial to implement this without missing something, which is why I lean towards extending the ClusterCacheTracker.

sbueringer commented 1 year ago

@vincepri Opinions?

sbueringer commented 1 year ago

Now that I'm looking at it again, there's probably also a way to avoid doing this 4x in the same Reconcile call (which should already improve things by 4x).
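
As a rough illustration of the idea (the step names below are made up, not the real KCP helpers), the pattern would be to resolve the expensive client once per reconcile and hand it to every step:

```go
package main

import "fmt"

// expensiveClient stands in for the workload cluster client whose construction
// generates a fresh client certificate (the costly part in the flame graph).
type expensiveClient struct{}

var builds int

func getWorkloadCluster() *expensiveClient {
	builds++ // in KCP this is where the cert generation happens
	return &expensiveClient{}
}

// Hypothetical reconcile steps; the real KCP helpers have different names.
func reconcileEtcdMembers(*expensiveClient)    {}
func checkControlPlaneHealth(*expensiveClient) {}
func reconcileKubeadmConfig(*expensiveClient)  {}
func updateStatus(*expensiveClient)            {}

// Before: each step builds its own client, so one reconcile pays the cost 4x.
func reconcileNaive() {
	reconcileEtcdMembers(getWorkloadCluster())
	checkControlPlaneHealth(getWorkloadCluster())
	reconcileKubeadmConfig(getWorkloadCluster())
	updateStatus(getWorkloadCluster())
}

// After: build the client once per reconcile and pass it to every step.
func reconcileShared() {
	wc := getWorkloadCluster()
	reconcileEtcdMembers(wc)
	checkControlPlaneHealth(wc)
	reconcileKubeadmConfig(wc)
	updateStatus(wc)
}

func main() {
	builds = 0
	reconcileNaive()
	fmt.Println("naive builds per reconcile:", builds) // 4
	builds = 0
	reconcileShared()
	fmt.Println("shared builds per reconcile:", builds) // 1
}
```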

fabriziopandini commented 1 year ago

This is a great finding, kudos to @lentzi90!

WRT the solution, I'm OK with storing the etcd client cert in the ClusterCacheTracker, though I would keep it separate from the clusterAccessor so we are not recreating certificates every time there is a connection issue.

sbueringer commented 1 year ago

WRT the solution, I'm OK with storing the etcd client cert in the ClusterCacheTracker, though I would keep it separate from the clusterAccessor so we are not recreating certificates every time there is a connection issue.

I think that's where it becomes complicated. If we don't store it in the clusterAccessor, we need separate health checking for the etcd client certs and additional logic to remove them when the Cluster is deleted. Without health checking we would never re-create the certificates once they become invalid.

But an alternative could also be to just store them in some map with a TTL and re-create them when we run into a connection error. That might be a lot simpler.
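
A minimal sketch of what such a TTL map could look like, with a hypothetical TTLCertCache type and an explicit Invalidate call for the connection-error case (not an actual CAPI API):

```go
package etcdcerts

import (
	"crypto/tls"
	"sync"
	"time"
)

type entry struct {
	cert    tls.Certificate
	expires time.Time
}

// TTLCertCache keeps generated client certificates per cluster for a fixed TTL.
type TTLCertCache struct {
	mu  sync.Mutex
	ttl time.Duration
	m   map[string]entry
}

func NewTTLCertCache(ttl time.Duration) *TTLCertCache {
	return &TTLCertCache{ttl: ttl, m: map[string]entry{}}
}

// Get returns the cached certificate for the cluster, or calls generate (the
// expensive path) when there is no entry or the entry is older than the TTL.
func (c *TTLCertCache) Get(cluster string, generate func() (tls.Certificate, error)) (tls.Certificate, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if e, ok := c.m[cluster]; ok && time.Now().Before(e.expires) {
		return e.cert, nil
	}
	cert, err := generate()
	if err != nil {
		return tls.Certificate{}, err
	}
	c.m[cluster] = entry{cert: cert, expires: time.Now().Add(c.ttl)}
	return cert, nil
}

// Invalidate drops the cached certificate, e.g. after an etcd connection error,
// so the next Get generates a fresh one.
func (c *TTLCertCache) Invalidate(cluster string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	delete(c.m, cluster)
}
```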

But I think we need some research there to see what exactly our options are.

Ensuring we only generate certs once per reconcile sounds more straightforward and would already give us a 4x improvement.

fabriziopandini commented 1 year ago

Agreed, we need some research there (and on the caveats you listed above). I will be happy to take a stab at it as soon as I manage to trim down my backlog to something reasonable (hopefully by EOW). If anyone gets to this before me, feel free to pick it up.

/help

k8s-ci-robot commented 1 year ago

@fabriziopandini: This request has been marked as needing help from a contributor.

Guidelines

Please ensure that the issue body includes answers to the following questions:

For more details on the requirements of such an issue, please see here and ensure that they are met.

If this request no longer meets these requirements, the label can be removed by commenting with the /remove-help command.

In response to [this](https://github.com/kubernetes-sigs/cluster-api/issues/8602):

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.
fabriziopandini commented 1 year ago

/assign

fabriziopandini commented 1 year ago

I have created a minimal fix that uses the ClusterCacheTracker to store the private key used in generateClientCert, so the most expensive step in the above flame graph will be executed once per cluster instead of 4 times for each reconcile.
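
Roughly, the idea looks like the following sketch, assuming only the RSA private key is cached per cluster (the names are illustrative, not the actual ClusterCacheTracker API):

```go
package keycache

import (
	"crypto/rand"
	"crypto/rsa"
	"sync"
)

// ClientKeyCache caches one RSA private key per cluster so the expensive key
// generation happens once per cluster; certificates can still be signed fresh
// for every connection using the cached key.
type ClientKeyCache struct {
	keys sync.Map // cluster key (namespace/name) -> *rsa.PrivateKey
}

// GetOrGenerate returns the cached private key for the cluster, generating it
// on first use.
func (c *ClientKeyCache) GetOrGenerate(cluster string) (*rsa.PrivateKey, error) {
	if v, ok := c.keys.Load(cluster); ok {
		return v.(*rsa.PrivateKey), nil
	}
	key, err := rsa.GenerateKey(rand.Reader, 2048)
	if err != nil {
		return nil, err
	}
	// LoadOrStore keeps a single key even if two reconciles race on first use.
	actual, _ := c.keys.LoadOrStore(cluster, key)
	return actual.(*rsa.PrivateKey), nil
}
```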

@sbueringer PTAL when you have some bandwidth.

@lentzi90 it would be great if you could validate this PR (if you are not using main, you can cherry-pick the commit on top of your version).

lentzi90 commented 1 year ago

Thanks for the patch @fabriziopandini ! I have verified it and commented on the PR directly with findings :slightly_smiling_face: