MacThrawn opened 1 year ago
When cert-utils starts, it creates watches on secrets and configmaps, so when you have a ton of them it will generate significant load on the kube API servers. But that should only happen at startup; the load should then subside, unless your secrets are constantly changing. We will do some investigation on this on our side.
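To give a feel for why the startup phase is the expensive part: an informer-style watch first does a full LIST of every object of the watched kind before switching to the cheap incremental WATCH stream, so the apiserver has to serialize the whole collection at once. A rough back-of-the-envelope sketch, using the object counts reported below and a purely hypothetical average object size:

```python
# Rough illustration only: the average serialized size is an assumption,
# not a measured value. Counts are the ones reported in this issue.
secrets = 12644
configmaps = 5655
avg_obj_bytes = 8 * 1024  # assumed ~8 KiB per serialized object (hypothetical)

# An informer's initial sync LISTs every object before the WATCH begins,
# so the apiserver serializes this much data per full list/relist.
initial_list_bytes = (secrets + configmaps) * avg_obj_bytes
print(f"initial LIST payload ≈ {initial_list_bytes / 1024**2:.0f} MiB per relist")
```

Under these assumptions that is on the order of 140 MiB per full list, and relists repeat whenever the watch connection is dropped and re-established, which is consistent with load that spikes at install time and then settles, unless something keeps forcing relists or rewrites.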
Hello everyone, I can confirm this behavior when the cert-utils operator changes configmaps and secrets that are created or managed by other operators in the cluster. It is exactly the same problem as https://github.com/redhat-cop/cert-utils-operator/issues/120 and https://github.com/redhat-cop/cert-utils-operator/issues/127: the loops result in massive amounts of API calls. This is not an OpenShift-specific problem, nor an effect of a specific cluster size; a large cluster only makes it significantly more visible.
Hello,
We would like to install the Cert Utils Operator on OpenShift via OperatorHub. The OpenShift version we are using is 4.11.27 (Kubernetes: v1.24.6+263df15), and the Cert Utils Operator version is 1.3.10. After installing the Cert Utils Operator, we see increasing memory usage in the kube-apiservers:
As you can see, memory fills up on the first kube-apiserver; once it becomes unavailable, the next one fills up, and so on.
The OpenShift cluster has 164 worker nodes and 3 etcd nodes. Each etcd node, where a kube-apiserver also runs, has 12 vCores and 64 GB of memory. The cluster hosts many namespaces and applications, which results in 12644 secrets and 5655 configmaps.
We stopped the installation attempt after two and a half hours because the cluster became very unstable and did not look like it would recover soon. After uninstalling the Cert Utils Operator, the cluster returned to normal operation and stability.
We also tested the installation on a smaller cluster with the same versions. There we also saw increased memory consumption on the kube-apiserver for some time, but after a few minutes it returned to normal operation. During this time we also saw high CPU consumption (up to 2 cores) by the Cert Utils Operator.
The described behavior is reproducible every time, but it appears to depend on the size of the cluster and the number of configmaps and secrets. We are running several other operators in the cluster, and none of them shows similar behavior.
For me it is not clear whether the Cert Utils Operator is doing something wrong, or whether this is an issue with the large number of secrets and configmaps combined with inefficient processing in the kube-apiserver.