redhat-cop / namespace-configuration-operator

The namespace-configuration-operator helps keep configurations related to Users, Groups, and Namespaces aligned with one or more policies specified as CRs
Apache License 2.0

Namespace config operator is consuming too much memory #96

Closed hanzala1234 closed 9 months ago

hanzala1234 commented 3 years ago

Whenever we start the operator, memory consumption goes up to 20 GB and our API server becomes unresponsive. The API server starts consuming more than 15 GB, then it gets killed and the master becomes unhealthy.

We have to scale down the namespace-config operator to make the API server responsive again. What could be the reason that it consumes so much memory once it starts? Could there be a memory leak? Is it possible for it to reconcile resources in chunks rather than all at once? How can we find the root cause?

[screenshot: ns-operator memory usage graph]

raffaelespazzoli commented 3 years ago

hello,

Which version are you using? We used to have this kind of issue in an old version. A "controller manager" is allocated per NamespaceConfig object. Each controller manager creates a new cache and a different set of watches against the master API. I'd expect memory consumption to be proportional to the number of NamespaceConfig objects, not to the number of objects created as an effect of a NamespaceConfig, so that should allow you to scale easily. Please let me know if your experience is different.
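To make that design more concrete, here is a minimal controller-runtime sketch, not the operator's actual code: one manager with its own cache and watches, like the one allocated per NamespaceConfig. The reconciler and the watched kinds (Namespace, Secret, RoleBinding) are placeholders chosen only for illustration.

```go
package main

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	rbacv1 "k8s.io/api/rbac/v1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// placeholderReconciler stands in for the logic that applies the resources
// templated by a NamespaceConfig; it is not the operator's real reconciler.
type placeholderReconciler struct {
	client.Client
}

func (r *placeholderReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	// ...apply or verify the templated resources here...
	return ctrl.Result{}, nil
}

func main() {
	// One manager like this is created per NamespaceConfig in the design
	// described above. Each manager keeps its own informer cache of every
	// watched type and opens its own watch connections to the API server,
	// so memory grows roughly with (number of managers x cached objects).
	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{})
	if err != nil {
		panic(err)
	}

	// In the real operator the primary resource is the NamespaceConfig CRD;
	// core types stand in here so the sketch stays self-contained. Every
	// For()/Owns() call adds another watch and another cache.
	if err := ctrl.NewControllerManagedBy(mgr).
		For(&corev1.Namespace{}).
		Owns(&corev1.Secret{}).
		Owns(&rbacv1.RoleBinding{}).
		Complete(&placeholderReconciler{Client: mgr.GetClient()}); err != nil {
		panic(err)
	}

	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		panic(err)
	}
}
```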


rasheedamir commented 3 years ago

@raffaelespazzoli We are currently running "version":"1.0.3".

On the cluster where we are experiencing this issue we have only 20 NamespaceConfig objects.

We have been experiencing this issue for quite some time now, and it has become a blocker! Last time it spiked to 20 GB and then stabilized at 6.5 GB, but 6.5 GB is still way too much for an operator.

Currently it is scaled down to zero!

rasheedamir commented 3 years ago

Here is the last 7 days' usage:

[Grafana screenshot (openshift-monitoring): Kubernetes compute resources for the workload, captured 2021-04-05]

rasheedamir commented 3 years ago

@raffaelespazzoli any thoughts on how we can troubleshoot it?

raffaelespazzoli commented 3 years ago

In the past we had a memory leak, but that does not seem to be the case here since the memory allocation is constant. Which types of objects are created by your NamespaceConfigs? How big is your cluster, and how big is the etcd database? Besides the operator pod using a lot of memory, did you see any other side effects? Looking at the API server metrics, you should be able to plot the number of watches that the operator pods open against it; I think that would also be useful to see.
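To illustrate the watch-count suggestion, here is a small sketch that queries Prometheus for the kube-apiserver's registered-watchers gauge. The Prometheus URL, the lack of authentication, and the exact metric name (apiserver_registered_watchers) are assumptions and may differ on your cluster:

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/url"
	"os"
)

func main() {
	// Assumption: PROM_URL points at a reachable Prometheus endpoint, e.g. a
	// local port-forward of the openshift-monitoring Prometheus, and no auth
	// is required; on OpenShift you would normally pass a bearer token too.
	prom := os.Getenv("PROM_URL") // e.g. http://localhost:9090

	// apiserver_registered_watchers reports active watches per group/version/kind;
	// the metric name may vary across Kubernetes versions.
	query := `sum(apiserver_registered_watchers) by (kind)`

	resp, err := http.Get(prom + "/api/v1/query?query=" + url.QueryEscape(query))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(body)) // raw JSON; pipe through jq or plot it in Grafana
}
```

Comparing this series with the operator scaled to zero and then scaled back up should show how many additional watches the operator pods open.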


hanzala1234 commented 3 years ago

Which types of objects are created by your NamespaceConfigs?

We are creating mostly Secrets, Roles and RoleBindings, and Tekton resources (TriggerTemplates, TriggerBindings, Pipelines, EventListeners).

How big is your cluster, and how big is the etcd database?

We have 16 nodes, including 3 master nodes. The etcd size right now is 818 Mi on average.

Besides the operator pod using a lot of memory, did you see any other side effects?

The API server is crashing. Once we scale the namespace config operator back up, the whole cluster gets affected.

raffaelespazzoli commented 3 years ago

How many NamespaceConfig objects do you have, and how many namespaces do you have?

Can you run an experiment in which you create your NamespaceConfig objects one every 5 minutes and monitor how the memory increases?
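One possible way to script that experiment is sketched below; the manifest file names, the operator namespace, and the 5-minute interval are assumptions, and `oc adm top pod` needs cluster metrics to be available.

```go
package main

import (
	"fmt"
	"os/exec"
	"time"
)

func main() {
	// Hypothetical manifest files, one NamespaceConfig each; adjust to your environment.
	manifests := []string{
		"namespaceconfig-01.yaml",
		"namespaceconfig-02.yaml",
		"namespaceconfig-03.yaml",
	}
	operatorNamespace := "namespace-configuration-operator" // assumed install namespace

	for _, m := range manifests {
		if out, err := exec.Command("oc", "apply", "-f", m).CombinedOutput(); err != nil {
			fmt.Printf("apply %s failed: %v\n%s\n", m, err, out)
			return
		}
		fmt.Printf("applied %s, waiting 5 minutes before sampling memory...\n", m)
		time.Sleep(5 * time.Minute)

		// Sample the operator pod's current memory usage.
		out, err := exec.Command("oc", "adm", "top", "pod", "-n", operatorNamespace).CombinedOutput()
		if err != nil {
			fmt.Printf("top failed: %v\n%s\n", err, out)
			continue
		}
		fmt.Printf("%s\n", out)
	}
}
```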

hanzala1234 commented 3 years ago

We have 20 NamespaceConfig objects and a total of 134 namespaces, but the namespace config operator only applies to 30-40 of them. Also, in our environment we create namespaces dynamically for PR testing.

raffaelespazzoli commented 3 years ago

OK, so we can predict that the cache size should be roughly 20 × (object types created × number/size of the objects of those types in etcd across the 134 namespaces). I'd like to see the memory progression when you add NamespaceConfigs with the experiment described above. 20 namespace configs is a high number; have you put some thought into perhaps collapsing some of them?
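As a rough worked example of that estimate, with every count and size below made up purely for illustration:

```go
package main

import "fmt"

func main() {
	// Hypothetical numbers, for illustration only.
	namespaceConfigs := 20 // each gets its own cache
	objectKinds := 6       // e.g. Secret, Role, RoleBinding, TriggerTemplate, ...
	objectsPerKind := 134  // assume one object of each kind per namespace
	avgObjectBytes := 4096 // assumed average serialized object size

	// Each NamespaceConfig's cache holds every object of every kind it watches,
	// so the caches alone scale multiplicatively with these factors.
	cacheBytes := namespaceConfigs * objectKinds * objectsPerKind * avgObjectBytes
	fmt.Printf("rough cache estimate: %.1f MiB\n", float64(cacheBytes)/(1<<20))
	// With these numbers: 20 * 6 * 134 * 4096 bytes is about 62.8 MiB of raw
	// object data, before informer bookkeeping and Go runtime overhead, which
	// in practice can multiply the footprint several times.
}
```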


rasheedamir commented 3 years ago

20 is a high number :(

Why is a "controller manager" is allocated per NamespaceConfig object?

raffaelespazzoli commented 3 years ago

20 is a high number :(

By that I mean that I had never before seen a deployment where so many definitions were needed, and that perhaps there is a way to collapse some of them and optimize. I didn't mean to say that the operator should not support it.

Why is a "controller manager" allocated per NamespaceConfig object?

That's how the operator is designed. One can't dynamically add watchers to a running controller-manager, so each time a NamespaceConfig object is created, the needed watchers are grouped into a new controller-manager.

raffaelespazzoli commented 2 years ago

May I close this issue?

Florian-94 commented 2 years ago

Hello,

We are using version 1.2.0 of the nsconfig operator on a 4.8.14 OCP cluster. We really appreciate it, except for the RAM consumption... On one of our OpenShift BUILD clusters we have 125 NamespaceConfig objects, one for each namespace (and its associated RoleBindings, NetworkPolicies, ResourceQuotas, LimitRanges, ...), and we plan to host new clients (i.e. namespaces) soon. The limit for the nsconfig operator pod is 7 GB of RAM and it's not enough: the container restarts every 15 minutes. We are going to set the limit to 10 GB of RAM, which is huge and makes scheduling this pod on our workers riskier. Is there a way to change the behaviour of the operator to limit this need for RAM? Thank you,

Florian

P.S.: On another cluster we have 35 NamespaceConfig objects and RAM utilization is stable at 1.15 GB. It seems RAM consumption is not linear with the number of NamespaceConfig objects.

raffaelespazzoli commented 2 years ago

@Florian-94 I recommend upgrading, but that will probably not solve your problem. There is definitely a correlation between the number of NamespaceConfigs, the types of objects being configured, and the memory used by this operator; that cannot be eliminated. Having one NamespaceConfig object per namespace is technically possible, but it's not what was intended for this operator. Can you share your use case? Maybe a couple of NamespaceConfigs could cover the different namespaces? I wonder if the operator can be used in a way that is more in line with what was intended.

Florian-94 commented 2 years ago

We have a web access portal where our customers can choose all the specific parameters for LimitRanges and ResourceQuotas (the portal runs a validation process before creating the NamespaceConfig CRs on the OpenShift cluster). Maybe we could use the "tee-shirt size" system offered by the nsconfig operator for this usage. We also apply 2 network policies (in the NamespaceConfig CR) to be sure users can't modify or delete them (they are the same for all namespaces).

But on this portal our customers also manage the users that will have access to the namespace (kind: Group in the NamespaceConfig objects, with specific user IDs in the user field). So I can't see how to use a shared NamespaceConfig template for this usage. Maybe the nsconfig operator was not the right choice for our needs; maybe we should just apply the k8s objects directly and prevent namespace admin users from editing them (with Gatekeeper, for example). We didn't see the RAM problem caused by too many NamespaceConfig resources when we made this choice. Thanks for your help.

raffaelespazzoli commented 2 years ago

Both use cases should be addressable with a single NamespaceConfig. For the quotas, create an annotation on the namespace with the needed values and then use the templating capability to apply them in a given namespace. I'm not sure about the network policy; I'd have to see it.


raffaelespazzoli commented 9 months ago

May I close this issue?

Florian-94 commented 9 months ago

Yes, you can close the issue for me. Thank you. We are not using the namespace-configuration-operator anymore. Maybe we will come back to it one day if we decide to use size templates to manage quotas/LimitRanges for our projects.