open-policy-agent / gatekeeper

🐊 Gatekeeper - Policy Controller for Kubernetes
https://open-policy-agent.github.io/gatekeeper/
Apache License 2.0

OOMKilled - gatekeeper:v3.4.0 #1279

Closed abhi-kapoor closed 2 years ago

abhi-kapoor commented 3 years ago

gatekeeper-audit seems to be consuming a lot of memory. Initially, we observed that the pod was crashlooping because it was being OOMKilled. We have bumped the limits a couple of times now, but it still ends up using whatever limits we set.

We have used a VerticalPodAutoscaler on the gatekeeper-audit deployment to get insights into the memory consumption and what the target memory should be. We have tried adjusting the resources a few times now, but the memory consumption keeps growing. As of now, it looks something like:

    resources:
      limits:
        cpu: "1"
        memory: 850Mi
      requests:
        cpu: 100m
        memory: 850Mi
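
For reference, the VerticalPodAutoscaler mentioned above can be run in recommendation-only mode so it reports a memory target without evicting the pod. A minimal sketch, assuming the audit Deployment is named gatekeeper-audit in the gatekeeper-system namespace (the object name and namespace are assumptions, not copied from our cluster):

    apiVersion: autoscaling.k8s.io/v1
    kind: VerticalPodAutoscaler
    metadata:
      name: gatekeeper-audit-vpa        # hypothetical name
      namespace: gatekeeper-system      # assumed install namespace
    spec:
      targetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: gatekeeper-audit
      updatePolicy:
        updateMode: "Off"               # recommendation-only, no automatic resizing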

[image: memory usage graph for the gatekeeper-audit pod]

I am a bit curious to know how this actually works. We are deploying this on a shared multi-tenancy cluster, so many more API resources will be added as we onboard new tenants. As of now, we just have a single basic rule, K8sRequiredLabels, as a POC.

It seems like the way gatekeeper-audit works is that it loads all resources into memory and audits them against the defined rules.

Are there any recommendations on what we should be doing on our end to improve this memory utilization? I have also reviewed the related issues below and followed the recommendations there, but no luck:

  1. https://github.com/open-policy-agent/gatekeeper/issues/339
  2. https://github.com/open-policy-agent/gatekeeper/issues/780

Kubernetes version:

kubectl version --short=true
Client Version: v1.15.0
Server Version: v1.17.17-gke.3000
ritazh commented 3 years ago

@abhinav454 Thanks for reporting the issue. A few questions to help us understand your environment:

abhi-kapoor commented 3 years ago

@ritazh Thank you for taking a look at this.

We might be using the default settings; the configuration used is below:

--constraint-violations-limit=20
--audit-from-cache=false
--audit-chunk-size=0
--audit-interval=60
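
For reference, these flags end up as container arguments on the gatekeeper-audit Deployment; a rough sketch of the relevant portion of the pod spec (container name, image tag, and surrounding fields are assumed, not copied from our manifest):

    containers:
      - name: manager                       # assumed container name
        image: openpolicyagent/gatekeeper:v3.4.0
        args:
          - --operation=audit
          - --constraint-violations-limit=20
          - --audit-from-cache=false
          - --audit-chunk-size=0
          - --audit-interval=60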

Oh, this seems to be a great feature. We don't have it enabled yet; I will go ahead and enable it and get back to you once it is in place. I do hope that this will make more effective use of resources, as we are only using Gatekeeper for a single kind.

> • can you share your constraint(s)

The constraint seems to be that the pod is crashlooping because it is getting OOM killed. We have bumped the memory a couple of times, but it still runs out of memory. Since our cluster is a multi-tenancy Kubernetes cluster, I am afraid that as we add new services, it will keep asking for more and more memory.

Hope this provides the information you were looking for; otherwise I will be more than happy to provide any other information that can assist with troubleshooting this further. I do have a hunch that setting the --audit-match-kind-only=true flag will help for now.

sozercan commented 3 years ago

@abhinav454 Depending on the number of resources you are trying to audit in your cluster (# of pods, configmaps, etc.), you can also set --audit-chunk-size. This processes the audit in smaller chunks instead of one big chunk, which might reduce memory consumption.

How many constraint templates and constraints per template do you have deployed in your cluster? And what is the # of resources you are trying to audit (like # of pods)?

abhi-kapoor commented 3 years ago

> @abhinav454 Depending on the number of resources you are trying to audit in your cluster (# of pods, configmaps, etc.), you can also set --audit-chunk-size. This processes the audit in smaller chunks instead of one big chunk, which might reduce memory consumption.

Ohhh, that would also be helpful 🙇‍♂️

> How many constraint templates and constraints per template do you have deployed in your cluster?

We only have a single ConstraintTemplate with a single constraint as of now. This was really just a POC and we were planning to add more, but then we ran into these issues. I will set this flag as well and report back on how the pod is doing.

abhi-kapoor commented 3 years ago

@ritazh @sozercan I strongly believe that setting the --audit-match-kind-only=true flag will help resolve some of the issues we are facing. However, we use the Helm chart to deploy this, and it seems that passing this argument is not yet supported in the template: https://github.com/open-policy-agent/gatekeeper/blob/aa20de6acc0f26943305483271051e9317c2c6ec/charts/gatekeeper/templates/gatekeeper-audit-deployment.yaml#L47

I was about to open a PR to add support for that, but then realized that the Helm charts are built automatically using helmify, which already has the change: https://github.com/abhinav454/gatekeeper/blob/aa20de6acc0f26943305483271051e9317c2c6ec/cmd/build/helmify/kustomize-for-helm.yaml#L107

Does that mean that the next version will allow us to pass this argument? If so, any timeline for this would be much appreciated 🙏 🙇
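
If the chart does pick this up, passing the flag should reduce to a Helm value override. A hypothetical values.yaml snippet, assuming the eventual value names mirror the other audit options (the exact names depend on the released chart version):

    # values.yaml -- illustrative only; confirm value names against the released chart
    auditInterval: 60
    constraintViolationsLimit: 20
    auditFromCache: false
    auditChunkSize: 500
    auditMatchKindOnly: true

which the chart would then render into the corresponding --audit-* arguments on the audit Deployment.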

sozercan commented 3 years ago

@abhinav454 That's right, it looks like it was added in #1245 and is in the staging chart. It'll be available in the Helm repo when we cut the next minor version release.

abhi-kapoor commented 3 years ago

> @abhinav454 That's right, it looks like it was added in #1245 and is in the staging chart. It'll be available in the Helm repo when we cut the next minor version release.

Thank you for your quick response. Will add this flag as soon as it is available. That should address the issues we are facing 👍

ritazh commented 3 years ago

> I strongly believe that setting the --audit-match-kind-only=true flag will help resolve some of the issues we are facing.

Should we reconsider setting this value to true by default? @maxsmythe I know you had objections to this, but given that this is a recurring issue and this flag clearly helps, let's revisit it.

maxsmythe commented 3 years ago

Have the arguments changed?

If this flag works for them now, it may stop working if they add a constraint that doesn't match against kind, or if they add another constraint against a different kind. They did mention they wanted to add more constraints. Users should opt in to such a limitation.

Also, we are not sure yet whether --audit-match-kind-only will address the issue. Hopefully it will, but it fails in the case where clusters have large numbers of a given kind (e.g. lots of Pods). If that is the case, then a solution like chunking will be needed. Chunking should also be resilient to any set of inbound constraints, regardless of their contents. Taking a look at the code, memory usage (with a single constraint) should be proportional to the number of resources associated with the most populous kind.
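
To make that failure mode concrete, here is a hypothetical constraint (illustrative only, not from this cluster) that matches on a label selector with no spec.match.kinds; with such a constraint in place, audit can no longer narrow the set of kinds it lists, so --audit-match-kind-only stops helping:

    apiVersion: constraints.gatekeeper.sh/v1beta1
    kind: K8sRequiredLabels
    metadata:
      name: everything-must-have-owner-label   # hypothetical example
    spec:
      enforcementAction: dryrun
      match:
        # no kinds matcher here, so every kind is in scope for audit
        labelSelector:
          matchExpressions:
            - key: team
              operator: Exists
      parameters:
        labels: ["owner"]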

maxsmythe commented 3 years ago

@abhinav454 Can I ask:

maxsmythe commented 3 years ago

Also, if you could copy/paste your test constraint, I think that'd be interesting to look at.

maxsmythe commented 3 years ago

@abhinav454 Also, taking a look at your graph, I notice memory use seems to spike every few days even though your audit cycle is every 60 seconds.

Is the graph accurate? How often are you OOMing? If the OOMing is sporadic (i.e. the pod can run without crashing for, say, > 10 minutes), then the OOMing wouldn't be able to be explained by normal audit behavior.

maxsmythe commented 3 years ago

Mostly those major spikes in the graph are interesting to me; growth over time could be explained by an increasing number of resources on the cluster.

abhi-kapoor commented 3 years ago

@maxsmythe Thank you for looking into this. I hope the below answers some of the questions you might have:

We bumped the gatekeeper-audit limits even further using the VPA recommendations, and the pod hasn't restarted or crashlooped in the last 4 days. The first time we used the VPA to get recommendations and bumped the limits accordingly, the pod was still getting OOM killed. Since the second time we did this exercise, it has been more stable.

We will be implementing both auditChunkSize and --audit-match-kind-only=true.

Currently, we are not using Gatekeeper very heavily and have implemented it only as a POC, so we have a single ConstraintTemplate and only 1 rule, which is of kind K8sRequiredLabels and applies to Namespaces. For your reference, you can see the rule below:

apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sRequiredLabels
metadata:
  name: ns-must-have-name-label
spec:
  # operate in audit mode. ie do not enforce admission
  enforcementAction: dryrun
  match:
    kinds:
      - apiGroups: [""]
        kinds: ["Namespace"]
  parameters:
    labels: ["name"]

We haven't decided on any other rules yet; we just wanted to try out a single rule such as the one above. The cluster is a large, shared multi-tenancy cluster that runs multiple tenants, and as we onboard more tenants, more API resources will be added to the cluster.

maxsmythe commented 3 years ago

Thanks for the extra info!

Can you give us:

That data will make it easier to distinguish between the following scenarios:

Which would be super helpful, as it would let us know whether there is a performance bug we should be targeting to make ourselves more memory efficient.

cnwaldron commented 3 years ago

I ran into this same issue and implemented the additional arg --audit-match-kind-only=true. That didn't stop the OOMKilled pods. I added --audit-chunk-size=500 and that did the trick.

sozercan commented 3 years ago

@cnwaldron glad chunk size worked for you! just curious, did you have any constraints that didn't have match kind defined?

cnwaldron commented 3 years ago

@sozercan All constraints had match kind defined

abhi-kapoor commented 3 years ago

@maxsmythe After setting the --audit-chunk-size=500 flag, it has been working fine. In order to assist with troubleshooting any cache/buffer issues:

naveen210121 commented 3 years ago

Hi,

We are also facing the same issue: every 5-10 minutes the gatekeeper pod is getting OOMKilled and going into CrashLoopBackOff.

Configuration:

GKE version: 1.18 (GCP GKE Kubernetes cluster)
Gatekeeper version: gatekeeper:v3.1.0-beta.2

    resources:
      limits:
        cpu: 1500m
        memory: 1500Mi
      requests:
        cpu: 800m
        memory: 800Mi

There are a total of ~800 pods running in the cluster.

I have verified the node capacity; the nodes have plenty of free memory, so pods can scale up to their limits. To be clear, the pod is not crashing because of limit overcommit.

@sozercan, @maxsmythe, @ritazh, @abhinav454: Please suggest suitable requests/limits values for the gatekeeper workload.

maxsmythe commented 3 years ago

@abhinav454 Thank you for the data!

@naveen-mindtree Unfortunately it's hard to come up with scaling recommendations for audit memory usage, as it depends on:

A quick way to figure out memory requirements experimentally is to keep doubling the memory limit until the pod becomes stable; then you can scale back to only the memory the pod actually needs (with some overhead for growth).
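
As a purely illustrative example of that approach (the numbers below are placeholders, not recommendations):

    # Round 1: OOMKilled with a 1500Mi limit  -> double it
    # Round 2: stable at 3000Mi, observed peak usage well below the limit
    # Final:   trim back to the observed peak plus headroom for growth
    resources:
      requests:
        memory: 2304Mi
      limits:
        memory: 2304Mi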

naveen210121 commented 3 years ago

Thank you so much @maxsmythe

So it will be a trial-and-error approach. Thanks again.