open-policy-agent / gatekeeper

🐊 Gatekeeper - Policy Controller for Kubernetes
https://open-policy-agent.github.io/gatekeeper/
Apache License 2.0

OOMKilled - gatekeeper:v3.4.0 #1279

Closed abhi-kapoor closed 2 years ago

abhi-kapoor commented 3 years ago

gatekeeper-audit seems to be consuming a lot of memory. Initially, we observed that the pod was crashlooping because it was being OOMKilled. We have bumped the limits a couple of times now, but it still ends up using whatever limits we set.

We have used a VerticalPodAutoscaler on the gatekeeper-audit deployment to get insights into the memory consumption and what the target memory should be. We have tried adjusting the resources a few times now, but the memory consumption keeps growing. As of now, it looks something like:

    resources:
      limits:
        cpu: "1"
        memory: 850Mi
      requests:
        cpu: 100m
        memory: 850Mi
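
For reference, the VerticalPodAutoscaler mentioned above can be run in recommendation-only mode so it reports a memory target without evicting the pod. A minimal sketch, assuming the audit Deployment is named gatekeeper-audit in the gatekeeper-system namespace (the object name and namespace are assumptions, not copied from our cluster):

    apiVersion: autoscaling.k8s.io/v1
    kind: VerticalPodAutoscaler
    metadata:
      name: gatekeeper-audit-vpa        # hypothetical name
      namespace: gatekeeper-system      # assumed install namespace
    spec:
      targetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: gatekeeper-audit
      updatePolicy:
        updateMode: "Off"               # recommendation-only, no automatic resizing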

[image: memory usage graph for the gatekeeper-audit pod]

I am a bit curious to know how this actually works. We are deploying this on a shared multi-tenancy cluster, so many more API resources will be added as we onboard new tenants. As of now, we just have a single basic rule, K8sRequiredLabels, as a POC.

It seems like the way gatekeeper-audit works is that it loads all resources into memory and audits them against the defined rules.

Are there any recommendations on what we should be doing on our end to improve this memory utilization? I have also reviewed the related issues below and followed the recommendations there, but no luck:

  1. https://github.com/open-policy-agent/gatekeeper/issues/339
  2. https://github.com/open-policy-agent/gatekeeper/issues/780

Kubernetes version:

kubectl version --short=true
Client Version: v1.15.0
Server Version: v1.17.17-gke.3000
ritazh commented 3 years ago

@abhinav454 Thanks for reporting the issue. A few questions to help us understand your environment:

abhi-kapoor commented 3 years ago

@ritazh Thank you for taking a look at this.

We might be using the default settings; the configuration used is below:

--constraint-violations-limit=20
--audit-from-cache=false
--audit-chunk-size=0
--audit-interval=60
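
For reference, these flags end up as container arguments on the gatekeeper-audit Deployment; a rough sketch of the relevant portion of the pod spec (container name, image tag, and surrounding fields are assumed, not copied from our manifest):

    containers:
      - name: manager                       # assumed container name
        image: openpolicyagent/gatekeeper:v3.4.0
        args:
          - --operation=audit
          - --constraint-violations-limit=20
          - --audit-from-cache=false
          - --audit-chunk-size=0
          - --audit-interval=60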

Oh, this seems to be a great feature. We don't have it enabled yet; I will go ahead and enable it and get back to you once it is in place. I do hope that this will make more effective use of resources, as we are only using Gatekeeper for a single kind.

> • can you share your constraint(s)

The constraint seems to be that the pod is crashlooping because it is getting OOM killed. We have bumped the memory a couple of times, but it still runs out of memory. Since our cluster is a multi-tenancy Kubernetes cluster, I am afraid that as we add new services, it will keep asking for more and more memory.

Hope this provides the information you were looking for; otherwise I will be more than happy to provide any other information that can assist with troubleshooting this further. I do have a hunch that setting the --audit-match-kind-only=true flag will help for now.

sozercan commented 3 years ago

@abhinav454 Depending on the number of resources you are trying to audit in your cluster (# of pods, configmaps, etc.), you can also set --audit-chunk-size. This processes the audit in smaller chunks instead of one big chunk, which might reduce memory consumption.

How many constraint templates and constraints per template do you have deployed in your cluster? And what is the # of resources you are trying to audit (like # of pods)?

abhi-kapoor commented 3 years ago

> @abhinav454 Depending on the number of resources you are trying to audit in your cluster (# of pods, configmaps, etc.), you can also set --audit-chunk-size. This processes the audit in smaller chunks instead of one big chunk, which might reduce memory consumption.

Ohhh, that would also be helpful 🙇‍♂️

> How many constraint templates and constraints per template do you have deployed in your cluster?

We only have a single ConstraintTemplate with a single constraint as of now. This was really just a POC and we were planning to add more, but then we ran into these issues. I will set this flag as well and report back on how the pod is doing.

abhi-kapoor commented 3 years ago

@ritazh @sozercan I strongly believe that setting the --audit-match-kind-only=true flag will help resolve some of the issues we are facing. However, we use the Helm chart to deploy this, and it seems that passing this argument is not yet supported in the template: https://github.com/open-policy-agent/gatekeeper/blob/aa20de6acc0f26943305483271051e9317c2c6ec/charts/gatekeeper/templates/gatekeeper-audit-deployment.yaml#L47

I was about to open a PR to add support for that, but then realized that the Helm charts are built automatically using helmify, which already has the change: https://github.com/abhinav454/gatekeeper/blob/aa20de6acc0f26943305483271051e9317c2c6ec/cmd/build/helmify/kustomize-for-helm.yaml#L107

Does that mean that the next version will allow us to pass this argument? If so, any timeline for this would be much appreciated 🙏 🙇
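
If the chart does pick this up, passing the flag should reduce to a Helm value override. A hypothetical values.yaml snippet, assuming the eventual value names mirror the other audit options (the exact names depend on the released chart version):

    # values.yaml -- illustrative only; confirm value names against the released chart
    auditInterval: 60
    constraintViolationsLimit: 20
    auditFromCache: false
    auditChunkSize: 500
    auditMatchKindOnly: true

which the chart would then render into the corresponding --audit-* arguments on the audit Deployment.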

sozercan commented 3 years ago

@abhinav454 That's right, it looks like it was added in #1245 and is in the staging chart. It'll be available in the Helm repo when we cut the next minor version release.

abhi-kapoor commented 3 years ago

> @abhinav454 That's right, it looks like it was added in #1245 and is in the staging chart. It'll be available in the Helm repo when we cut the next minor version release.

Thank you for your quick response. Will add this flag as soon as it is available. That should address the issues we are facing 👍

ritazh commented 3 years ago

> I strongly believe that setting the --audit-match-kind-only=true flag will help resolve some of the issues we are facing.

Should we reconsider setting this value to true by default? @maxsmythe I know you had objections to this, but given that this is a recurring issue and this flag clearly helps, let's revisit it.

maxsmythe commented 3 years ago

Have the arguments changed?

If this flag works for them now, it may stop working if they add a constraint that doesn't match against kind, or if they add another constraint against a different kind. They did mention they wanted to add more constraints. Users should opt in to such a limitation.

Also, we are not sure yet whether --audit-match-kind-only will address the issue. Hopefully it will, but it fails in the case where clusters have large numbers of a given kind (e.g. lots of Pods). If that is the case, then a solution like chunking will be needed. Chunking should also be resilient to any set of inbound constraints, regardless of their contents. Taking a look at the code, memory usage (with a single constraint) should be proportional to the number of resources associated with the most populous kind.
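
To make that failure mode concrete, here is a hypothetical constraint (illustrative only, not from this cluster) that matches on a label selector with no spec.match.kinds; with such a constraint in place, audit can no longer narrow the set of kinds it lists, so --audit-match-kind-only stops helping:

    apiVersion: constraints.gatekeeper.sh/v1beta1
    kind: K8sRequiredLabels
    metadata:
      name: everything-must-have-owner-label   # hypothetical example
    spec:
      enforcementAction: dryrun
      match:
        # no kinds matcher here, so every kind is in scope for audit
        labelSelector:
          matchExpressions:
            - key: team
              operator: Exists
      parameters:
        labels: ["owner"]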

maxsmythe commented 3 years ago

@abhinav454 Can I ask:

maxsmythe commented 3 years ago

Also, if you could copy/paste your test constraint, I think that'd be interesting to look at.

maxsmythe commented 3 years ago

@abhinav454 Also, taking a look at your graph, I notice memory use seems to spike every few days even though your audit cycle is every 60 seconds.

Is the graph accurate? How often are you OOMing? If the OOMing is sporadic (i.e. the pod can run without crashing for, say, > 10 minutes), then the OOMing wouldn't be able to be explained by normal audit behavior.

maxsmythe commented 3 years ago

Mostly those major spikes in the graph are interesting to me; growth over time could be explained by an increasing number of resources on the cluster.

abhi-kapoor commented 3 years ago

@maxsmythe Thank you for looking into this. I hope the below answers some of the questions you might have:

We bumped the gatekeeper-audit limits even further using the VPA recommendations, and the pod hasn't restarted or crashlooped in the last 4 days. The first time we used the VPA to get recommendations and bumped the limits accordingly, the pod was still getting OOM killed. Since the second time we did this exercise, it has been more stable.

We will be implementing both auditChunkSize and --audit-match-kind-only=true.

Currently, we are not using Gatekeeper very heavily and have implemented it only as a POC, so we have a single ConstraintTemplate and only 1 rule, which is of kind K8sRequiredLabels and applies to Namespaces. For your reference, you can see the rule below:

apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sRequiredLabels
metadata:
  name: ns-must-have-name-label
spec:
  # operate in audit mode. ie do not enforce admission
  enforcementAction: dryrun
  match:
    kinds:
      - apiGroups: [""]
        kinds: ["Namespace"]
  parameters:
    labels: ["name"]

We haven't decided on any other rules yet; we just wanted to try out a single rule such as the one above. The cluster is a large, shared multi-tenancy cluster that runs multiple tenants, and as we onboard more tenants, more API resources will be added to the cluster.

maxsmythe commented 3 years ago

Thanks for the extra info!

Can you give us:

That data will make it easier to distinguish between the following scenarios:

Which would be super helpful, as it would let us know whether there is a performance bug we should be targeting to make ourselves more memory efficient.

cnwaldron commented 3 years ago

I ran into this same issue and implemented the additional arg --audit-match-kind-only=true. That didn't stop the OOMKilled pods. I added --audit-chunk-size=500 and that did the trick.

sozercan commented 3 years ago

@cnwaldron glad chunk size worked for you! just curious, did you have any constraints that didn't have match kind defined?

cnwaldron commented 3 years ago

@sozercan All constraints had match kind defined

abhi-kapoor commented 3 years ago

@maxsmythe After setting the --audit-chunk-size=500 flag, it has been working fine. In order to assist with troubleshooting any cache/buffer issues:

naveen210121 commented 3 years ago

Hi,

We are also facing the same issue: every 5-10 minutes the gatekeeper pod is getting OOMKilled and going into CrashLoopBackOff.

Configuration:

GKE version: 1.18 (GCP GKE Kubernetes cluster)
Gatekeeper version: gatekeeper:v3.1.0-beta.2

    resources:
      limits:
        cpu: 1500m
        memory: 1500Mi
      requests:
        cpu: 800m
        memory: 800Mi

There are a total of ~800 pods running in the cluster.

I have verified the node capacity; the nodes have plenty of free memory, so pods can scale up to their limits. To be clear, the pod is not crashing because of limit overcommit.

@sozercan, @maxsmythe, @ritazh, @abhinav454: Please suggest suitable requests/limits values for the gatekeeper workload.

maxsmythe commented 3 years ago

@abhinav454 Thank you for the data!

@naveen-mindtree Unfortunately it's hard to come up with scaling recommendations for audit memory usage, as it depends on:

A quick way to figure out memory requirements experimentally is to keep doubling the memory limit until the pod becomes stable; then you can scale back to only the memory the pod actually needs (with some overhead for growth).
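
As a purely illustrative example of that approach (the numbers below are placeholders, not recommendations):

    # Round 1: OOMKilled with a 1500Mi limit  -> double it
    # Round 2: stable at 3000Mi, observed peak usage well below the limit
    # Final:   trim back to the observed peak plus headroom for growth
    resources:
      requests:
        memory: 2304Mi
      limits:
        memory: 2304Mi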

naveen210121 commented 3 years ago

Thank you so much @maxsmythe

So it will be a trial-and-error approach. Thanks again.