Closed: flavio closed this issue 8 months ago
I've started looking into this issue over the last few days.
TLDR: I think we can close this issue because we do not have high memory usage. There are some changes to our architecture that we could make, but these would reduce the requests per second that can be processed by a single instance of Policy Server.
I've run the tests against Kubewarden 1.7.0-rc3. The tests have been done running a Policy Server with replicaSize 1.
Everything has been installed from our helm charts. I've enabled the recommended policies of the kubewarden-defaults chart, plus I've added the ones used by our load-testing suite.
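For reference, this setup should roughly correspond to a values excerpt like the following for the kubewarden-defaults chart (the key names are assumptions based on the chart's values.yaml and may differ between chart versions):

```yaml
# Assumed kubewarden-defaults values: enable the recommended policies and run a
# single Policy Server replica. Verify key names against the chart's values.yaml.
recommendedPolicies:
  enabled: true
policyServer:
  replicas: 1
```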
It's important to highlight that I've not loaded any of the new "vanilla WASI" policies introduced by 1.7.0. The numbers shown inside the memory reduction PR were skewed by the usage of the go-wasi-template policy. This is a policy built with the official Go compiler; it's huge (~20 MB) and provides a distorted picture of our memory usage, see this comment I've just added on the closed PR.
To recap, I've been testing a Policy Server hosting 10 policies:
policies.yml
Another possible activity for the future: rewrite the load-tests to make use of k6.
Currently the load-tests are written with Locust, which is doing a fine job. However, if we were to migrate to k6 we could then have a performance testing environment made of:
The most interesting - and appealing - advantage of this solution is the ability to correlate the load generated by the k6 suite with the resource usage. Right now this analysis requires quite some manual work on my side. It would be great to have something that produces better and more consistent results.
IMHO this should be the first thing we address.
I agree with @flavio's arguments and we can leave these enhancements for the future. Maybe we can keep the performance-test improvements in the queue and leave the further changes for later. Then, if we find any issue, we can bring those improvements back to the table.
This is incredible work, thanks for the insights and the learning pointers.
From what is presented, I agree on the priorities, and I would also like to tackle a dynamic worker pool and try wasmtime::InstancePre, yet I prefer to punt on this.
I would start with refactoring the load-tests with k6.
We are using Kubewarden in some of our staging clusters and with 61 cluster-wide policies deployed the policy-server is using a lot of memory (40-50 GB). It seems that the memory used by the policy-server scales linearly with the number of policies, which seems a bit excessive.
Any suggestion or optimization I could do on my side?
Can you provide some details about your usage pattern? This could help us fine-tune the optimization ideas we have.
Some questions:
• ClusterAdmissionPolicy vs AdmissionPolicy?
• Do you have the same (policy + settings) policy deployed inside of multiple namespaces via AdmissionPolicy, or as multiple ClusterAdmissionPolicies? (see the sketch below)
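For context, the distinction matters because an AdmissionPolicy is namespaced: deploying the same module in many namespaces creates many policy instances inside a single Policy Server, while a ClusterAdmissionPolicy is defined once for the whole cluster. A minimal sketch of the two resources (the module reference, version and names are illustrative, not taken from this thread):

```yaml
# Cluster-wide: a single policy instance for the whole cluster.
apiVersion: policies.kubewarden.io/v1
kind: ClusterAdmissionPolicy
metadata:
  name: no-privileged-pod
spec:
  module: registry://ghcr.io/kubewarden/policies/pod-privileged:v0.3.2  # illustrative reference
  rules:
    - apiGroups: [""]
      apiVersions: ["v1"]
      resources: ["pods"]
      operations: ["CREATE"]
  mutating: false
---
# Namespaced: the same module deployed in each namespace means one policy
# instance per namespace inside the Policy Server.
apiVersion: policies.kubewarden.io/v1
kind: AdmissionPolicy
metadata:
  name: no-privileged-pod
  namespace: team-a
spec:
  module: registry://ghcr.io/kubewarden/policies/pod-privileged:v0.3.2  # illustrative reference
  rules:
    - apiGroups: [""]
      apiVersions: ["v1"]
      resources: ["pods"]
      operations: ["CREATE"]
  mutating: false
```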
@ish-xyz I'm currently working on a fix for this issue. Can you please provide more information about your environment (see previous comment)?
JFYI: there's a PR under review that reduces the amount of memory consumed: https://github.com/kubewarden/policy-server/pull/596
@flavio
• How did you get this metric, is that via prometheus/cAdvisor? Yes, Prometheus.
• Do you happen to have the same policy deployed multiple times, but with different settings? That's correct, we use Kubewarden as a PSP replacement.
• Have you configured the number of workers used by the policy server instance? We are using the default settings here.
• Do you happen to be using the experimental Kyverno policy? No.
• Do you have context aware policies loaded? Unsure, need to double check. I'll get back to you on this one.
• Is the memory consumption increasing over the time? Yes, it looks like it.
• I wonder if there's any policy that is leaking memory. Could be, I'll double check this.
@ish-xyz we've finished a major refactor of the policy-server codebase (see https://github.com/kubewarden/policy-server/pull/596#issuecomment-1856344986).
Beginning of next year (Jan 2024), we are going to release a new version of Kubewarden. In the meantime it would be great if you could give the changes a quick test by consuming the latest version of the policy-server image.
@ish-xyz EDIT: Ignore this, as there was confusion between me and Flavio. The issue is with the kube-apiserver and not the policy-server. The memory increase is due to the ReplicaSet count increasing by thousands, which we cannot explain.
Flavio pointed me at this issue; we have another support case with this problem. After Kubewarden is installed, ReplicaSets get spammed and flood the apiserver, eating up memory. He suggested I collect info from the customer and add it here:
• How did you get this metric, is that via prometheus/cAdvisor? If you are referring to the number of replicasets, it is the one reported in Rancher's UI.
• Do you happen to have the same policy deployed multiple times, but with different settings? No.
• Have you configured the number of workers used by the policy server instance? It is running with 3 replicas.
• Do you happen to be using the experimental Kyverno policy? No.
• Do you have context aware policies loaded? If so, what kind of resources are they accessing and how many instances of these resources do you have in the cluster? I do not think so. The policies that we have are the following:
NAME                                         POLICY SERVER   MUTATING   BACKGROUNDAUDIT   MODE      OBSERVED MODE   STATUS   AGE
disallow-service-loadbalancer                default         false      true              protect   protect         active   26d
do-not-run-as-root                           default         true       true              protect   protect         active   26d
do-not-share-host-paths                      default         false      true              protect   protect         active   26d
drop-capabilities                            default         true       true              monitor   monitor         active   23d
kubewarden-mandatorynamespacelabels-policy   default         true       true              protect   protect         active   26d
no-host-namespace-sharing                    default         false      true              protect   protect         active   26d
no-privilege-escalation                      default         true       true              protect   protect         active   26d
no-privileged-pod                            default         false      true              protect   protect         active   26d
All of them have been provided by Rancher. The kubewarden-mandatorynamespacelabels-policy is not using any regex; it just checks that a projectId label is present, to prevent namespaces from being created outside of projects. Its settings are shown below (a manifest sketch follows at the end of this comment):
settings:
  mandatory_labels:
    - field.cattle.io/projectId
• Is the memory consumption increasing over the time? I wonder if there's any policy that is leaking memory. No, we have not seen the memory consumption increasing over time in any of the Kubewarden pods. For instance, every Policy Server pod is consuming a flat 700M. What happened was related to the apiserver in the control plane. I will attach some screenshots of the CPU, memory and disk utilization around the first time it happened.
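For reference, a sketch of how the settings above attach to the policy resource (the module reference and version are hypothetical, not necessarily the exact ones shipped by Rancher):

```yaml
apiVersion: policies.kubewarden.io/v1
kind: ClusterAdmissionPolicy
metadata:
  name: kubewarden-mandatorynamespacelabels-policy
spec:
  module: registry://ghcr.io/kubewarden/policies/safe-labels:v0.1.0  # hypothetical module reference
  rules:
    - apiGroups: [""]
      apiVersions: ["v1"]
      resources: ["namespaces"]
      operations: ["CREATE"]
  mutating: true
  settings:
    mandatory_labels:
      - field.cattle.io/projectId
```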
To clarify the last comment from @vincebrannon: it seems that the culprit was misconfigured mutating policies that were fighting with a k8s controller, so ReplicaSets were continuously created. This has been fixed for the default policies (they now only target pods) in Kubewarden 1.10.0. Also, the docs have been expanded with an explanation here.
For general memory consumption, the proposed changes (and more) are now included in the newly released Kubewarden 1.10.0. This release reduces the memory footprint, and more importantly, the memory consumption is now constant regardless of the number of worker threads or of horizontal scaling.
You can read more about it in the 1.10.0 release blogpost.
Closing this card then. Please don't hesitate to reopen, comment here, or open any other card if this becomes an issue again!
While setting up Kubewarden at a demo booth at SUSECon we noticed that the policy-server was consuming more memory than the other workloads defined inside the cluster.
This led to some quick investigation that resulted in this fix. We decided, however, to schedule more time to look into other improvements.