Policy Server: reduce memory usage

flavio commented 1 year ago

While setting up Kubewarden at a demo booth of SUSECon we noticed that policy-server was consuming more memory compared to the other workloads defined inside of the cluster.

This lead to some quick investigation that resulted in this fix. We however decided to schedule more time to look into other improvements.

flavio commented 1 year ago

I've started looking into this issue during the last days.

TLDR: I think we can close this issue because we do not have high memory usage. There are some changes with regards to our architecture that we could do, but these would reduce the request per seconds that can be processed by a single instance of Policy Server.

The environment

I've ran the tests against Kubewarden 1.7.0-rc3. The tests have been done running a Policy Server with replicaSize 1.

Everything has been installed from our helm charts. I've enabled the recommended policies of the kubewarden-default chart, plus I've added the ones used by our load-testing suite.

It's important to highlight I've not loaded any of the new "vanilla WASI" policies introduced by 1.7.0. The numbers shown inside of the memory reduction PR were skewed by the usage of the go-wasi-template policy. This is a policy built with the official Go compiler. It's huge (~20 Mb) and provides a distorted picture of our memory usage, see this comment I've just added on the closed PR.

To recap, I've been testing a Policy Server hosting 10 policies:

Contents of policies.yml

```yaml clusterwide-do-not-run-as-root: namespacedName: Namespace: '' Name: do-not-run-as-root url: ghcr.io/kubewarden/policies/user-group-psp:v0.4.9 policyMode: monitor allowedToMutate: true settings: run_as_group: rule: RunAsAny run_as_user: rule: MustRunAsNonRoot supplemental_groups: rule: RunAsAny clusterwide-do-not-share-host-paths: namespacedName: Namespace: '' Name: do-not-share-host-paths url: ghcr.io/kubewarden/policies/hostpaths-psp:v0.1.9 policyMode: monitor allowedToMutate: false settings: allowedHostPaths: - pathPrefix: "/tmp" readOnly: true clusterwide-drop-capabilities: namespacedName: Namespace: '' Name: drop-capabilities url: ghcr.io/kubewarden/policies/capabilities-psp:v0.1.13 policyMode: monitor allowedToMutate: true settings: required_drop_capabilities: - ALL clusterwide-no-host-namespace-sharing: namespacedName: Namespace: '' Name: no-host-namespace-sharing url: ghcr.io/kubewarden/policies/host-namespaces-psp:v0.1.6 policyMode: monitor allowedToMutate: false settings: allow_host_ipc: false allow_host_network: false allow_host_pid: false clusterwide-no-privilege-escalation: namespacedName: Namespace: '' Name: no-privilege-escalation url: ghcr.io/kubewarden/policies/allow-privilege-escalation-psp:v0.2.6 policyMode: monitor allowedToMutate: true settings: default_allow_privilege_escalation: false clusterwide-no-privileged-pod: namespacedName: Namespace: '' Name: no-privileged-pod url: ghcr.io/kubewarden/policies/pod-privileged:v0.2.7 policyMode: monitor allowedToMutate: false settings: namespaced-default-policy-echo: namespacedName: Namespace: default Name: policy-echo url: ghcr.io/kubewarden/policies/echo:latest policyMode: protect allowedToMutate: false settings: {} namespaced-default-psp-apparmor: namespacedName: Namespace: default Name: psp-apparmor url: ghcr.io/kubewarden/policies/apparmor-psp:v0.1.9 policyMode: protect allowedToMutate: false settings: allowed_profiles: - runtime/default namespaced-default-psp-capabilities: namespacedName: Namespace: default Name: psp-capabilities url: ghcr.io/kubewarden/policies/capabilities-psp:v0.1.9 policyMode: protect allowedToMutate: true settings: allowed_capabilities: - CHOWN required_drop_capabilities: - NET_ADMIN namespaced-default-psp-user-group: namespacedName: Namespace: default Name: psp-user-group url: ghcr.io/kubewarden/policies/user-group-psp:v0.2.0 policyMode: protect allowedToMutate: true settings: run_as_group: rule: RunAsAny run_as_user: rule: RunAsAny supplemental_groups: rule: RunAsAny ```

20 modules) takes 6 GB of virtual memory. The reason that leads to this memory usage boils down to how wasmtime JIT optimizes the module to achieve the highest execution speed. This is all documented [here](https://docs.rs/wasmtime/latest/wasmtime/struct.Config.html#should-you-use-static-or-dynamic-memory) and [here](https://docs.rs/wasmtime/latest/wasmtime/struct.Config.html#how-big-should-the-guard-be). Long story short: by default a `wasmtime::Engine` performs a static allocation of the module memory because it provides better performance. On 64 bit architectures each module is assigned a 4GB address space, on top of that an extra 2 GB address space is added as a guard, to further optimize the compilation of the Wasm instructions to native code. Having a process with high virtual memory is not unusual and is not an issue. We can safely ignore that. As a matter of fact, the process virtual memory is not even shown by the metric server because is not relevant. **Update:** [this](https://gist.github.com/flavio/94a80b7b2ce2bf318ba8968288751a7d) is the output produced by running `smap` against the demo code I wrote. The one that instantiated one Wasm module using wasmtime. ## Further work The current code loads all the Wasm modules, optimizes them and keeps them in memory inside of a list of `wasmtime::Module` objects. This list is then shared with all the workers that are part of our worker pool. Each worker then creates one [`wapc::WapcHost`](https://docs.rs/wapc/1.1.0/wapc/struct.WapcHost.html) instance per policy to be evaluated. ### Reduce the number of `wapc::WapcHost` A user might define 10 policies, all based on the same Wasm module. Each one of them with slightly different configuration values. This will result in only 1 instance of `wasmtime::Module`. However, this will lead to 10 `wapc::WapcHost` instances. Theoretically we could optimize for this kind of scenarios to have just one instance of `wapc::WapcHost`; however that would pose some challenges with the mechanism we use to identify which policy is performing a host callback. Right now we leverage the unique ID assigned to each `wapc::WapcHost` to understand which policy is making a host callback and take decisions. Conflating multiple policies to have the same ID would cause visibility issues (it would not be possible to understand which policy generated a trace event) and security issue (two policies might have different settings with regards to the Kubernetes resources they are granted access to). We can think of a way to reduce the number of instances of `wapc::WapcHost` inside of the workers, without having to conflate everything ### Spawn `wapc::WapcHost` on demand Each worker has one `wapc::WapcHost` ready for each policy that has been defined by the user. Having this object ready allows us to have a quick evaluation of the incoming request. I've done a test and changed the current code to create the `wapc::WapcHost` instances on demand, and just drop them once the evaluation is done. This is summarized by these sequence charts

Current architecture

```mermaid sequenceDiagram Webhook->>+Worker: process admission request Worker-->>Worker: Wasm eval Worker->>-Webhook: evaluation result ```

New architecture

```mermaid sequenceDiagram Webhook->>+Worker: process admission request Worker-->>Worker: create wapcHost Worker-->>Worker: Wasm eval Worker->>-Webhook: evaluation result Worker-->>Worker: delete WapcHost ```

This approach lead to a lower memory usage, but it caused a massive drop in terms of performances. The request per seconds went from ~1100 down to ~90. The amount of time required to setup a `wapc::WapcHost` instance is causing a this performance drop. Right now, this quick POC I wrote leverages the public API of wapc. That means we create a [`wasmtime::Instance`](https://docs.rs/wasmtime/latest/wasmtime/struct.Instance.html) object from scratch, on each evaluation. I think we can improve the performances by saving some computation via the usage of a [`wasmtime::InstancePre`](https://docs.rs/wasmtime/latest/wasmtime/struct.InstancePre.html). Internally the `wasmtime-provider` crate already uses it (see [this](https://github.com/wapc/wapc-rs/blob/538a604d15f95cf1423e4108277ebe79b09a5388/crates/wasmtime-provider/src/lib.rs#L159) code I've added), we could try exposing that to the consumers of the library. However, I think it will be hard to go from the 90 RPS back to 1100 RPS with this change. ### Have a dynamic worker pool Currently we size of the worker pool can be either specified by the user or can be calculated starting from the number of CPU cores. We could allow the user to set a minimum and a maximum number of active workers. We could then grow and shrink the pool depending on the amount of incoming requests to be evaluated. This would allow an idle server to have a smaller memory consumption. ## Summary I think the current memory usage is fine. We can do further optimizations, but I think right now we have more important issues to address. Given the time, I think I would definitely try to investigate how we can reuse `wapc::WapcHost` instances and how we can implement a dynamic worker pool. I think going to the on-demand model would be good for memory usage, but it will never provide the high throughput of requests that we currently offer.

flavio commented 1 year ago

Another possible activity for the future: rewrite the load-tests to make use of k6.

Currently the load-tests are written with locus, which is doing a fine job. However, if we were to migrate to k6 we could then have a performance testing environment made of:

prometheus: to collect node and pods resource usage
k6: to generate load
grafana: to inspect the data collected

The most interesting - and appealing - advantage of this solution is the ability to correlate the load generated by the k6 suite and the resource usage. Right now this analysis required quite some manual work from my side. It would be great to have something that produces better and more consistent results.

IMHO this should be the 1st thing we should address

jvanz commented 1 year ago

I agree with @flavio's arguments and we can leave this enhancements for the future. Maybe we can keep the improvements in the performance tests in the queue and leave the further changes for the future. Therefore, when we found any possible issue, we bring those improvements to the table again.

viccuad commented 1 year ago

This is incredible work, thanks for the insights and the learning pointers.

From what is presented, I agree on the priorities, and I would also like to tackle a dynamic worker pool and trying wasmtime::InstancePre, yet I prefer to punt on this. I would start with refactoring the load-tests with k6.

ish-xyz commented 10 months ago

We are using Kubewarden in some of our staging clusters and with 61 cluster-wide policies deployed the policy-server is using a lot of memory (40-50GB). It seems like that the memory used from the policy-server is linearly scaled with the number of policies and seems a bit excessive.

Any suggestion or optimization I could do on my side?

flavio commented 10 months ago

Can you provide some details about your usage pattern, this could help us fine tune the optimization ideas we have.

Some questions:

How did you get this metric, is that via prometheus/cAdvisor?
Do you happen to have the same policy deployed multiple times, but with different settings?
Have you configured the number of workers used by the policy server instance?
Do you happen to be using the experimental Kyverno policy?
~~What is the percentage of ClusterAdmissionPolicy vs AdmissionPolicy? Do you have the same (policy + settings) policy deployed inside of multiple namespaces via AdmissionPolicy?~~ Scratch it, you said you're using only ClusterAdmissionPolicies
Do you have context aware policies loaded? If so, what kind of resources are they accessing and how many instances of these resources do you have in the cluster?
Is the memory consumption increasing over the time? I wonder if there's any policy that is leaking memory

flavio commented 10 months ago

@ish-xyz I'm currently working on a fix for this issue. Can you please provide more information about your environment (see previous comment)?

flavio commented 10 months ago

JFYI: there's a PR under review that reduces the amount of memory consume: https://github.com/kubewarden/policy-server/pull/596

ish-xyz commented 10 months ago

@flavio How did you get this metric, is that via prometheus/cAdvisor? Yes prometheus. Do you happen to have the same policy deployed multiple times, but with different settings? That's correct, we use kubewarden as PSP replacement Have you configured the number of workers used by the policy server instance? We are using the default settings here Do you happen to be using the experimental Kyverno policy? No Do you have context aware policies loaded? Unsure, need to double check. I'll get back to you on this one Is the memory consumption increasing over the time? Yes, it looks like it I wonder if there's any policy that is leaking memory Could be, I'lll double check this

flavio commented 9 months ago

@ish-xyz we've finished a major refactor of the policy-server codebase (see https://github.com/kubewarden/policy-server/pull/596#issuecomment-1856344986).

Beginning of next year (Jan 2024), we are going to release a new version of Kubewarden. In the meantime it would be great if you could give a quick test to the changes by consuming the latest version of the policy-server image.

vincebrannon commented 8 months ago

@ish-xyz EDIT Ignore this as there was confusion between me and Flavio. The issue is with kubeapi and not policy server.. The memory increase is due to the replicasets count increasing by thousands wich we cannot explain.

Flavio pointed me at this issue, we have another support case with this problems. Kubewarden when installed spams replicasets and floods the apiserver eating up memory. HE suggested I collect info from the customer and add it here:

• How did you get this metric, is that via prometheus/cAdvisor? If you are referring to the number of replicasets, it is the one reported in Rancher's UI. • Do you happen to have the same policy deployed multiple times, but with different settings? No • Have you configured the number of workers used by the policy server instance? It is running with 3 replicas • Do you happen to be using the experimental Kyverno policy? No • Do you have context aware policies loaded? If so, what kind of resources are they accessing and how many instances of these resources do you have in the cluster? I do not think so. The policies that we have are the following:

NAME POLICY SERVER MUTATING BACKGROUNDAUDIT MODE OBSERVED MODE STATUS AGE disallow-service-loadbalancer default false true protect protect active 26d do-not-run-as-root default true true protect protect active 26d do-not-share-host-paths default false true protect protect active 26d drop-capabilities default true true monitor monitor active 23d kubewarden-mandatorynamespacelabels-policy default true true protect protect active 26d no-host-namespace-sharing default false true protect protect active 26d no-privilege-escalation default true true protect protect active 26d no-privileged-pod default false true protect protect active 26d

All of them have been provided by Rancher, the kubewarden-mandatorynamespacelabels-policy is not using any regex, just checking there is a projectId label to prevent namespaces from being created outside projects.

settings:
    mandatory_labels:
      - [field.cattle.io/projectId](https://field.cattle.io/projectId)

• Is the memory consumption increasing over the time? I wonder if there's any policy that is leaking memory. No, we have not seen the memory consumption increasing overtime in any of the Kubewarden pods. For instance, every Policy server pod is consuming 700M flat. What happened was related to apiserver in the control plane. I will attach some screenshots from the CPU, memory and disk utilization around the first time it happened. CPU disk memory

viccuad commented 8 months ago

To clarify the last comment @vincebrannon, it seems that the culprit was misconfigured mutating policies that were interacting fighting with a k8s controller, therefore ReplicaSets were continously created. This has been fixed for default policies (that now only target pods) in Kubewarden 1.10.0. Also, the docs have been expanded with an explanation here.

viccuad commented 8 months ago

For general memory consumption, the proposed changes (and more) are now included in the newly released Kubewarden 1.10.0. This release reduces memory footprint, and more importantly, the memory consumption is now constant regardless of number of thread workers or horizontal scalling.

You can read more about it in the 1.10.0 release blogpost.

Closing this card then. Please don't hesitate to reopen, comment here, or open any other card if this becomes an issue again!

kubewarden / kubewarden-controller

Policy Server: reduce memory usage #528

The environment