kyverno / kyverno

Cloud Native Policy Management
https://kyverno.io
Apache License 2.0

[Bug] Kyverno creates too many admissionReports #6462

Closed: L1ghtman2k closed this issue 1 year ago

L1ghtman2k commented 1 year ago

Kyverno Version

1.9.0

Description

Kyverno creates too many AdmissionReports, resulting in etcd running out of space and the cluster becoming unresponsive.

This might somehow be related to one of the ClusterPolicies on our clusters matching on *.

More details in slack thread.

Current workaround: Setting admissionReports=false
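As a rough sketch of that workaround (assuming the 1.9 Helm chart passes extra container flags through an extraArgs list; the exact value name may differ between chart versions):

# hedged sketch: disable admission reports via the Kyverno container flag
helm upgrade kyverno kyverno/kyverno -n kyverno \
  --reuse-values \
  --set "extraArgs={--admissionReports=false}"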

Slack discussion

https://kubernetes.slack.com/archives/CLGR9BJU9/p1677785359616429

Troubleshooting

eddycharly commented 1 year ago

What is your cluster size? How many reports do you have?

L1ghtman2k commented 1 year ago

I have included this info in the slack thread, but just in case:

Also, some info about the cluster:
number of nodes: 25
k8s version: 1.22
number of pods: 760

Kyverno info:
version: 1.9
deployed via Helm
all policies are in audit mode
there is a policy that matches */* (the initial OOM was caused while we were trying to adjust proper requests/limits)

Limits when the OOM happened:
kyverno:
  limits_memory: 12144Mi
  requests_cpu: 1200m
  requests_memory: 6072Mi

Adjusted limits:
kyverno:
  limits_memory: 6072Mi
  requests_cpu: 800m
  requests_memory: 1546Mi
L1ghtman2k commented 1 year ago
aibek@DESKTOP-B0KD043:~/GolandProjects/cluster-despina$ k get clusterpolicies.kyverno.io -A
NAME                             BACKGROUND   VALIDATE ACTION   READY   AGE
deny-duplicate-rotator-targets   true         audit             true    23d
deny-modify-platform-label       false        audit             true    23d
disallow-capabilities            true         audit             true    23d
disallow-capabilities-strict     true         audit             true    23d
disallow-host-namespaces         true         audit             true    23d
disallow-host-path               true         audit             true    23d
disallow-host-ports              true         audit             true    23d
disallow-host-process            true         audit             true    23d
disallow-privilege-escalation    true         audit             true    23d
disallow-privileged-containers   true         audit             true    23d
disallow-proc-mount              true         audit             true    23d
disallow-selinux                 true         audit             true    23d
require-pod-probes               false        audit             true    23d
require-run-as-non-root-user     true         audit             true    23d
require-run-as-nonroot           true         audit             true    23d
restrict-apparmor-profiles       true         audit             true    23d
restrict-seccomp                 true         audit             true    23d
restrict-seccomp-strict          true         audit             true    23d
restrict-sysctls                 true         audit             true    23d
restrict-volume-types            true         audit             true    23d
L1ghtman2k commented 1 year ago
Total 3239758 key-value pairs
Entries by 'KEY GROUP' (total 6.1 GB):
+--------------------------------------------------------------------------------+--------------------------------+--------+
|                                   KEY GROUP                                    |              KIND              |  SIZE  |
+--------------------------------------------------------------------------------+--------------------------------+--------+
| /registry/kyverno.io/admissionreports/gateway                                  | AdmissionReport                | 3.2 GB |
| /registry/kyverno.io/admissionreports/agena                                    | AdmissionReport                | 973 MB |
| /registry/kyverno.io/admissionreports/cassini                                  | AdmissionReport                | 580 MB |
| /registry/kyverno.io/admissionreports/mlops                                    | AdmissionReport                | 408 MB |
| /registry/kyverno.io/admissionreports/tigera-operator                          | AdmissionReport                | 154 MB |
| /registry/kyverno.io/admissionreports/titan                                    | AdmissionReport                | 98 MB  |
| /registry/kyverno.io/admissionreports/calico-system                            | AdmissionReport                | 94 MB  |
| /registry/kyverno.io/admissionreports/secrets-rotator-operator                 | AdmissionReport                | 80 MB  |
| /registry/kyverno.io/admissionreports/resource-manager                         | AdmissionReport                | 75 MB  |
| /registry/kyverno.io/admissionreports/ingress-nginx                            | AdmissionReport                | 60 MB  |
| /registry/kyverno.io/admissionreports/iam                                      | AdmissionReport                | 58 MB  |
| /registry/kyverno.io/clusteradmissionreports                                   | ClusterAdmissionReport         | 38 MB  |
| /registry/kyverno.io/admissionreports/vas                                      | AdmissionReport                | 30 MB  |
| /registry/kyverno.io/admissionreports/gts                                      | AdmissionReport                | 21 MB  |
| /registry/kyverno.io/admissionreports/gl-metering                              | AdmissionReport                | 13 MB  |
| /registry/kyverno.io/admissionreports/global-trade-manager                     | AdmissionReport                | 12 MB  |
| /registry/kyverno.io/admissionreports/quake                                    | AdmissionReport                | 11 MB  |
| /registry/kyverno.io/admissionreports/lh-api-orch                              | AdmissionReport                | 10 MB  |
| /registry/kyverno.io/admissionreports/hpcaas                                   | AdmissionReport                | 10 MB  |
---
eddycharly commented 1 year ago

It looks like Kyverno cannot clean up admission reports for some reason. The size in the gateway namespace is suspiciously high.

L1ghtman2k commented 1 year ago

I have trouble getting the entire contents of the gateway namespace because kubectl get times out, but here are the objects returned before the timeout stops the get:

$ sort yqgateway.yaml | uniq -c
   9567 ConfigMap
     15 CronJob
      8 Deployment
     14 EndpointSlice
     12 Endpoints
     75 HorizontalPodAutoscaler
     24 Ingress
      1 Job
 174448 Lease
      1 NetworkPolicy
     14 Pod
      1 PodDisruptionBudget
      2 Scale
maxwell-gregory commented 1 year ago

I have also noticed massive amounts of admission reports in a few of my Kyverno installations. This is on v1.8.5 (screenshot attached).

I have tried some of the troubleshooting shown in the documentation:

If I just pull the cadmr resources, I randomly get an error partway through generating the table: Error from server (InternalError): an error on the server ("error trying to reach service: tunnel disconnect") has prevented the request from succeeding (get clusteradmissionreports.kyverno.io)

Looks like getting the admission reports with kubectl returns different counts each time it is run. I believe it is erroring out as well.

❯ COUNT=$(kubectl get cadmr --no-headers 2> /dev/null | wc -l)
echo "number of cluster admission reports: $COUNT"
number of cluster admission reports:     2000
❯ COUNT=$(kubectl get cadmr --no-headers 2> /dev/null | wc -l)
echo "number of cluster admission reports: $COUNT"
number of cluster admission reports:    23500

The orphaned count is also all over the place, but it's probably most of them.

❯ ALL=$(kubectl get admr -A --no-headers 2> /dev/null | wc -l)
NOT_ORPHANS=$(kubectl get admr -A --no-headers -o jsonpath="{range .items[?(@.metadata.ownerReferences[0].uid)]}{.metadata.name}{'\n'}{end}" 2> /dev/null | wc -l)
echo "number of orphan admission reports: $((ALL-NOT_ORPHANS)) ($ALL - $NOT_ORPHANS)"

number of orphan admission reports: -7500 (    3000 -    10500)
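The negative number above is likely because ALL and NOT_ORPHANS come from two separate list calls that can each fail partway through. A hedged alternative that derives both counts from a single snapshot (still subject to apiserver timeouts on very large lists):

# total reports and reports with no ownerReferences, from one list call
kubectl get admr -A -o json 2>/dev/null \
  | jq '{total: (.items | length), orphans: ([.items[] | select(.metadata.ownerReferences == null)] | length)}'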

I would like to do a watch on the incoming admission reports, but I cannot keep the tunnel open for that.

❯ kubectl get admr -A -o wide -w --watch-only
Error from server (InternalError): an error on the server ("error trying to reach service: tunnel disconnect") has prevented the request from succeeding (get admissionreports.kyverno.io)

I think from here it would be nice to clean up the existing admission reports so I can get some breathing room and see if there is something chatty using the above command. What is the recommended approach for cleaning up the admission reports? Any objection to a kubectl delete cadmr,admr --all?

EDIT / UPDATE:

I have many installations of Kyverno and confused the version numbers. This cluster is actually running v1.8.0, which is known to have admission reporting issues. I am currently upgrading to v1.8.5. The cluster appears to be cleaning up a lot.

L1ghtman2k commented 1 year ago

The issue seems to be related to running Kyverno on large clusters (or with rules that match on *) with low or default QPS/Burst settings. Closing this for now, as it doesn't seem to be a bug but rather a misconfiguration.

UPDATE: Figured I should also update this and note that one can monitor/alert on etcd_db_total_size_in_bytes to catch this before it becomes a problem.
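For reference, a minimal way to spot-check that from the API server's metrics endpoint (assuming your kube-apiserver version still exposes etcd_db_total_size_in_bytes; newer releases publish apiserver_storage_db_total_size_in_bytes instead):

kubectl get --raw /metrics \
  | grep -E 'etcd_db_total_size_in_bytes|apiserver_storage_db_total_size_in_bytes'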

JimBugwadia commented 1 year ago

@realshuting @eddycharly - can we correlate the QPS to the ARPS (admission requests per second) to make some recommendations?

haarchri commented 1 year ago

We have the same issue in some clusters running Kyverno in combination with Crossplane. The problem on our side is that, because of this issue, the garbage collector on our EKS control plane is unable to start up. Any advice?

kubectl get --raw=/metrics | grep apiserver_storage_objects | awk '$2>100' | sort -n -k 2
# HELP apiserver_storage_objects [STABLE] Number of stored objects at the time of last check split by kind.
# TYPE apiserver_storage_objects gauge
apiserver_storage_objects{resource="clusteradmissionreports.kyverno.io"} 2.581064e+06
haarchri commented 1 year ago

We are currently running a cleanup in our cluster with:

for i in {1..10}
do
   time kubectl get --raw "/apis/kyverno.io/v1alpha2/clusteradmissionreports?limit=10000" | jq -r '.items | .[].metadata.name'  | xargs -n 1 -P 400 kubectl delete  clusteradmissionreports 2>/dev/null
done
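A simpler single-client alternative sketch, assuming it is acceptable to delete every ClusterAdmissionReport (slower than the parallel xargs approach above):

# --wait=false returns without blocking on each object; --ignore-not-found
# tolerates reports removed concurrently by the controller
kubectl delete clusteradmissionreports --all --wait=false --ignore-not-found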
eddycharly commented 1 year ago

@haarchri what size is your cluster? Do you see throttled requests in the logs?

haarchri commented 1 year ago

Small update: we are now running this pod in our cluster: https://gist.github.com/haarchri/90c5f3374686531f953430194d22643a

1000+ CRDs, 1500+ Crossplane managed resources, 600+ Pods, 300+ Jobs

eddycharly commented 1 year ago

Check for throttled requests, this could well be the cause.
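A quick way to check for that (a sketch; the namespace and deployment name below assume a default Helm install): client-go prints a warning in the Kyverno logs whenever it waits on its own rate limiter.

kubectl -n kyverno logs deploy/kyverno --since=1h 2>/dev/null \
  | grep -i "client-side throttling" | tail -n 5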

krol76 commented 1 year ago

In my opinion it's still a bug, not a misconfiguration. In our case the etcd database grew by ~3 GB. In one of the namespaces we were not able to list all AdmissionReport resources. The only way to get rid of them was to remove the CRD responsible for the AdmissionReports. To work around this bug, the only option right now is to run Kyverno without admission reports (the flag --admissionReports=false has to be added to the Kyverno container) to be sure it will not crash the cluster.

eddycharly commented 1 year ago

@krol76 what size is your cluster? Do you see throttled requests in the logs?

krol76 commented 1 year ago

@eddycharly Since we cleaned up the AdmissionReports it's 3.5 GB (previously 6.5 GB). And since we started running Kyverno without AdmissionReports, I can't see any advantage to having them enabled. Policy reports seem to be reliable (at least in my tests on a limited environment there were no differences), even though the documentation says they are based on admission events and background scans. Additionally, what I observed during tests with AdmissionReports enabled: if a request is blocked because of a policy restriction, an AdmissionReport is created and takes more time to be removed (it has no hash in the "Aggregate" column). And finally, the most time-consuming API actions (hah - the new Kyverno tracing feature!) are connected with AdmissionReports.

eddycharly commented 1 year ago

And finally, the most time-consuming API actions (hah - the new Kyverno tracing feature!) are connected with AdmissionReports

This could really be related to throttling.

chipzoller commented 1 year ago

Reopened this to investigate a solution. Folks on this thread, are all of you using at least one policy which matches on a wildcard (*)? Anyone not doing this?

JimBugwadia commented 1 year ago

Latest thread on this:

https://kubernetes.slack.com/archives/CLGR9BJU9/p1678371541095359

JimBugwadia commented 1 year ago

@eddycharly - can we try and reproduce this by setting a very low QPS?

Have we looked at the metrics to check the reported Admission Requests per second vs the configured QPS limits? Seems like with a wildcard policy we may have 2+ queries from Kyverno for each matching admission request, right?

krol76 commented 1 year ago

Folks on this thread, are all of you using at least one policy which matches on a wildcard (*)? Anyone not doing this?

Definitely NO wildcards in match statements.

haarchri commented 1 year ago

Side note: we also have the problem that GC in AWS EKS is not running anymore.

2023-03-09T16:56:33.000+01:00   E0307 15:56:33.804756 11 shared_informer.go:243] unable to sync caches for garbage collector

2023-03-09T16:56:33.000+01:00   E0307 15:56:33.804769 11 garbagecollector.go:242] timed out waiting for dependency graph builder sync during GC sync (attempt 31676)
haarchri commented 1 year ago

Looks like we are running one policy with *:

apiVersion: kyverno.io/v1
kind: Policy
metadata:
  name: compliance-common-labels-{{ $.Values.namespace }}
  namespace: {{ $.Values.namespace }}
  annotations:
    policies.kyverno.io/title: Common Labels
    policies.kyverno.io/category: Compliance
    policies.kyverno.io/severity: medium
    policies.kyverno.io/subject: Label
    policies.kyverno.io/description: Common Labels are required for all workloads on Cluster.
spec:
  # schemaValidation combined with the filter `kind: '*'` issues a validation error "invalid kind" with TokenRequest resources
  # See https://github.com/kyverno/kyverno/issues/5136#issuecomment-1370618333 and following comments
  # TODO remove with >=1.9
  schemaValidation: false
  background: false
  rules:
  - name: add-common-labels
    match:
      resources:
        kinds:
        - "*"
    mutate:
      patchStrategicMerge:
        metadata:
          labels:
haarchri commented 1 year ago

For us it's also not clear why this change came in the 1.8.x release https://github.com/kyverno/kyverno/pull/5034 - any guidance? We'd best go to 300 when running with Crossplane, like the providers in the ecosystem https://github.com/crossplane-contrib/provider-helm/pull/179 or upstream k8s https://github.com/kubernetes/kubernetes/pull/109141
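If the reduced defaults from that PR are the bottleneck, they can be raised again via container flags (a sketch; the flag names --clientRateLimitQPS and --clientRateLimitBurst are my recollection, so verify against kyverno --help for your version):

# hedged sketch: raise Kyverno's client-side rate limits via Helm extraArgs
helm upgrade kyverno kyverno/kyverno -n kyverno \
  --reuse-values \
  --set "extraArgs={--clientRateLimitQPS=300,--clientRateLimitBurst=600}"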

haarchri commented 1 year ago

Thanks to @eddycharly for opening https://github.com/kyverno/kyverno/pull/6522

eddycharly commented 1 year ago

Another thing that can help is the number of workers; we currently have 2 workers and will bump that to 10.

chipzoller commented 1 year ago

We're going to cut 1.9.2-rc1 today. Please try with that tag to see if it resolves the issue.

riuvshyn commented 1 year ago

another thing that can help is the number of workers, we have currently 2 workers, will bump to 10.

I've tried bumping the worker count with --backgroundScanWorkers=5. I am not quite sure this is a proper fix: it ramps up CPU usage dramatically (obviously), creates even more load on the API, and lots of requests are still being throttled. Maybe we need some guidance on setting up correct Kubernetes Priority and Fairness (P&F) for large clusters?
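On the P&F question, a rough way to check whether API Priority and Fairness is queueing or rejecting requests (a sketch using the standard kube-apiserver flowcontrol metrics):

kubectl get --raw /metrics \
  | grep -E 'apiserver_flowcontrol_(rejected_requests_total|current_inqueue_requests)' \
  | grep -v ' 0$'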

eddycharly commented 1 year ago

backgroundScanWorkers does not control the number of workers in the admission reports controller.

1.9.2-rc.1 is being cut, please give it a try.

riuvshyn commented 1 year ago

@eddycharly oh, I thought the backgroundScanWorkers were the only workers there. Thanks for the clarification!

eddycharly commented 1 year ago

backgroundScanWorkers is for background scan, it does not influence admission reports controller.

chipzoller commented 1 year ago

1.9.2-rc1 is now available. Please give it a try.

chipzoller commented 1 year ago

This might either already be addressed in 1.9.1 or 1.9.2, or may be addressed by https://github.com/kyverno/kyverno/pull/6568.

devmechanic commented 1 year ago

We upgraded from 1.7.3 to 1.8.5 last week and this is what happened :)

kubectl get admr -A --no-headers | wc -l
93463

kubectl get cadmr --no-headers | wc -l
193077

chipzoller commented 1 year ago

This should be addressed in 1.9.2 based on existing changes. Closing; please re-open after testing on 1.9.2 if the issue still exists.

amit-disc commented 1 year ago
Total 3239758 key-value pairs
Entries by 'KEY GROUP' (total 6.1 GB):
[same etcd key-group table as quoted earlier in this thread]

@L1ghtman2k How do you calculate this size? Any tool?

chipzoller commented 1 year ago

See some of the info in PR #6949

L1ghtman2k commented 1 year ago

This info was provided to me by the AWS folks, since we run on EKS. But I assume they just queried the etcd nodes directly.
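For anyone with direct etcd access (not possible on a managed EKS control plane), a rough sketch of how such numbers could be gathered with etcdctl; the endpoint and certificate paths below are typical kubeadm defaults and are assumptions:

CERTS="--cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/server.crt --key=/etc/kubernetes/pki/etcd/server.key"
# count AdmissionReport keys under one namespace prefix
ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 $CERTS \
  get /registry/kyverno.io/admissionreports/gateway --prefix --keys-only | grep -c .
# overall database size per etcd endpoint
ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 $CERTS endpoint status --write-out=table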