What is your cluster size? How many reports do you have?
I have included this info in the Slack thread, but just in case:
Also, some info about the cluster:
number of nodes: 25
k8s version 1.22
number of pods: 760
Kyverno info:
Version: 1.9
deployed via Helm
all policies are in audit mode
There is a policy that matches */* (the initial OOM happened while we were trying to tune proper requests/limits)
Limits when the OOM happened:
kyverno:
limits_memory: 12144Mi
requests_cpu: 1200m
requests_memory: 6072Mi
Adjusted limits:
kyverno:
limits_memory: 6072Mi
requests_cpu: 800m
requests_memory: 1546Mi
aibek@DESKTOP-B0KD043:~/GolandProjects/cluster-despina$ k get clusterpolicies.kyverno.io -A
NAME BACKGROUND VALIDATE ACTION READY AGE
deny-duplicate-rotator-targets true audit true 23d
deny-modify-platform-label false audit true 23d
disallow-capabilities true audit true 23d
disallow-capabilities-strict true audit true 23d
disallow-host-namespaces true audit true 23d
disallow-host-path true audit true 23d
disallow-host-ports true audit true 23d
disallow-host-process true audit true 23d
disallow-privilege-escalation true audit true 23d
disallow-privileged-containers true audit true 23d
disallow-proc-mount true audit true 23d
disallow-selinux true audit true 23d
require-pod-probes false audit true 23d
require-run-as-non-root-user true audit true 23d
require-run-as-nonroot true audit true 23d
restrict-apparmor-profiles true audit true 23d
restrict-seccomp true audit true 23d
restrict-seccomp-strict true audit true 23d
restrict-sysctls true audit true 23d
restrict-volume-types true audit true 23d
Total 3239758 key-value pairs
Entries by 'KEY GROUP' (total 6.1 GB):
+--------------------------------------------------------------------------------+--------------------------------+--------+
| KEY GROUP | KIND | SIZE |
+--------------------------------------------------------------------------------+--------------------------------+--------+
| /registry/kyverno.io/admissionreports/gateway | AdmissionReport | 3.2 GB |
| /registry/kyverno.io/admissionreports/agena | AdmissionReport | 973 MB |
| /registry/kyverno.io/admissionreports/cassini | AdmissionReport | 580 MB |
| /registry/kyverno.io/admissionreports/mlops | AdmissionReport | 408 MB |
| /registry/kyverno.io/admissionreports/tigera-operator | AdmissionReport | 154 MB |
| /registry/kyverno.io/admissionreports/titan | AdmissionReport | 98 MB |
| /registry/kyverno.io/admissionreports/calico-system | AdmissionReport | 94 MB |
| /registry/kyverno.io/admissionreports/secrets-rotator-operator | AdmissionReport | 80 MB |
| /registry/kyverno.io/admissionreports/resource-manager | AdmissionReport | 75 MB |
| /registry/kyverno.io/admissionreports/ingress-nginx | AdmissionReport | 60 MB |
| /registry/kyverno.io/admissionreports/iam | AdmissionReport | 58 MB |
| /registry/kyverno.io/clusteradmissionreports | ClusterAdmissionReport | 38 MB |
| /registry/kyverno.io/admissionreports/vas | AdmissionReport | 30 MB |
| /registry/kyverno.io/admissionreports/gts | AdmissionReport | 21 MB |
| /registry/kyverno.io/admissionreports/gl-metering | AdmissionReport | 13 MB |
| /registry/kyverno.io/admissionreports/global-trade-manager | AdmissionReport | 12 MB |
| /registry/kyverno.io/admissionreports/quake | AdmissionReport | 11 MB |
| /registry/kyverno.io/admissionreports/lh-api-orch | AdmissionReport | 10 MB |
| /registry/kyverno.io/admissionreports/hpcaas | AdmissionReport | 10 MB |
---
It looks like Kyverno cannot clean up admission reports in some way. The size in the gateway namespace is suspiciously high. I have trouble getting the entire contents of the gateway namespace because kubectl get times out, but here are the objects that are there before the timeout stops the get process:
$ sort yqgateway.yaml | uniq -c
9567 ConfigMap
15 CronJob
8 Deployment
14 EndpointSlice
12 Endpoints
75 HorizontalPodAutoscaler
24 Ingress
1 Job
174448 Lease
1 NetworkPolicy
14 Pod
1 PodDisruptionBudget
2 Scale
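For reference, a rough way to get a similar per-kind count straight from the API (a best-effort sketch; it enumerates every listable namespaced resource type in the gateway namespace, and very large namespaces may still time out):
# count objects per listable resource type in the gateway namespace
for r in $(kubectl api-resources --verbs=list --namespaced -o name); do
  n=$(kubectl -n gateway get "$r" --no-headers --ignore-not-found 2>/dev/null | wc -l)
  [ "$n" -gt 0 ] && echo "$n $r"
done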
I have also noticed massive amounts of admission reports in a few of my Kyverno installations. This is on v1.8.5.
I have tried some of the troubleshooting shown in the documentation:
If I just pull the cadmr, I randomly get an error partway through generating the table:
Error from server (InternalError): an error on the server ("error trying to reach service: tunnel disconnect") has prevented the request from succeeding (get clusteradmissionreports.kyverno.io)
Looks like getting the admission reports with kubectl returns different counts each time it is run. I believe it is erroring out as well.
❯ COUNT=$(kubectl get cadmr --no-headers 2> /dev/null | wc -l)
echo "number of cluster admission reports: $COUNT"
number of cluster admission reports: 2000
❯ COUNT=$(kubectl get cadmr --no-headers 2> /dev/null | wc -l)
echo "number of cluster admission reports: $COUNT"
number of cluster admission reports: 23500
The orphaned count is also all over the place, but it's probably most of them:
❯ ALL=$(kubectl get admr -A --no-headers 2> /dev/null | wc -l)
NOT_ORPHANS=$(kubectl get admr -A --no-headers -o jsonpath="{range .items[?(@.metadata.ownerReferences[0].uid)]}{.metadata.name}{'\n'}{end}" 2> /dev/null | wc -l)
echo "number of orphan admission reports: $((ALL-NOT_ORPHANS)) ($ALL - $NOT_ORPHANS)"
number of orphan admission reports: -7500 ( 3000 - 10500)
I would like to do a watch on the incoming admission reports, but I cannot keep the tunnel open for that.
❯ kubectl get admr -A -o wide -w --watch-only
Error from server (InternalError): an error on the server ("error trying to reach service: tunnel disconnect") has prevented the request from succeeding (get admissionreports.kyverno.io)
I think from here it would be nice to clean up the existing admission reports so I can get some breathing room and see if there is something chatty using the above command. What is the recommended approach for cleaning up the admission reports? Any objection to a kubectl delete cadmr,admr --all?
I have many installations of Kyverno and confused the version numbers. This is actually running v1.8.0, which is known to have admission reporting issues. I am currently upgrading to v1.8.5. The cluster appears to be cleaning up a lot.
The issue seems to be related to running Kyverno on large clusters (or with rules that match on *) with low or default QPS/Burst settings. Closing this for now, as it doesn't seem to be a bug, but rather a misconfiguration.
UPDATE: Figured I should also update this and note that one can monitor/alert on etcd_db_total_size_in_bytes to potentially prevent this in advance.
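For a quick ad-hoc check, the kube-apiserver exposes this metric on its metrics endpoint (an alert threshold would depend on your etcd quota):
# current etcd database size as reported by the kube-apiserver
kubectl get --raw=/metrics | grep etcd_db_total_size_in_bytes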
@realshuting @eddycharly - can we correlate the QPS to the ARPS, to make some recommendations?
We have the same issue in some clusters running Kyverno in combination with Crossplane. The problem on our side is that our EKS control plane is unable to start up GC because of this issue. Any advice?
kubectl get --raw=/metrics | grep apiserver_storage_objects | awk '$2>100' | sort -n -k 2
# HELP apiserver_storage_objects [STABLE] Number of stored objects at the time of last check split by kind.
# TYPE apiserver_storage_objects gauge
apiserver_storage_objects{resource="clusteradmissionreports.kyverno.io"} 2.581064e+06
We are running a cleanup in our cluster at the moment with:
for i in {1..10}
do
time kubectl get --raw "/apis/kyverno.io/v1alpha2/clusteradmissionreports?limit=10000" | jq -r '.items | .[].metadata.name' | xargs -n 1 -P 400 kubectl delete clusteradmissionreports 2>/dev/null
done
@haarchri what size is your cluster? Do you see throttled requests in the logs?
Small update: we are now running this pod in our cluster: https://gist.github.com/haarchri/90c5f3374686531f953430194d22643a
1000+ CRDs, 1500+ Crossplane managed resources, 600+ Pods, 300+ Jobs
Check for throttled requests; this could well be the cause.
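A quick, hedged way to spot client-side throttling is to grep the Kyverno logs for client-go's throttling message (the namespace and deployment name below assume a default kyverno/kyverno install):
# client-go logs "client-side throttling" when requests are rate limited locally
kubectl -n kyverno logs deploy/kyverno --since=1h | grep -i "client-side throttling"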
In my opinion it's still a bug, not a misconfiguration. In our case the etcd database grew by ~3 GB. In one of the namespaces we were not able to list all AdmissionReport resources. The only way to get rid of it was to remove the CRD responsible for AdmissionReports. To work around this bug, the only option right now is to run Kyverno without admission reports (the flag --admissionReports=false has to be added to the Kyverno container) to be sure it will not crash a cluster.
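For reference, one way to pass that flag when deploying via Helm - a sketch that assumes the chart version in use exposes an extraArgs list for the Kyverno container (check your chart's values first):
# release and namespace names are assumptions; adjust for your install
helm upgrade kyverno kyverno/kyverno -n kyverno --reuse-values \
  --set 'extraArgs={--admissionReports=false}'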
@krol76 what size is your cluster? Do you see throttled requests in the logs?
@eddycharly As we cleaned up the AdmissionReports it's 3.5 GB (earlier 6.5 GB). And since we started to run Kyverno without AdmissionReports, I can't see any advantages to having them enabled. Policy reports seem to be reliable (at least in my tests with a limited environment there were no differences), even though the documentation says they are based on admission events and background scans. Additionally, what I observed during tests with AdmissionReports enabled: if a request is blocked because of a policy restriction, an AdmissionReport is created and takes more time to be removed (it has no hash in the "Aggregate" column). And finally, the most time-consuming API actions (hah - new Kyverno tracing feature!) are connected with AdmissionReports.
And finally, the most time-consuming API actions (hah - new Kyverno tracing feature!) are connected with AdmissionReports
This could really be related to throttling.
Reopened this to investigate a solution. Folks on this thread, are all of you using at least one policy which matches on a wildcard (*)? Anyone not doing this?
Latest thread on this:
https://kubernetes.slack.com/archives/CLGR9BJU9/p1678371541095359
@eddycharly - can we try and reproduce this by setting a very low QPS?
Have we looked at the metrics to check the reported Admission Requests per second vs the configured QPS limits? Seems like with a wildcard policy we may have 2+ queries from Kyverno for each matching admission request, right?
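For anyone experimenting with this, the client-side limits can be changed (lowered to try to reproduce, or raised on large clusters) via Kyverno's rate-limit flags - a sketch assuming the --clientRateLimitQPS/--clientRateLimitBurst flag names and a Helm chart that exposes extraArgs; verify against your version's --help:
# example values only; tune QPS/Burst for your cluster size
helm upgrade kyverno kyverno/kyverno -n kyverno --reuse-values \
  --set 'extraArgs={--clientRateLimitQPS=50,--clientRateLimitBurst=100}'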
Folks on this thread, are all of you using at least one policy which matches on a wildcard (*)? Anyone not doing this?
Definitely NO wildcards in match statements.
Side note: we also have the problem that GC in AWS EKS is not running anymore.
2023-03-09T16:56:33.000+01:00 E0307 15:56:33.804756 11 shared_informer.go:243] unable to sync caches for garbage collector
2023-03-09T16:56:33.000+01:00 E0307 15:56:33.804769 11 garbagecollector.go:242] timed out waiting for dependency graph builder sync during GC sync (attempt 31676)
Looks like we are running one policy with *:
apiVersion: kyverno.io/v1
kind: Policy
metadata:
  name: compliance-common-labels-{{ $.Values.namespace }}
  namespace: {{ $.Values.namespace }}
  annotations:
    policies.kyverno.io/title: Common Labels
    policies.kyverno.io/category: Compliance
    policies.kyverno.io/severity: medium
    policies.kyverno.io/subject: Label
    policies.kyverno.io/description: Common Labels are required for all workloads on Cluster.
spec:
  # schemaValidation combined with the filter `kind: '*'` issues a validation error "invalid kind" with TokenRequest resources
  # See https://github.com/kyverno/kyverno/issues/5136#issuecomment-1370618333 and following comments
  # TODO remove with >=1.9
  schemaValidation: false
  background: false
  rules:
    - name: add-common-labels
      match:
        resources:
          kinds:
            - "*"
      mutate:
        patchStrategicMerge:
          metadata:
            labels:
For us it's also not clear why this change landed in a 1.8.x release (https://github.com/kyverno/kyverno/pull/5034) - any guidance? We'd best go to 300 when running with Crossplane, like the providers in the ecosystem (https://github.com/crossplane-contrib/provider-helm/pull/179) or upstream k8s (https://github.com/kubernetes/kubernetes/pull/109141) do.
Thanks to @eddycharly for opening https://github.com/kyverno/kyverno/pull/6522
Another thing that can help is the number of workers; we currently have 2 workers and will bump to 10.
We're going to cut 1.9.2-rc1 today. Please try with the tag for this to see if resolves the issue.
Another thing that can help is the number of workers; we currently have 2 workers and will bump to 10.
I've tried to bump the worker count with --backgroundScanWorkers=5, but I am not quite sure this is a proper fix: it ramps up CPU usage dramatically (obviously), creates even more load on the API, and lots of requests are still being throttled. Maybe it needs some guidance on setting up correct k8s P&F for large clusters?
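One rough way to check whether server-side API Priority and Fairness (as opposed to Kyverno's client-side limiter) is actually queueing or rejecting requests is the apiserver's flow-control metrics:
# non-zero rejected/in-queue counts point at server-side P&F pressure
kubectl get --raw=/metrics | grep -E 'apiserver_flowcontrol_(rejected_requests_total|current_inqueue_requests)' | awk '$2>0'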
backgroundScanWorkers does not control the number of workers in the admission reports controller.
1.9.2-rc.1 is being cut, please give it a try.
@eddycharly oh, I thought backgroundScanWorkers were the only workers there. Thanks for the clarification!
backgroundScanWorkers is for background scans; it does not influence the admission reports controller.
1.9.2-rc1 is now available. Please give it a try.
This might either already be addressed in 1.9.1 or 1.9.2, or may be addressed by https://github.com/kyverno/kyverno/pull/6568.
We upgraded from 1.7.3 to 1.8.5 last week and this is what happened :)
kubectl get admr -A --no-headers | wc -l
93463
kubectl get cadmr --no-headers | wc -l
193077
Should be addressed in 1.9.2 based on existing changes. Closing, please re-open after testing on 1.9.2 if still exists.
@L1ghtman2k How do you calculate this size? Any tool?
See some of the info in PR #6949
This info was provided to me by the AWS folks, since we run on EKS. But I assume they just queried the etcd nodes directly.
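If you do have direct etcd access (self-managed etcd; EKS does not expose it), a rough sketch to approximate one prefix's size - it downloads the values, so for multi-GB prefixes prefer --keys-only plus --limit paging:
# assumes etcdctl endpoints/certs are already configured in the environment
ETCDCTL_API=3 etcdctl get /registry/kyverno.io/admissionreports --prefix --write-out=json \
  | jq '{keys: (.kvs | length), approx_bytes: ([.kvs[]?.value | length * 3 / 4] | add // 0)}'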
Kyverno Version
1.9.0
Description
Kyverno creates too many AdmissionReports, resulting in etcd running out of space and the cluster becoming unresponsive.
This might somehow be related to one of the clusterpolicies on our clusters matching on *.
More details in the Slack thread.
Current workaround: setting admissionReports=false
Slack discussion
https://kubernetes.slack.com/archives/CLGR9BJU9/p1677785359616429
Troubleshooting