
[Bug] Kyverno - Not letting EKS worker nodes come up when whole cluster is going down #11122

Open anuragjain08 opened 2 months ago

anuragjain08 commented 2 months ago

Kyverno Version

1.12.5

Kubernetes Version

1.29.x

Kubernetes Platform

EKS

Kyverno Rule Type

Validate

Description

I installed Kyverno via Helm chart version 3.2.6 (Kyverno 1.12.5). By default, through the ConfigMap and the webhooks, the kube-system namespace was excluded from the admission controller, and the kyverno namespace was excluded because of `excludeKyvernoNamespace`. Kyverno installed properly. I also installed policy-reporter-ui in the kyverno namespace via its Helm chart, and it worked as expected. I then applied the three policies below in audit mode (a quick way to check their audit results is sketched after the list).

  1. `restrict-annotations-kube2iam`:

     ```yaml
     apiVersion: kyverno.io/v1
     kind: ClusterPolicy
     metadata:
       name: restrict-annotations-kube2iam
       annotations:
         policies.kyverno.io/title: Restrict Annotations
         policies.kyverno.io/category: Annotation
         policies.kyverno.io/subject: Pod, Annotation
         policies.kyverno.io/description: >-
           This policy prevents the use of an annotation beginning with
           `iam.amazonaws.com/role`. This can be useful to ensure users either
           don't set reserved annotations or to force them to use a newer
           version of an annotation.
     spec:
       validationFailureAction: audit
       background: true
       rules:
         - name: block-kube2iam
           match:
             any:
               - resources:
                   kinds:
                     - Deployment
                     - CronJob
                     - Job
                     - StatefulSet
                     - DaemonSet
                     - Pod
           validate:
             message: Cannot use kube2iam annotation.
             pattern:
               metadata:
                 =(annotations):
                   X(iam.amazonaws.com/role): "*?"
     ```
  2. `require-labels`:

     ```yaml
     apiVersion: kyverno.io/v1
     kind: ClusterPolicy
     metadata:
       name: require-labels
       annotations:
         policies.kyverno.io/title: Require Labels
         policies.kyverno.io/category: Best Practices
         policies.kyverno.io/subject: Pod, Label
         policies.kyverno.io/description: >-
           Define and use labels that identify semantic attributes of your
           application or Deployment. A common set of labels allows tools to
           work collaboratively, describing objects in a common manner that
           all tools can understand. The recommended labels describe
           applications in a way that can be queried. This policy validates
           that the below labels are specified with some value.
     spec:
       validationFailureAction: audit
       background: true
       rules:
         - name: check-for-labels
           match:
             any:
               - resources:
                   kinds:
                     - Pod
           validate:
             message: "The below labels are required."
             pattern:
               metadata:
                 labels:
                   aa: "aa || aaa || aaaaa"
                   bb: "?*"
                   cc: "aa || aa || bb || cc || dd || ee"
     ```

  3. `restrict-image-registries`:

     ```yaml
     apiVersion: kyverno.io/v1
     kind: ClusterPolicy
     metadata:
       name: restrict-image-registries
       annotations:
         policies.kyverno.io/title: Restrict Image Registries
         policies.kyverno.io/category: Best Practices, EKS Best Practices
         policies.kyverno.io/subject: Pod
         policies.kyverno.io/description: >-
           Images from unknown, public registries can be of dubious quality
           and may not be scanned and secured, representing a high degree of
           risk. Requiring use of known, approved registries helps reduce
           threat exposure by ensuring image pulls only come from them. This
           policy validates that container images only originate from the
           private ECR registry.
     spec:
       validationFailureAction: audit
       background: true
       rules:
         - name: validate-registries
           match:
             any:
               - resources:
                   kinds:
                     - Deployment
                     - CronJob
                     - Job
                     - StatefulSet
                     - DaemonSet
                     - Pod
           validate:
             message: "Unknown image registry."
             pattern:
               spec:
                 =(ephemeralContainers):
                   - image: "1234.dkr.ecr.us-east-1.amazonaws.com/ | 2345..dkr.ecr.ap-south-1.amazonaws.com/"
                 =(initContainers):
                   - image: "1234.dkr.ecr.us-east-1.amazonaws.com/ | 2345..dkr.ecr.ap-south-1.amazonaws.com/"
                 containers:
                   - image: "1234.dkr.ecr.us-east-1.amazonaws.com/ | 2345..dkr.ecr.ap-south-1.amazonaws.com/"
     ```
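Since all three policies use `validationFailureAction: audit`, violations are recorded as policy reports rather than blocking requests. A quick way to confirm the policies are installed and producing results (standard Kyverno / Policy Reporter resources, shown only as a sketch):

```sh
# Policies should report as ready
kubectl get clusterpolicies

# Audit results are aggregated into (Cluster)PolicyReports
kubectl get policyreports -A
kubectl get clusterpolicyreports
```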

Then, to verify that Kyverno doesn't interfere with the kube-system namespace, I brought down all the worker nodes of the EKS cluster. When the autoscaler brought new instances into the cluster, they never reached the Ready state.

Steps to reproduce

  1. Install Kyverno using this Helm chart (a sketch of the install commands follows this list): https://artifacthub.io/packages/helm/kyverno/kyverno
  2. Install policy-reporter-ui using this Helm chart: https://artifacthub.io/packages/helm/policy-reporter/policy-reporter
  3. Install the three audit-mode policies listed above.
  4. Bring down all the worker nodes of the cluster.
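
For reference, a minimal install sketch; the repository URLs, release names, namespace, and the `ui.enabled` flag are assumptions based on the charts' documentation rather than details from this report:

```sh
# Kyverno 1.12.5 via chart 3.2.6
helm repo add kyverno https://kyverno.github.io/kyverno/
helm install kyverno kyverno/kyverno -n kyverno --create-namespace --version 3.2.6

# Policy Reporter with its UI, in the same kyverno namespace
helm repo add policy-reporter https://kyverno.github.io/policy-reporter
helm install policy-reporter policy-reporter/policy-reporter -n kyverno --set ui.enabled=true
```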

Expected behavior

  1. Since, per the ConfigMap below, both the kube-system and kyverno namespaces were excluded by default from the webhooks as well as the resource filters, the new nodes should have come up in the Ready state and been able to serve pods as soon as they joined. ConfigMap:

```
Name:         kyverno
Namespace:    kyverno
Labels:       app.kubernetes.io/component=config
              app.kubernetes.io/instance=kyverno
              app.kubernetes.io/managed-by=Helm
              app.kubernetes.io/part-of=kyverno
              app.kubernetes.io/version=3.2.6
              helm.sh/chart=kyverno-3.2.6
              helm.toolkit.fluxcd.io/name=kyverno
              helm.toolkit.fluxcd.io/namespace=kyverno
Annotations:  helm.sh/resource-policy: keep
              meta.helm.sh/release-name: kyverno
              meta.helm.sh/release-namespace: kyverno

Data
====
defaultRegistry:                docker.io
enableDefaultRegistryMutation:  true
excludeGroups:                  system:nodes
generateSuccessEvents:          false
resourceFilters:
  [*/*,kyverno,*] [Event,*,*] [*/*,kube-system,*] [*/*,kube-public,*] [*/*,kube-node-lease,*]
  [Node,*,*] [Node/*,*,*] [APIService,*,*] [APIService/*,*,*] [TokenReview,*,*]
  [SubjectAccessReview,*,*] [SelfSubjectAccessReview,*,*] [Binding,*,*] [Pod/binding,*,*]
  [ReplicaSet,*,*] [ReplicaSet/*,*,*] [AdmissionReport,*,*] [AdmissionReport/*,*,*]
  [ClusterAdmissionReport,*,*] [ClusterAdmissionReport/*,*,*]
  [BackgroundScanReport,*,*] [BackgroundScanReport/*,*,*]
  [ClusterBackgroundScanReport,*,*] [ClusterBackgroundScanReport/*,*,*]
  [ClusterRole,*,kyverno:admission-controller] [ClusterRole,*,kyverno:admission-controller:core] [ClusterRole,*,kyverno:admission-controller:additional]
  [ClusterRole,*,kyverno:background-controller] [ClusterRole,*,kyverno:background-controller:core] [ClusterRole,*,kyverno:background-controller:additional]
  [ClusterRole,*,kyverno:cleanup-controller] [ClusterRole,*,kyverno:cleanup-controller:core] [ClusterRole,*,kyverno:cleanup-controller:additional]
  [ClusterRole,*,kyverno:reports-controller] [ClusterRole,*,kyverno:reports-controller:core] [ClusterRole,*,kyverno:reports-controller:additional]
  [ClusterRoleBinding,*,kyverno:admission-controller] [ClusterRoleBinding,*,kyverno:background-controller]
  [ClusterRoleBinding,*,kyverno:cleanup-controller] [ClusterRoleBinding,*,kyverno:reports-controller]
  [ServiceAccount,kyverno,kyverno-admission-controller] [ServiceAccount/*,kyverno,kyverno-admission-controller]
  [ServiceAccount,kyverno,kyverno-background-controller] [ServiceAccount/*,kyverno,kyverno-background-controller]
  [ServiceAccount,kyverno,kyverno-cleanup-controller] [ServiceAccount/*,kyverno,kyverno-cleanup-controller]
  [ServiceAccount,kyverno,kyverno-reports-controller] [ServiceAccount/*,kyverno,kyverno-reports-controller]
  [Role,kyverno,kyverno:admission-controller] [Role,kyverno,kyverno:background-controller]
  [Role,kyverno,kyverno:cleanup-controller] [Role,kyverno,kyverno:reports-controller]
  [RoleBinding,kyverno,kyverno:admission-controller] [RoleBinding,kyverno,kyverno:background-controller]
  [RoleBinding,kyverno,kyverno:cleanup-controller] [RoleBinding,kyverno,kyverno:reports-controller]
  [ConfigMap,kyverno,kyverno] [ConfigMap,kyverno,kyverno-metrics]
  [Deployment,kyverno,kyverno-admission-controller] [Deployment/*,kyverno,kyverno-admission-controller]
  [Deployment,kyverno,kyverno-background-controller] [Deployment/*,kyverno,kyverno-background-controller]
  [Deployment,kyverno,kyverno-cleanup-controller] [Deployment/*,kyverno,kyverno-cleanup-controller]
  [Deployment,kyverno,kyverno-reports-controller] [Deployment/*,kyverno,kyverno-reports-controller]
  [Pod,kyverno,kyverno-admission-controller-*] [Pod/*,kyverno,kyverno-admission-controller-*]
  [Pod,kyverno,kyverno-background-controller-*] [Pod/*,kyverno,kyverno-background-controller-*]
  [Pod,kyverno,kyverno-cleanup-controller-*] [Pod/*,kyverno,kyverno-cleanup-controller-*]
  [Pod,kyverno,kyverno-reports-controller-*] [Pod/*,kyverno,kyverno-reports-controller-*]
  [Job,kyverno,kyverno-hook-pre-delete] [Job/*,kyverno,kyverno-hook-pre-delete]
  [NetworkPolicy,kyverno,kyverno-admission-controller] [NetworkPolicy/*,kyverno,kyverno-admission-controller]
  [NetworkPolicy,kyverno,kyverno-background-controller] [NetworkPolicy/*,kyverno,kyverno-background-controller]
  [NetworkPolicy,kyverno,kyverno-cleanup-controller] [NetworkPolicy/*,kyverno,kyverno-cleanup-controller]
  [NetworkPolicy,kyverno,kyverno-reports-controller] [NetworkPolicy/*,kyverno,kyverno-reports-controller]
  [PodDisruptionBudget,kyverno,kyverno-admission-controller] [PodDisruptionBudget/*,kyverno,kyverno-admission-controller]
  [PodDisruptionBudget,kyverno,kyverno-background-controller] [PodDisruptionBudget/*,kyverno,kyverno-background-controller]
  [PodDisruptionBudget,kyverno,kyverno-cleanup-controller] [PodDisruptionBudget/*,kyverno,kyverno-cleanup-controller]
  [PodDisruptionBudget,kyverno,kyverno-reports-controller] [PodDisruptionBudget/*,kyverno,kyverno-reports-controller]
  [Service,kyverno,kyverno-svc] [Service/*,kyverno,kyverno-svc]
  [Service,kyverno,kyverno-svc-metrics] [Service/*,kyverno,kyverno-svc-metrics]
  [Service,kyverno,kyverno-background-controller-metrics] [Service/*,kyverno,kyverno-background-controller-metrics]
  [Service,kyverno,kyverno-cleanup-controller] [Service/*,kyverno,kyverno-cleanup-controller]
  [Service,kyverno,kyverno-cleanup-controller-metrics] [Service/*,kyverno,kyverno-cleanup-controller-metrics]
  [Service,kyverno,kyverno-reports-controller-metrics] [Service/*,kyverno,kyverno-reports-controller-metrics]
  [ServiceMonitor,kyverno,kyverno-admission-controller] [ServiceMonitor,kyverno,kyverno-background-controller]
  [ServiceMonitor,kyverno,kyverno-cleanup-controller] [ServiceMonitor,kyverno,kyverno-reports-controller]
  [Secret,kyverno,kyverno-svc.kyverno.svc.*] [Secret,kyverno,kyverno-cleanup-controller.kyverno.svc.*]
webhookAnnotations: {"admissions.enforcer/disabled":"true"}
webhooks: [{"matchExpressions":[{"key":"kubernetes.io/metadata.name","operator":"NotIn","values":["kube-system"]}],"namespaceSelector":{"matchExpressions":[{"key":"kubernetes.io/metadata.name","operator":"NotIn","values":["kyverno"]}],"matchLabels":null}}]

BinaryData
====

Events:
```

  2. But I had to execute the two commands below to delete the webhooks; only then did the worker nodes come into the Ready state, within seconds. This manual intervention was not expected.

```sh
kubectl delete validatingwebhookconfiguration kyverno-resource-validating-webhook-cfg
kubectl delete mutatingwebhookconfiguration kyverno-resource-mutating-webhook-cfg
```

  3. Although I am currently running only 1 replica of each controller, I don't think even 3 replicas of the admission controller would help when the complete cluster (all worker nodes) goes down. Any suggestions on this? (See the sketch after this list.)
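
For context, a sketch of the Helm values that would run the controllers highly available; the key names follow the chart's documented layout and should be verified against chart 3.2.6, and, as noted above, extra replicas still cannot help once every worker node is gone:

```yaml
# values.yaml (sketch; key names assumed from the kyverno chart layout)
admissionController:
  replicas: 3   # run the webhook-serving controller highly available
backgroundController:
  replicas: 2
reportsController:
  replicas: 2
```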

Screenshots

No response

Kyverno logs

The only error log I could see:

```
Warning  FailedCreate      19m (x95 over 7h57m)  daemonset-controller  Error creating: Internal error occurred: failed calling webhook "validate.kyverno.svc-fail": failed to call webhook: Post "https://kyverno-svc.kyverno.svc:443/validate/fail?timeout=10s": no endpoints available for service "kyverno-svc"
```
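
The "no endpoints available" part of that event can be confirmed directly; a quick check (the label selector is assumed from the chart's labelling conventions):

```sh
# If the admission controller pods are down or unready, the endpoints list will be empty
kubectl -n kyverno get endpoints kyverno-svc
kubectl -n kyverno get pods -l app.kubernetes.io/component=admission-controller
```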

Slack discussion

No response

Troubleshooting

welcome[bot] commented 2 months ago

Thanks for opening your first issue here! Be sure to follow the issue template!

anuragjain08 commented 2 months ago

Could someone please help here? This is completely blocking me from using Kyverno in EKS clusters.

realshuting commented 2 months ago

Have you read this https://kyverno.io/docs/troubleshooting/#kyverno-fails-on-eks?

anuragjain08 commented 2 months ago

Thanks for sharing this @realshuting - I had missed this one, but it seems like my cluster already has all the required configuration:

  1. I am using VPC CNI version v1.18.2-eksbuild.1, which is just n-4, with Kubernetes version 1.30; the two are supported with each other.
  2. Port 9443 is already open in the worker node SG inbound rules from the cluster security group (the additional cluster security group). The same worker node SG is attached to the ENIs as well.

So, what else should I check in this case, or am I missing something here? @realshuting (A CLI check of the SG rule is sketched below.)
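
For completeness, a sketch of how the inbound 9443 rule could be double-checked from the CLI; the security group ID is a hypothetical placeholder, not a value from this report:

```sh
# sg-0123456789abcdef0 is a placeholder for the worker node security group ID
aws ec2 describe-security-group-rules \
  --filters Name=group-id,Values=sg-0123456789abcdef0 \
  --query 'SecurityGroupRules[?ToPort==`9443`]'
```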

realshuting commented 2 months ago

Have you verified this?

```sh
$ kubectl run busybox --rm -ti --image=busybox -- /bin/sh
If you don't see a command prompt, try pressing enter.
/ # wget --no-check-certificate --spider --timeout=1 https://kyverno-svc.kyverno.svc:443/health/liveness
Connecting to kyverno-svc.kyverno.svc:443 (100.67.141.176:443)
remote file exists
/ # exit
Session ended, resume using 'kubectl attach busybox -c busybox -i -t' command when the pod is running
pod "busybox" deleted
```

anuragjain08 commented 2 months ago

@realshuting I am getting a TLS handshake error:

```sh
$ kubectl run busybox --rm -ti --image=busybox -- /bin/sh
If you don't see a command prompt, try pressing enter.
/ # wget --no-check-certificate --spider --timeout=1 https://kyverno-svc.kyverno.svc:443/health/liveness
Connecting to kyverno-svc.kyverno.svc:443 (10.100.229.190:443)
wget: TLS error from peer (alert code 40): handshake failure
wget: error getting response: Connection reset by peer
```
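
For reference, a handshake failure like this usually points at the webhook serving certificate. A minimal sketch of how one might inspect and regenerate it; the secret and deployment names are taken from the ConfigMap dump above and should be verified against the cluster before deleting anything:

```sh
# Kyverno keeps its webhook CA and serving certificate in these secrets
kubectl -n kyverno get secret | grep kyverno-svc.kyverno.svc

# Deleting them and restarting the admission controller forces the certificates
# and webhook configurations to be regenerated (verify names first)
kubectl -n kyverno delete secret kyverno-svc.kyverno.svc.kyverno-tls-ca kyverno-svc.kyverno.svc.kyverno-tls-pair
kubectl -n kyverno rollout restart deployment kyverno-admission-controller
```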

Also, I can see:

```sh
$ kubectl get validatingwebhookconfigurations,mutatingwebhookconfigurations | grep kyverno
mutatingwebhookconfiguration.admissionregistration.k8s.io/kyverno-resource-mutating-webhook-cfg   0   6d21h
```
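
Since the webhook configurations are created by the admission controller itself, its logs are one place to look for registration or certificate errors; a sketch, using the deployment name from the ConfigMap dump above:

```sh
# Look for webhook registration / certificate errors in the admission controller
kubectl -n kyverno logs deployment/kyverno-admission-controller | grep -iE "webhook|certificate|tls"
```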

Also, there is a timing mismatch between the pod creation time and the webhook creation time.

But I am struggling a bit to understand how this can block nodes from coming up when the complete cluster goes down.

anuragjain08 commented 1 month ago

Hi @realshuting, is there anything else I am missing here, or could you take a look and help sort this out? Thank you.

realshuting commented 1 month ago

Hi @anuragjain08 - something is blocking the connection; you need to check whether there are any firewall rules or NetworkPolicies set for your cluster.
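
For example, a quick way to see whether any NetworkPolicies could be interfering (plain kubectl; only a sketch):

```sh
# List all NetworkPolicies, then inspect any that apply to the kyverno namespace
kubectl get networkpolicies -A
kubectl -n kyverno describe networkpolicy
```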

Until the connection issue is addressed, you can enable the Ignore failurePolicy (see this Helm option) so that requests are not blocked when the webhook cannot be reached.
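
A minimal sketch of what that could look like; the `features.forceFailurePolicyIgnore` flag is assumed from recent chart versions and should be checked against the chart's values (it trades enforcement guarantees for availability):

```yaml
# values.yaml (sketch) - register Kyverno webhooks with failurePolicy: Ignore
# so API requests are admitted even when the webhook service is unreachable
features:
  forceFailurePolicyIgnore:
    enabled: true
```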