Have you opened an issue on the kyverno side? I will note that just putting it into audit mode will not help you if the webhooks are set to fail closed; if the webhook is set to intercept management of secrets or other resources that are critical to the startup of rke2, failing closed when the backend is unavailable can easily break things to the point where the cluster will not start until you remove the webhook configuration. This is a known issue with webhooks in general, especially when the webhook endpoint is hosted within the cluster it is supposed to protect.
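To illustrate the fail-closed behavior: the relevant knob is the failurePolicy field on each webhook. A minimal sketch with illustrative names (not kyverno's actual configuration):

apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: example-webhook   # illustrative name
webhooks:
  - name: validate.example.com
    # Fail = the apiserver rejects the request when the webhook backend is
    # unreachable (fail closed); Ignore = the request is admitted (fail open).
    failurePolicy: Fail
    clientConfig:
      service:             # an in-cluster backend; not reachable if kube-proxy is absent
        name: example-webhook-svc
        namespace: example
        path: /validate
    rules:
      - apiGroups: ["rbac.authorization.k8s.io"]
        apiVersions: ["v1"]
        operations: ["CREATE", "UPDATE"]
        resources: ["clusterrolebindings", "rolebindings"]
    sideEffects: None
    admissionReviewVersions: ["v1"]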
Thanks for looking into this one!
I did not open a ticket on the kyverno side, as I believe this is not a kyverno-specific problem: as you mentioned, it fails because the validating webhook cannot reach the controller due to the missing kube-proxy, and the webhook does indeed have failurePolicy: Fail.
I understand the problem with webhooks, but I thought kube-proxy.yaml could actually be created even if the k8s API is not reachable; if that is not the case, I guess I will need to think about some workarounds.
Also, what other resources are critical for starting up rke2, other than secrets?
kube-proxy should be able to run, as it is a static pod deployed on each node - it does not need to go through scheduling or any of the usual stuff that would block a pod being created via the apiserver.
It is possible that the webhook might be blocking the overall startup of RKE2 though, including the bit that drops the kube-proxy static pod manifest. You should see errors in the rke2 journald logs about this though. Have you checked that yet?
kube-proxy is able to run for sure, but the manifest for it isn't getting created in /var/lib/rancher/rke2/agent/pod-manifests/ for some reason... If I create the manifest manually, it gets picked up and works.
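For anyone following along, this is roughly the shape of a static pod manifest that kubelet will pick up from that directory. It is a minimal illustrative sketch, not the manifest rke2 actually generates (image, flags, and paths are assumptions; copy the real file from a healthy node instead):

apiVersion: v1
kind: Pod
metadata:
  name: kube-proxy
  namespace: kube-system
spec:
  hostNetwork: true
  priorityClassName: system-node-critical
  containers:
    - name: kube-proxy
      image: rancher/hardened-kubernetes:v1.26.7-rke2r1  # assumption; use the image from a healthy node
      command:
        - kube-proxy
        - --cluster-cidr=10.42.0.0/16   # assumption; match your cluster config
        - --hostname-override=$(NODE_NAME)
        - --kubeconfig=/var/lib/rancher/rke2/agent/kubeproxy.kubeconfig
      env:
        - name: NODE_NAME
          valueFrom:
            fieldRef:
              fieldPath: spec.nodeName
      securityContext:
        privileged: true
      volumeMounts:
        - name: kubeconfig
          mountPath: /var/lib/rancher/rke2/agent/kubeproxy.kubeconfig
          readOnly: true
  volumes:
    - name: kubeconfig
      hostPath:
        path: /var/lib/rancher/rke2/agent/kubeproxy.kubeconfig
        type: File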
The only kube-proxy-related log line I see in rke2-server is:
Sep 21 12:41:55 ip-172-29-195-214 rke2[21652]: time="2023-09-21T12:41:55Z" level=info msg="Waiting to retrieve kube-proxy configuration; server is not ready: https://127.0.0.1:9345/v1-rke2/readyz: 500 Internal Server Error"
Kyverno also has resource filtering; I will try to configure a filter to skip resources in cattle-* namespaces.
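Kyverno's resource filters live in its ConfigMap; a sketch of what that could look like (the ConfigMap name/namespace depend on the install, and whether namespace wildcards like cattle-* work here should be verified against the kyverno docs for your version):

apiVersion: v1
kind: ConfigMap
metadata:
  name: kyverno        # kyverno's config ConfigMap; name/namespace depend on your install
  namespace: kyverno
data:
  # entries are [kind,namespace,name]; matching resources are skipped entirely
  resourceFilters: "[Event,*,*][*,kube-system,*][*,cattle-*,*]"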
There should be other messages in there that indicate why the readyz check isn't passing, such as waiting for the apiserver, roles, and so on.
You can also run kubectl get --raw /readyz?verbose on the server to see which apiserver health checks are failing, if any.
will try to configure it maybe I can configure filter to skip resources on cattle-* namespaces
rke2 doesn't use the cattle namespaces, those are only used by rancher. RKE2 stuff is all in kube-system.
Looks like all apiserver endpoints are ok
[+]ping ok
[+]log ok
[+]etcd ok
[+]etcd-readiness ok
[+]informer-sync ok
[+]poststarthook/start-kube-apiserver-admission-initializer ok
[+]poststarthook/generic-apiserver-start-informers ok
[+]poststarthook/priority-and-fairness-config-consumer ok
[+]poststarthook/priority-and-fairness-filter ok
[+]poststarthook/storage-object-count-tracker-hook ok
[+]poststarthook/start-apiextensions-informers ok
[+]poststarthook/start-apiextensions-controllers ok
[+]poststarthook/crd-informer-synced ok
[+]poststarthook/bootstrap-controller ok
[+]poststarthook/rbac/bootstrap-roles ok
[+]poststarthook/scheduling/bootstrap-system-priority-classes ok
[+]poststarthook/priority-and-fairness-config-producer ok
[+]poststarthook/start-cluster-authentication-info-controller ok
[+]poststarthook/start-kube-apiserver-identity-lease-controller ok
[+]poststarthook/start-kube-apiserver-identity-lease-garbage-collector ok
[+]poststarthook/start-legacy-token-tracking-controller ok
[+]poststarthook/aggregator-reload-proxy-client-cert ok
[+]poststarthook/start-kube-aggregator-informers ok
[+]poststarthook/apiservice-registration-controller ok
[+]poststarthook/apiservice-status-available-controller ok
[+]poststarthook/kube-apiserver-autoregistration ok
[+]autoregister-completion ok
[+]poststarthook/apiservice-openapi-controller ok
[+]poststarthook/apiservice-openapiv3-controller ok
[+]shutdown ok
readyz check passed
Here are also the kyverno validating webhook rules:
rules:
- apiGroups:
- ""
apiVersions:
- v1
operations:
- CREATE
- UPDATE
- DELETE
- CONNECT
resources:
- persistentvolumeclaims
- pods
- pods/ephemeralcontainers
- replicationcontrollers
scope: '*'
- apiGroups:
- apps
apiVersions:
- v1
operations:
- CREATE
- UPDATE
- DELETE
- CONNECT
resources:
- daemonsets
- deployments
- replicasets
- statefulsets
scope: '*'
- apiGroups:
- autoscaling
apiVersions:
- v1
operations:
- CREATE
- UPDATE
- DELETE
- CONNECT
resources:
- horizontalpodautoscalers
scope: '*'
- apiGroups:
- autoscaling
apiVersions:
- v2
operations:
- CREATE
- UPDATE
- DELETE
- CONNECT
resources:
- horizontalpodautoscalers
scope: '*'
- apiGroups:
- batch
apiVersions:
- v1
operations:
- CREATE
- UPDATE
- DELETE
- CONNECT
resources:
- cronjobs
- jobs
scope: '*'
- apiGroups:
- flowcontrol.apiserver.k8s.io
apiVersions:
- v1beta2
operations:
- CREATE
- UPDATE
- DELETE
- CONNECT
resources:
- flowschemas
- prioritylevelconfigurations
scope: '*'
- apiGroups:
- flowcontrol.apiserver.k8s.io
apiVersions:
- v1beta3
operations:
- CREATE
- UPDATE
- DELETE
- CONNECT
resources:
- flowschemas
- prioritylevelconfigurations
scope: '*'
- apiGroups:
- monitoring.coreos.com
apiVersions:
- v1
operations:
- CREATE
- UPDATE
- DELETE
- CONNECT
resources:
- prometheusrules
scope: '*'
- apiGroups:
- networking.k8s.io
apiVersions:
- v1
operations:
- CREATE
- UPDATE
- DELETE
- CONNECT
resources:
- ingresses
scope: '*'
- apiGroups:
- policy
apiVersions:
- v1
operations:
- CREATE
- UPDATE
- DELETE
- CONNECT
resources:
- poddisruptionbudgets
scope: '*'
- apiGroups:
- rbac.authorization.k8s.io
apiVersions:
- v1
operations:
- CREATE
- UPDATE
- DELETE
- CONNECT
resources:
- clusterrolebindings
- rolebindings
scope: '*'
- apiGroups:
- scheduling.k8s.io
apiVersions:
- v1
operations:
- CREATE
- UPDATE
- DELETE
- CONNECT
resources:
- priorityclasses
scope: '*'
- apiGroups:
- storage.k8s.io
apiVersions:
- v1
operations:
- CREATE
- UPDATE
- DELETE
- CONNECT
resources:
- csistoragecapacities
scope: '*'
- apiGroups:
- storage.k8s.io
apiVersions:
- v1beta1
operations:
- CREATE
- UPDATE
- DELETE
- CONNECT
resources:
- csistoragecapacities
scope: '*'
It doesn't track secrets/configMaps etc...
Also, rke2-server actually fails with:
Sep 21 17:57:41 ip-172-29-195-106 rke2[1367]: time="2023-09-21T17:57:41Z" level=fatal msg="clusterrole: EnsureRBACPolicy failed: unable to initialize roles: timed out waiting for the condition"
and RBAC is actually covered by the webhook rules:
- apiGroups:
- rbac.authorization.k8s.io
apiVersions:
- v1
operations:
- CREATE
- UPDATE
- DELETE
- CONNECT
resources:
- clusterrolebindings
- rolebindings
scope: '*'
I guess that's why it fails and just can't get to the point where the kube-proxy manifest is created.
I will try to configure kyverno to skip the cattle-* and rke2-* cluster roles/rolebindings.
Ah yeah, that'd be it. We ensure the state of some core RBAC for the embedded controllers; if kyverno is blocking those it would definitely create problems.
yeah, this is not an rke2 problem then; looks like I can work around this only on 1.27+ with AdmissionWebhookMatchConditions, as I would need to skip the clusterrolebindings rke2-server is trying to provision.
But if rke2-server created the kube-proxy manifest before trying to provision RBAC, it would eventually work; I am not sure whether that makes sense from the RKE2 perspective, though.
it doesn't drop the kube-proxy manifest until after the rest of the core stuff is up, as it needs to query some stuff from the core in order to determine whether or not to enable kube-proxy, since kube-proxy is considered a client component. It's definitely a bit complicated.
@riuvshyn could you please share how you were able to work around this issue? Thank you
hey @gapopp It really depends on which kyverno policies you have... so first of all, check your kyverno admission failures.
In my case I had to configure the following policies with failurePolicy set to Ignore to stop them from blocking rke2 bootstrapping:
restrict-binding-clusteradmin
restrict-binding-system-groups
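For reference, that knob is per-policy; a rough sketch of the shape (the rule body here is illustrative, not the actual policy from the kyverno policy library):

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: restrict-binding-clusteradmin
spec:
  validationFailureAction: Audit
  # Ignore = the apiserver admits the request when kyverno is unreachable,
  # so this policy can no longer wedge cluster bootstrap
  failurePolicy: Ignore
  rules:
    - name: clusteradmin-bindings
      match:
        any:
          - resources:
              kinds:
                - RoleBinding
                - ClusterRoleBinding
      validate:
        message: "Binding to cluster-admin is restricted."  # illustrative rule body
        pattern:
          roleRef:
            name: "!cluster-admin"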
And once I am on 1.27, I plan to tune the kyverno webhooks with AdmissionWebhookMatchConditions to exclude rke2-related stuff; see the sketch below.
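A sketch of what the matchConditions approach could look like on 1.27+ (the exclusion expression, the rke2- name prefix, and the service details are assumptions; check what identities and names rke2 actually uses before relying on this):

apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: kyverno-resource-validating-webhook-cfg   # name is an assumption
webhooks:
  - name: validate.kyverno.svc-fail
    failurePolicy: Fail
    # needs Kubernetes 1.27+ (AdmissionWebhookMatchConditions feature);
    # requests for which the CEL expression is false skip this webhook entirely
    matchConditions:
      - name: skip-rke2-managed-bindings
        # assumption: the bindings rke2-server provisions share an rke2- prefix
        expression: '!(request.name.startsWith("rke2-"))'
    rules:
      - apiGroups: ["rbac.authorization.k8s.io"]
        apiVersions: ["v1"]
        operations: ["CREATE", "UPDATE", "DELETE"]
        resources: ["clusterrolebindings", "rolebindings"]
    clientConfig:
      service:
        name: kyverno-svc       # assumption; depends on your kyverno install
        namespace: kyverno
        path: /validate/fail
    sideEffects: None
    admissionReviewVersions: ["v1"]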
for the record, we're seeing a variant of this issue in https://github.com/rancher/rke2/issues/5693
I would tend to think that changing the failurePolicy of webhooks to Ignore is not a satisfying solution to this problem, because (a) changing this field may or may not be easy (for Rancher webhooks, it definitely does not seem trivial), and (b) failurePolicy: Fail is a legitimate value that can be chosen to enforce things for security reasons, and loosening how strictly such things are enforced is not a decision that can always be made.
Environmental Info:
RKE2 Version: v1.26.7+rke2r1

Cluster Configuration: 3 CP / 3 Workers

Describe the bug: Control plane node rotation fails after kyverno is installed. The new node comes up and some k8s components are actually getting provisioned, except kube-proxy. rke2-server is crashlooping and kube-proxy never gets provisioned. In the rke2-server logs not much is going on related to kube-proxy; for some reason the kube-proxy.yaml manifest is not getting created in /var/lib/rancher/rke2/agent/pod-manifests/.
Manual creation of /var/lib/rancher/rke2/agent/pod-manifests/kube-proxy.yaml allows bootstrapping of the CP node to continue, and then it works fine. This bug can't be reproduced with the kyverno admission controller scaled down.

Steps To Reproduce:

Expected behavior: CP node is rotated, and on the new CP node all k8s core components are deployed and functioning as expected.

Actual behavior: CP node is rotated, and on the new CP node kube-proxy is missing, which prevents the CP node from transitioning to a healthy state.

Additional context / logs: All kyverno policies are in Audit mode, so they shouldn't actually be blocking anything, and I believe this will reproduce even without any policies created; it looks like the issue is the kyverno mutation/validation webhooks, which are supposed to intercept all traffic to the API server. It looks like rke2-server is attempting to perform some k8s API calls, but they fail because kyverno is not reachable without kube-proxy.
is missing which prevents CP node to transition to healthy state.Additional context / logs: All kyverno policies are in Audit mode so shouldn't be actually blocking anything and I believe it will reproduce even without any policies created it looks like the issue is kyverno mutation/validation webhooks which are supposed to intercept all traffic to API server. Looks like rke2-server is attempting to perform some k8s API calls but it is getting failed as kyverno is not accessible without kube-proxy.