rancher / rke2

https://docs.rke2.io/
Apache License 2.0

RKE2 failing to start: fatal, Failed to apply network policy default-network-ingress-webhook-policy to namespace kube-system #5693

Closed by tmmorin 3 months ago

tmmorin commented 5 months ago

Context:

```
Apr 08 19:17:33 management-cluster-cp-d5098df345-mnpm4 rke2[3922111]: time="2024-04-08T19:17:33Z" level=fatal msg="Failed to apply network policy default-network-ingress-webhook-policy to namespace kube-system: Internal error occurred: failed calling webhook \"rancher.cattle.io.namespaces\": failed to call webhook: Post \"https://rancher-webhook.cattle-system.svc:443/v1/webhook/validation/namespaces?timeout=10s\": context deadline exceeded"
```

This error is produced by this part of the RKE2 code:

https://github.com/rancher/rke2/blob/bbda82440b71ff55352d6ead35d169afee3d3387/pkg/rke2/np.go#L213-L225

After applying network policies to namespaces, this code annotates those namespaces. In the presence of webhooks that trigger on Namespace updates, this does not work at this early stage of RKE2 startup: kube-proxy is not yet ready to provide connectivity to the webhook service (a long-standing issue, see https://github.com/rancher/rke2/issues/4781#issuecomment-1730187008).
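
For illustration, a minimal sketch of the kind of operation that fails (the annotation key/value here are placeholders, not the ones RKE2 actually uses): any update to a namespace triggers webhooks registered for Namespace UPDATE operations, and a fail-closed webhook that is unreachable rejects the update.

```
# Placeholder annotation, for demonstration only: any update to
# kube-system triggers webhooks registered for Namespace UPDATE.
# With a fail-closed webhook whose backing pod is unreachable, this
# request fails -- just like RKE2's own annotation update at startup.
kubectl annotate namespace kube-system example.placeholder/applied=true
```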

tmmorin commented 5 months ago

For reference, the corresponding issue in the Sylva project: https://gitlab.com/sylva-projects/sylva-core/-/issues/1155

tmmorin commented 5 months ago

Hello @brandond -- I see you commented at https://github.com/rancher/rke2/issues/4781#issuecomment-1730187008, which is related to this issue.

It seems to me that the whole class of cases where "RKE2 startup is prevented by a webhook acting on some API operation done before kube-proxy is ready" needs to be addressed... could that be solved by changing when kube-proxy is set up?

brandond commented 5 months ago

RKE2 uses annotations on the system namespaces to track the state of various hardening processes that should only be performed once. Any products that deploy fail-closed webhooks that block modifications to the system namespaces are likely to break RKE2, if deployed before the hardening occurs, or during upgrades that make changes to the hardened policies.
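
For illustration, these tracking annotations can be inspected directly; a small sketch (the np.rke2.io annotation key is taken from the np.go source linked above, so verify it against your RKE2 version):

```
# Show the annotations on a system namespace; the np.rke2.io keys
# (see pkg/rke2/np.go) record that the default policies were applied.
kubectl get namespace kube-system -o jsonpath='{.metadata.annotations}' | tr ',' '\n'
```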

I personally think that deploying fail-closed webhooks that block changes to core types, and hosting the webhook on a pod in the cluster it is protecting, is a bad idea. It is super common to end up with chicken-and-egg problems like this during a cold cluster restart - but it seems to be a recurring pattern across the ecosystem.

We can evaluate changing how we track our hardening to avoid modifying the system namespaces, but this is unlikely to be changed soon.

tmmorin commented 5 months ago

> Any products that deploy fail-closed webhooks that block modifications to the system namespaces are likely to break RKE2, if deployed before the hardening occurs, or during upgrades that make changes to the hardened policies.

Rancher Server itself would, I think, fall into this category, right?

> [...] if deployed before the hardening occurs, or during upgrades that make changes to the hardened policies.

This includes simple scenarios like:

My feeling is that the central issue is that RKE2 won't start if some API action it wants to perform triggers a fail-closed webhook. Addressing this seems necessary beyond the namespace-hardening case at hand, and solving it would solve this issue among others.

I don't disagree that "webhooks that block changes to core types, and hosting the webhook on a pod in the cluster it is protecting, is a bad idea", but given that this is commonplace, in particular in the Rancher/RKE2 ecosystem, isn't it worth making RKE2 more robust to it?

Also, as a side note: the RKE2 hardening code annotates the Namespaces apparently just to keep track that the network policies have been applied. I would tend to see some drawbacks in doing it like that:

  • it does not help updating the content of an existing Network policy
  • a platform engineering team using RKE2 might want to apply/update network policies with a different tooling

Last, today, some of those network policies are applied even if the component they relate to isn't enabled in RKE2 (e.g. the ingress-nginx network policies are applied even if ingress-nginx deployment by RKE2 is disabled).

brandond commented 5 months ago
> it does not help updating the content of an existing Network policy

That is intentional. Once the policies are installed and the annotation added, RKE2 will not change them, so that administrators can modify them as necessary to suit their needs. The annotations can be removed to force RKE2 to re-sync the policies.

> a platform engineering team using RKE2 might want to apply/update network policies with a different tooling

You are welcome to do that; once RKE2 has created them it will no longer modify them as long as the annotations on the NS remain in place.

Like I said earlier, we can look at different ways to do this, but RKE2 has functioned like this for quite a while, and we are unlikely to refactor it on short notice.
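
As a concrete sketch of the re-sync mechanism described above (the np.rke2.io annotation key is taken from pkg/rke2/np.go; verify the exact keys for your RKE2 version):

```
# Removing the tracking annotation (note the trailing '-') makes RKE2
# re-apply its default network policies the next time it starts.
kubectl annotate namespace kube-system np.rke2.io-
```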

tmmorin commented 5 months ago

> [...] we are unlikely to refactor it on short notice.

Of course, I understand this well, and would not ask for that.

We have already implemented a viable short-term workaround for this issue, by ensuring that these annotations are set before the RKE2 upgrade (https://gitlab.com/sylva-projects/sylva-core/-/issues/1155).

> a platform engineering team using RKE2 might want to apply/update network policies with a different tooling

> You are welcome to do that; once RKE2 has created them it will no longer modify them as long as the annotations on the NS remain in place.

Well, as said above, this works in the short term, but for each new version of RKE2 we'll have to check whether new such annotations are necessary, and we'll have to maintain and test the code that ensures this is done prior to the upgrade.

I'd rather prefer an approach where we could "opt out" of this: a configuration flag allowing RKE2 not to handle these network policies. Or perhaps ship them as a Helm chart like some other base charts (e.g. the CNI). Or, for the particular case of the network policies related to ingress-nginx, have them bundled in the ingress-nginx chart (so that we wouldn't have the network policies if we set disableComponents.pluginComponents: [rke2-ingress-nginx]).

But again, the underlying issue looks more important to me: the fact that we can't have any fail-closed webhook on any resource that RKE2 touches during the early stages, while kube-proxy isn't ready, is seriously limiting. I of course wouldn't ask for a short-term fix on this either, but I'm interested to know what the plans are.

brandond commented 4 months ago

matchConditions are GA in 1.30; I'd like to see folks start using those to exclude system users or groups from webhooks.
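
A minimal sketch of what that could look like (the webhook name, service, and exact exclusion expressions below are illustrative, not taken from any real product):

```
# Illustrative ValidatingWebhookConfiguration using matchConditions
# (GA in Kubernetes 1.30). A request is sent to the webhook only if
# ALL expressions evaluate to true, so these skip system identities.
kubectl apply -f - <<'EOF'
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: example-namespace-webhook        # hypothetical name
webhooks:
  - name: namespaces.example.io          # hypothetical webhook
    admissionReviewVersions: ["v1"]
    sideEffects: None
    failurePolicy: Fail                  # fail-closed, the problematic setting
    clientConfig:
      service:                           # hypothetical in-cluster service
        name: example-webhook
        namespace: example-system
        path: /validate-namespaces
    rules:
      - apiGroups: [""]
        apiVersions: ["v1"]
        operations: ["UPDATE"]
        resources: ["namespaces"]
    matchConditions:
      - name: exclude-system-users
        expression: "!request.userInfo.username.startsWith('system:')"
      - name: exclude-system-masters-group
        expression: "!('system:masters' in request.userInfo.groups)"
EOF
```

Requests from the excluded identities then bypass the webhook entirely, which would avoid the startup deadlock described in this issue.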

fmoral2 commented 3 months ago

Validated on Version:

```
rke2 version v1.30.2-rc5+rke2r1 (3f678f964ad849e24449e49f0c2c44e75d944c9f)
```

Environment Details

  • Infrastructure: Cloud (EC2 instance)
  • Node(s) CPU architecture, OS, and version: Ubuntu, AMD64
  • Cluster configuration: 3 server nodes, 1 agent node

Steps to validate the fix

  1. Install rke2
  2. Install helm webhooks
  3. Join a new node on an upgraded version
  4. Validate rke2 is up and running
  5. Validate that no error from webhook is seen in the logs
  6. Validate pods

Reproduction Issue:

```
rke2 version v1.27.2+rke2r1 (300a06dabe679c779970112a9cb48b289c17536c)

helm repo add rancher-latest https://releases.rancher.com/server-charts/latest
helm install rancher rancher-latest/rancher \
  --namespace cattle-system \
  --set hostname=rancher.yourdomain.com

kubectl create namespace kyverno
helm repo add kyverno https://kyverno.github.io/kyverno/
helm install kyverno kyverno/kyverno --namespace kyverno

kubectl get validatingwebhookconfigurations
NAME                               WEBHOOKS   AGE
cert-manager-webhook               1          3m12s
rke2-ingress-nginx-admission       1          21m
rke2-snapshot-validation-webhook   1          21m
validating-webhook-configuration   12         86s

:~> kubectl get mutatingwebhookconfigurations
NAME                             WEBHOOKS   AGE
cert-manager-webhook             1          3m21s
mutating-webhook-configuration   9          95s

# On a new node joining the cluster with the upgraded version:
sudo curl -sfL https://get.rke2.io | sudo INSTALL_RKE2_VERSION=v1.28.8+rke2r1 sh -
sudo journalctl -u rke2-server -f | grep "failed to call webhook"
Jun 21 12:01:31 rke2[2060]: time="2024-06-21T12:01:31Z" level=warning msg="Failed to create Kubernetes secret: Internal error occurred: failed calling webhook \"rancher.cattle.io.secrets\": failed to call webhook: Post \"https://rancher-webhook.cattle-system.svc:443/v1/webhook/mutation/secrets?timeout=15s\": context deadline exceeded"
```

Validation Results:
```
helm repo add rancher-latest https://releases.rancher.com/server-charts/latest
helm install rancher rancher-latest/rancher --version 2.8.5 \
  --namespace cattle-system --create-namespace \
  --set hostname=rancher.yourdomain.com

kubectl create namespace kyverno
helm repo add kyverno https://kyverno.github.io/kyverno/
helm install kyverno kyverno/kyverno --namespace kyverno

kubectl get validatingwebhookconfigurations
NAME                               WEBHOOKS   AGE
cert-manager-webhook               1          3m12s
rke2-ingress-nginx-admission       1          21m
rke2-snapshot-validation-webhook   1          21m
validating-webhook-configuration   12         86s

:~> kubectl get mutatingwebhookconfigurations
NAME                             WEBHOOKS   AGE
cert-manager-webhook             1          3m21s
mutating-webhook-configuration   9          95s

# On a new node joining the cluster with the upgraded version:
sudo curl -sfL https://get.rke2.io | sudo INSTALL_RKE2_VERSION=v1.30.2-rc5+rke2r1 sh -
sudo journalctl -u rke2-server -f | grep "failed to call webhook"
<>

> kubectl get mutatingwebhookconfigurations
NAME                                    WEBHOOKS   AGE
cert-manager-webhook                    1          9m22s
kyverno-policy-mutating-webhook-cfg     1          6m50s
kyverno-resource-mutating-webhook-cfg   0          6m49s
kyverno-verify-mutating-webhook-cfg     1          6m49s

kubectl get validatingwebhookconfigurations
NAME                                            WEBHOOKS   AGE
cert-manager-webhook                            1          9m35s
kyverno-cleanup-validating-webhook-cfg          1          7m35s
kyverno-exception-validating-webhook-cfg        1          7m3s
kyverno-global-context-validating-webhook-cfg   1          7m3s
kyverno-policy-validating-webhook-cfg           1          7m3s
kyverno-resource-validating-webhook-cfg         0          7m2s
kyverno-ttl-validating-webhook-cfg              1          7m35s
rke2-ingress-nginx-admission                    1          51m
rke2-snapshot-validation-webhook                1          51m
```
Kellen275 commented 1 month ago

It looks like this was backported to v1.28.11. Is there a recommended workaround for folks on earlier 1.28 versions?

brandond commented 1 month ago

If possible, you can temporarily edit the webhook configuration to fail open so that rke2 can start up successfully. Once that's done you can revert it to the desired configuration.
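
A sketch of that workaround, assuming the Rancher webhook from the logs above is the one blocking startup (the configuration name rancher.cattle.io and the webhook index 0 are assumptions; verify them first):

```
# Find the configuration and the webhook entry that is blocking startup:
kubectl get validatingwebhookconfigurations

# Temporarily let the webhook fail open so rke2 can start
# (name and index are assumptions; adjust as needed):
kubectl patch validatingwebhookconfiguration rancher.cattle.io --type=json \
  -p='[{"op":"replace","path":"/webhooks/0/failurePolicy","value":"Ignore"}]'

# Once rke2 is up, revert to the original fail-closed behavior:
kubectl patch validatingwebhookconfiguration rancher.cattle.io --type=json \
  -p='[{"op":"replace","path":"/webhooks/0/failurePolicy","value":"Fail"}]'
```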

Preferably you would upgrade though.