kubewarden / kubewarden-controller

Manage admission policies in your Kubernetes cluster with ease
https://kubewarden.io
Apache License 2.0

Investigate failure on webhooks not ready when installing cert-manager from helm chart #110

Closed · viccuad closed this issue 2 years ago

viccuad commented 3 years ago

With the cert-manager helm chart:

$ helm repo add jetstack https://charts.jetstack.io && helm repo update
$ helm install --wait cert-manager jetstack/cert-manager --create-namespace -n cert-manager --set installCRDs=true

$ helm repo add kubewarden https://charts.kubewarden.io && helm repo update
$ helm install --create-namespace -n kubewarden kubewarden-crds kubewarden/kubewarden-crds
$ helm install --wait -n kubewarden kubewarden-controller kubewarden/kubewarden-controller

Every time I try those commands on a fresh k3d cluster, the kubewarden-controller post-install hook that creates the default policy-server fails:

Error: failed post-install: warning: Hook post-install kubewarden-controller/templates/policyserver-default.yaml failed: Internal error occurred: failed calling webhook "mpolicyserver.kb.io": Post "https://kubewarden-controller-webhook-service.kubewarden.svc:443/mutate-policies-kubewarden-io-v1alpha2-policyserver?timeout=10s": dial tcp 10.43.7.43:443: connect: connection refused

or

Error: failed post-install: warning: Hook post-install kubewarden-controller/templates/policyserver-default.yaml failed: Internal error occurred: failed calling webhook "mpolicyserver.kb.io": Post "https://kubewarden-controller-webhook-service.kubewarden.svc:443/mutate-policies-kubewarden-io-v1alpha2-policyserver?timeout=10s": context deadline exceeded

This doesn't happen to me when installing cert-manager from the static manifest:

$ kubectl apply -f https://github.com/jetstack/cert-manager/releases/download/v1.5.4/cert-manager.yaml

I suppose it's just timing on my system, but it's incredibly annoying. I looked at the manager's ready probes and at the state of the resources after deploying our kubewarden-controller chart, and they seem fine. I don't see yet why it happens.

viccuad commented 2 years ago

I think the problem is that helm install --wait kubewarden-controller doesn't wait for the ready probe, or that the ready probe signals OK before the kubewarden-controller has had time to set up the webhooks (on install, the controller chart creates the certs via cert-manager and then registers itself with the admission controller). Hence, on the first run, the controller is not ready to instantiate the default PolicyServer:

$ helm install --wait -n kubewarden kubewarden-controller kubewarden/kubewarden-controller --debug
install.go:173: [debug] Original chart version: ""
install.go:190: [debug] CHART PATH: /home/vic/.cache/helm/repository/kubewarden-controller-0.3.3.tgz

client.go:122: [debug] creating 23 resource(s)
wait.go:53: [debug] beginning wait for 23 resources with timeout of 5m0s
wait.go:244: [debug] Deployment is not ready: kubewarden/kubewarden-controller. 0 out of 1 expected pods are ready
wait.go:244: [debug] Deployment is not ready: kubewarden/kubewarden-controller. 0 out of 1 expected pods are ready
wait.go:244: [debug] Deployment is not ready: kubewarden/kubewarden-controller. 0 out of 1 expected pods are ready
wait.go:244: [debug] Deployment is not ready: kubewarden/kubewarden-controller. 0 out of 1 expected pods are ready
wait.go:244: [debug] Deployment is not ready: kubewarden/kubewarden-controller. 0 out of 1 expected pods are ready
wait.go:244: [debug] Deployment is not ready: kubewarden/kubewarden-controller. 0 out of 1 expected pods are ready
wait.go:244: [debug] Deployment is not ready: kubewarden/kubewarden-controller. 0 out of 1 expected pods are ready
wait.go:244: [debug] Deployment is not ready: kubewarden/kubewarden-controller. 0 out of 1 expected pods are ready
wait.go:244: [debug] Deployment is not ready: kubewarden/kubewarden-controller. 0 out of 1 expected pods are ready
client.go:282: [debug] Starting delete for "default" PolicyServer
client.go:311: [debug] policyservers.policies.kubewarden.io "default" not found
client.go:122: [debug] creating 1 resource(s)
Error: failed post-install: warning: Hook post-install kubewarden-controller/templates/policyserver-default.yaml failed: Internal error occurred: failed calling webhook "mpolicyserver.kb.io": Post "https://kubewarden-controller-webhook-service.kubewarden.svc:443/mutate-policies-kubewarden-io-v1alpha2-policyserver?timeout=10s": dial tcp 10.43.246.134:443: connect: connection refused
helm.go:81: [debug] failed post-install: warning: Hook post-install kubewarden-controller/templates/policyserver-default.yaml failed: Internal error occurred: failed calling webhook "mpolicyserver.kb.io": Post "https://kubewarden-controller-webhook-service.kubewarden.svc:443/mutate-policies-kubewarden-io-v1alpha2-policyserver?timeout=10s": dial tcp 10.43.246.134:443: connect: connection refused
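
For reference, a quick way to inspect the state of things at the moment the hook fires (a rough sketch; the deployment name comes from the --debug output above and the Service name from the error message):

# Is the controller deployment reporting ready pods at this point?
$ kubectl -n kubewarden get deployment kubewarden-controller

# Does the webhook Service already have endpoints behind it?
$ kubectl -n kubewarden get endpoints kubewarden-controller-webhook-service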
viccuad commented 2 years ago

Maybe this is related: helm/helm#3173

jvanz commented 2 years ago

I've tried to reproduce this issue on a bunch of different K3s versions. I wrote a little script that runs the installation against each of them:

#!/bin/bash

VERSIONS=$(curl https://api.github.com/repos/k3s-io/k3s/releases | jq ".[] | select(.prerelease==false and .draft==false) | .name" | sed "s/+/-/" | sed "s/\"//g" | sort)
echo $VERSIONS

for VERSION in $VERSIONS; do
        echo "###################################################"
        echo "#####            Testing $VERSION              ####"
        echo "###################################################"
        make K3S_IMAGE=rancher/k3s:$VERSION install-e2e-test
done

The install-e2e-test target runs your commands from the bug description:

.PHONY: delete-k8s-cluster
delete-k8s-cluster:
        - k3d cluster delete $(CLUSTER_NAME)

.PHONY: create-k8s-cluster
create-k8s-cluster: delete-k8s-cluster
        k3d cluster create $(CLUSTER_NAME) --wait --timeout $(TIMEOUT) --image $(K3S_IMAGE) --config $(RESOURCES_DIR)/k3d-default.yaml -v /dev/mapper:/dev/mapper
        $(call kube,  wait --for=condition=Ready nodes --all)

.PHONY: install-e2e-test
install-e2e-test: CLUSTER_NAME := testing-cert-manager-helm-chart
install-e2e-test: create-k8s-cluster
        helm repo add jetstack https://charts.jetstack.io
        helm repo update
        helm install --wait cert-manager jetstack/cert-manager --create-namespace -n cert-manager --set installCRDs=true
        helm repo add kubewarden https://charts.kubewarden.io
        helm repo update
        helm install --create-namespace -n kubewarden kubewarden-crds kubewarden/kubewarden-crds
        helm install --wait -n kubewarden kubewarden-controller kubewarden/kubewarden-controller

The logs of the script

What's the helm version you are using? Furthermore, considering this is not a Helm issue, I'm thinking about how to work around it. What do you think about adding another post-install hook that waits until the kubewarden-controller is ready before trying to create the default policy server?
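
Such a wait hook would not need much more than the following (a rough sketch only; the deployment and Service names are taken from the debug output and error messages earlier in this issue):

#!/bin/bash
# Pre-hook for policyserver-default.yaml: block until the controller
# deployment is Available and its webhook Service has at least one endpoint.
set -e

kubectl -n kubewarden wait --for=condition=Available --timeout=5m \
  deployment/kubewarden-controller

until kubectl -n kubewarden get endpoints kubewarden-controller-webhook-service \
  -o jsonpath='{.subsets[*].addresses[*].ip}' | grep -q .; do
  echo "waiting for webhook endpoints..."
  sleep 2
done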

Another option I can think of is to find a way to create the default policy server resource outside of the post-install hook, and to update the controller to watch for pending policy server resources and deploy them as soon as possible.

jvanz commented 2 years ago

I'm taking a look at the controller-runtime and kubewarden-controller code to understand when we report the controller as ready. It turns out the controller always reports as ready: we are using the Ping checker, which always returns true. Furthermore, controller-runtime starts the health probes in a goroutine before the goroutine that triggers the manager runnables. Thus, if Helm is fast enough, it could run the post-install hook before the controller is really ready.

I'm not 100% sure my rationale is right, because I don't have much experience with Kubernetes controllers, but I think we can try to improve our health check so that we report the controller as ready only after the policy server controller has started running. However, I could not find an easy way to do that with the data available in the manager.
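
One way to see the effect from the outside, without touching the Go code, is to compare the health endpoint with the webhook port right after the pod starts. A sketch, assuming the usual kubebuilder defaults of 8081 for the health probes and 9443 for the webhook server (which may not match our chart):

$ kubectl -n kubewarden port-forward deploy/kubewarden-controller 8081:8081 9443:9443 &

# /readyz is backed by the Ping checker, so it reports ok as soon as the
# probe server is up...
$ curl -s http://localhost:8081/readyz

# ...while the webhook port can still refuse connections for a short while.
$ curl -vk https://localhost:9443/ -o /dev/null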

I'm ruling out that this is a Helm issue, as mentioned before by @viccuad, simply because I could not reproduce it and the bug mentioned above should be fixed in the latest Helm versions.

flavio commented 2 years ago

Good catch @jvanz :clap:

I think we should implement a readiness probe

viccuad commented 2 years ago

What's the helm version you are using?

Here I have v3.5.4, which should contain the fix for that Helm issue. But someone reported a regression in v3.6.3; I haven't tried it yet.

But given your awesome later comment, José, I agree on improving the health check, or better, on adding a readiness probe :).

I suppose the readiness probe needs to check that the controller manager has completely registered our controller with the admission controller, so that our controller itself is ready to start registering policy-servers with the admission controller (which is what currently fails on the post-install hook, when it tries to register the default policy-server).
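
From the outside, "registered completely" roughly means that the webhook configurations exist and that cert-manager has injected the CA bundle into them. A quick way to check that, as a sketch with plain kubectl and no assumptions about resource names:

# Print each mutating webhook configuration and the beginning of its caBundle;
# an empty caBundle means cert-manager's cainjector has not finished yet.
$ kubectl get mutatingwebhookconfigurations \
    -o jsonpath='{range .items[*]}{.metadata.name}{": "}{.webhooks[*].clientConfig.caBundle}{"\n"}{end}' \
    | cut -c1-120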

viccuad commented 2 years ago

Closing the card as discussed in the planning meeting, as the scope is already achieved. Work continues in https://github.com/kubewarden/kubewarden-controller/issues/142.

jvanz commented 2 years ago

I'm reopening this because what I described in https://github.com/kubewarden/kubewarden-controller/issues/110#issuecomment-1002220793 is not the actual issue, even though what I said there is still correct.

The problem here is a networking issue. There are a bunch of different projects suffering from similar problems, like nginx-ingress and cert-manager. The solution can change depending on the environment.

I could not reproduce the issue in my local cluster, but I can see the problem using @viccuad's dotfiles repository [1]. This is not an easy issue to reproduce, so to help with that I wrote this script:

#!/bin/bash
minikube delete --all
minikube start --feature-gates=EphemeralContainers=true --extra-config=apiserver.v=5

CONTEXT=minikube

CRDS_CHART=kubewarden/kubewarden-crds
CONTROLLER_CHART=./kubewarden-controller
DATE=$(date +"%Y-%m-%d-%R")

kubectl --context $CONTEXT wait --for=condition=Ready nodes --all
kubectl --context $CONTEXT create namespace kubewarden
kubectl --context $CONTEXT apply -f https://github.com/jetstack/cert-manager/releases/latest/download/cert-manager.yaml
kubectl --context $CONTEXT wait --for=condition=Available deployment --timeout=2m -n cert-manager --all
helm install --kube-context $CONTEXT --wait -n kubewarden --create-namespace kubewarden-crds $CRDS_CHART

versions="controller-readiness-probe"
# This loop is no longer needed, since the version is not used anymore. Kept for legacy reasons.
for version in $versions; do
        echo ""
        log_file=controller-$version-$DATE.logs
        rm --force $log_file
        for num in {1..100}; do
                echo "Try #$num"
                echo "Try #$num" >> $log_file
                date >> $log_file
                helm uninstall --kube-context $CONTEXT --wait --timeout 1m -n kubewarden kubewarden-controller >> $log_file 2>&1
                helm install --kube-context $CONTEXT --wait -n kubewarden kubewarden-controller $CONTROLLER_CHART >> $log_file 2>&1 || break
        done
done

minikube logs > "minikube-$DATE.logs"

I still cannot find the exact root cause of the problem. However, I found a workaround for it. Well... let me rephrase that: my script does not fail anymore after the following change:

minikube start --feature-gates=EphemeralContainers=true --extra-config=apiserver.v=5 --cni=cilium

In other words, after changing the CNI everything runs fine. @viccuad, can you validate this in your environment as well?

I do not believe this is the best solution, though. We need to dig deeper into the networking side to, maybe, really fix the problem.

So, I think we can do the following:

  1. Document this in the docs.
  2. Maybe change our helm charts to check whether the service is indeed responding before trying to install a policy server. One option for this is to add an init container to the default policy server configuration that checks whether the Kubewarden controller is responding (see the sketch below).
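
For option 2, the init container would essentially just loop until the controller's webhook Service accepts connections. A minimal sketch of the command it could run (the Service URL comes from the error messages above; which image provides curl is left open):

# Block until the controller's webhook Service accepts TLS connections.
# The certificate is deliberately not verified here (-k); we only care
# about reachability, not about validating the endpoint.
until curl -ks --max-time 2 \
  https://kubewarden-controller-webhook-service.kubewarden.svc:443/ > /dev/null; do
  echo "kubewarden-controller webhook not reachable yet, retrying..."
  sleep 2
done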

@nunix mentioned that the same issue is happening with Rancher Desktop on Windows, right? There is an open issue for Rancher about a similar problem. Could it be related?

[1] I've removed the Ansible steps from the Vagrant file to simplify the deployment.

jvanz commented 2 years ago

@flavio @viccuad @ereslibre any thoughts about https://github.com/kubewarden/kubewarden-controller/issues/110#issuecomment-1035524285?

ereslibre commented 2 years ago

I didn't have the time to check that. I found some issues with @viccuad on this environment when using k9s: Docker containers had connectivity, but couldn't resolve names. We changed the resolv.conf on one of the containers to point to 8.8.8.8 and it could resolve fine. At that point the containers started to appear as Running.

I don't have any comments about https://github.com/kubewarden/kubewarden-controller/issues/110#issuecomment-1035524285 to be honest.

My opinion is that it's a really hard problem to solve because we don't know what is wrong, and it is not easy to reproduce and track.

There is also the possibility that different problems have the same consequence: e.g., a myriad of problems can lead to the single error connection refused. To simplify, in such a situation: is something wrong on the client? On the server? Somewhere in between? What exactly, where, how...? You always see the same error, connection refused, but literally a myriad of reasons can lead to it.

In my opinion there is not enough information and evidence to tackle this problem properly.

I vote to not try to tackle it, given that the impact is extremely small. If the situation changes, the priority will rise and we will be in a better position to debug, identify and fix it.

jvanz commented 2 years ago

Did the resolv.conf change fix the issue of kube api not being able to find the webhook service?

During my investigations I wrote a script which executes nslookup while reinstalling Kubewarden, and I could see issues in the name resolution. But I'm wondering why adding 8.8.8.8 would fix the issue of accessing an internal cluster resource.

ereslibre commented 2 years ago

Did the resolv.conf change fix the issue of kube api not being able to find the webhook service?

We were looking because on his environment something was off, at least when using k3d (I think other envs as well). We were just debugging why images could not be pulled with k9s, not this issue specifically. What I'm trying to say is that on that environment there were some odd things going on.

I didn't try this reproducible vagrant environment though, so sorry for the noise.

jvanz commented 2 years ago

I think I know what's going on... Kubernetes Services, by default, use the iptables proxy mode. This means that when a process tries to access a ClusterIP, iptables rules redirect the packets to the right pod. These rules are written by kube-proxy, which watches the control plane for new Services and Endpoints.

My hypothesis is that kube-proxy is not fast enough to detect the changes and rewrite the iptables rules. Thus, when the kube API server tries to access the service, there is no route to the pod, causing the error. This also explains why after a (very short) time we can reach the service.
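
If that hypothesis is right, it should be observable in the nat table: right after the Service is created there is a short window where kube-proxy has not yet written any rule for its ClusterIP. A rough sketch to check that while reinstalling (the Service name comes from the error messages; the iptables command has to run on the node, e.g. via minikube ssh):

# ClusterIP of the controller's webhook Service
CLUSTER_IP=$(kubectl -n kubewarden get svc kubewarden-controller-webhook-service \
  -o jsonpath='{.spec.clusterIP}')

# On the node: has kube-proxy already programmed a rule for that ClusterIP?
# (in iptables mode the Service rules hang off the KUBE-SERVICES chain)
sudo iptables -t nat -S KUBE-SERVICES | grep "$CLUSTER_IP" \
  || echo "kube-proxy has not written the rules for this Service yet"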

To validate this idea I've retested using the script from https://github.com/kubewarden/kubewarden-controller/issues/110#issuecomment-1035524285, changing the line where the minikube cluster is started to:

minikube start --feature-gates=EphemeralContainers=true --extra-config=apiserver.v=5 --extra-config 'kube-proxy.mode=userspace'

With the proxy mode changed to userspace, the ClusterIP now points to kube-proxy, which then proxies the packets to the pod. It works! Again, at least my script is not failing anymore.

Considering that we already have two workarounds for this, I believe we can close this issue. Furthermore, if we want to improve the user experience, we may try to add some mechanism to detect this behavior during the helm installation, as mentioned in https://github.com/kubewarden/kubewarden-controller/issues/110#issuecomment-1035524285

@viccuad feel free to reopen if there is something missing.

jvanz commented 2 years ago

Ah! Using the kube-proxy in userspace mode is not recommended for production because it is slower. I just used it as an example of how to validate my hypothesis.

ereslibre commented 2 years ago

Thanks a lot @jvanz, this is really helpful! Also, I hit this problem once in my environment when running the e2e tests several times, so I can also reproduce the issue to some extent :)

Great investigation work. Kudos!

viccuad commented 2 years ago

Thanks indeed, José, for the investigation! I validated the userspace workaround here too, and will go with that until I re-imagine my stack.