viccuad closed this issue 2 years ago
I think the problem is that helm install --wait kubewarden-controller doesn't wait for the ready probe, or the ready probe signals OK before the kubewarden-controller has had time to set up the webhooks (on install, the controller chart creates the certs via cert-manager and then registers itself with the admission controller). Hence, on first run, the controller is not ready to instantiate the default PolicyServer:
$ helm install --wait -n kubewarden kubewarden-controller kubewarden/kubewarden-controller --debug
install.go:173: [debug] Original chart version: ""
install.go:190: [debug] CHART PATH: /home/vic/.cache/helm/repository/kubewarden-controller-0.3.3.tgz
client.go:122: [debug] creating 23 resource(s)
wait.go:53: [debug] beginning wait for 23 resources with timeout of 5m0s
wait.go:244: [debug] Deployment is not ready: kubewarden/kubewarden-controller. 0 out of 1 expected pods are ready
wait.go:244: [debug] Deployment is not ready: kubewarden/kubewarden-controller. 0 out of 1 expected pods are ready
wait.go:244: [debug] Deployment is not ready: kubewarden/kubewarden-controller. 0 out of 1 expected pods are ready
wait.go:244: [debug] Deployment is not ready: kubewarden/kubewarden-controller. 0 out of 1 expected pods are ready
wait.go:244: [debug] Deployment is not ready: kubewarden/kubewarden-controller. 0 out of 1 expected pods are ready
wait.go:244: [debug] Deployment is not ready: kubewarden/kubewarden-controller. 0 out of 1 expected pods are ready
wait.go:244: [debug] Deployment is not ready: kubewarden/kubewarden-controller. 0 out of 1 expected pods are ready
wait.go:244: [debug] Deployment is not ready: kubewarden/kubewarden-controller. 0 out of 1 expected pods are ready
wait.go:244: [debug] Deployment is not ready: kubewarden/kubewarden-controller. 0 out of 1 expected pods are ready
client.go:282: [debug] Starting delete for "default" PolicyServer
client.go:311: [debug] policyservers.policies.kubewarden.io "default" not found
client.go:122: [debug] creating 1 resource(s)
Error: failed post-install: warning: Hook post-install kubewarden-controller/templates/policyserver-default.yaml failed: Internal error occurred: failed calling webhook "mpolicyserver.kb.io": Post "https://kubewarden-controller-webhook-service.kubewarden.svc:443/mutate-policies-kubewarden-io-v1alpha2-policyserver?timeout=10s": dial tcp 10.43.246.134:443: connect: connection refused
helm.go:81: [debug] failed post-install: warning: Hook post-install kubewarden-controller/templates/policyserver-default.yaml failed: Internal error occurred: failed calling webhook "mpolicyserver.kb.io": Post "https://kubewarden-controller-webhook-service.kubewarden.svc:443/mutate-policies-kubewarden-io-v1alpha2-policyserver?timeout=10s": dial tcp 10.43.246.134:443: connect: connection refused
Maybe this is related: helm/helm/issues/3173
I've tried to simulate this issue on a bunch of different K3s versions, using a little script:
#!/bin/bash
VERSIONS=$(curl -s https://api.github.com/repos/k3s-io/k3s/releases | jq -r '.[] | select(.prerelease==false and .draft==false) | .name' | sed 's/+/-/' | sort)
echo $VERSIONS
for VERSION in $VERSIONS; do
    echo "###################################################"
    echo "##### Testing $VERSION ####"
    echo "###################################################"
    make K3S_IMAGE=rancher/k3s:$VERSION install-e2e-test
done
The install-e2e-test target runs your commands from the bug description:
.PHONY: delete-k8s-cluster
delete-k8s-cluster:
	- k3d cluster delete $(CLUSTER_NAME)

.PHONY: create-k8s-cluster
create-k8s-cluster: delete-k8s-cluster
	k3d cluster create $(CLUSTER_NAME) --wait --timeout $(TIMEOUT) --image $(K3S_IMAGE) --config $(RESOURCES_DIR)/k3d-default.yaml -v /dev/mapper:/dev/mapper
	$(call kube, wait --for=condition=Ready nodes --all)

.PHONY: install-e2e-test
install-e2e-test: CLUSTER_NAME := testing-cert-manager-helm-chart
install-e2e-test: create-k8s-cluster
	helm repo add jetstack https://charts.jetstack.io
	helm repo update
	helm install --wait cert-manager jetstack/cert-manager --create-namespace -n cert-manager --set installCRDs=true
	helm repo add kubewarden https://charts.kubewarden.io
	helm repo update
	helm install --create-namespace -n kubewarden kubewarden-crds kubewarden/kubewarden-crds
	helm install --wait -n kubewarden kubewarden-controller kubewarden/kubewarden-controller
What's the Helm version you are using? Furthermore, considering this is not a Helm issue, I'm thinking about how to work around it. What do you think about adding another post-install hook that waits until the kubewarden-controller is ready before trying to create the default policy server?
Another option I can think of is to create the default policy server resource outside of the post-install hook and update the controller to watch for pending policy server resources and deploy them as soon as possible.
I'm taking a look at the controller-runtime and kubewarden-controller code to understand when we mark the controller as ready. It turns out the controller always reports ready: we are using the Ping checker, which always returns true. Furthermore, controller-runtime starts the health probes in a goroutine before the goroutine that triggers the manager runnables. Thus, if Helm is fast enough, it could try to run the post-install hook before the controller is really ready.
I'm not 100% sure my rationale is right, because I do not have much experience with Kubernetes controllers. But I think we can try to improve our health check to ensure we mark the controller as ready only after the policy server controller starts running. However, I could not find an easy way to do that using the data available in the manager.
I'm ruling out the Helm issue mentioned before by @viccuad, simply because I could not simulate it and the bug mentioned should be fixed in the latest Helm versions.
What's the helm version you are using?
Here I have v3.5.4, which should contain the fix for that Helm issue. But someone reported a regression in v3.6.3; I haven't tried that yet.
But given your awesome later comment, José, I agree on improving the health check, or better, adding a readiness probe :).
I suppose the readiness probe needs to check that the controller manager has completely registered our controller with the admission controller, so that our controller itself is ready to start registering policy servers (which is what currently fails in the post-install hook, when it tries to register the default policy server).
Closing the card as discussed in planning meeting, as the scope is already achieved. Work continues in https://github.com/kubewarden/kubewarden-controller/issues/142.
I'm reopening this because what I described in https://github.com/kubewarden/kubewarden-controller/issues/110#issuecomment-1002220793 is not the actual issue, even though what I said there is still correct.
The problem here is a network issue. A bunch of different projects suffer from similar problems, like nginx-ingress and cert-manager, and the solution can change depending on the environment.
I could not simulate the issue in my local cluster, but I can see the problem using @viccuad's dotfiles repository[1]. This is not an easy issue to simulate; to help with that I wrote this script:
#!/bin/bash
minikube delete --all
minikube start --feature-gates=EphemeralContainers=true --extra-config=apiserver.v=5

CONTEXT=minikube
CRDS_CHART=kubewarden/kubewarden-crds
CONTROLLER_CHART=./kubewarden-controller
DATE=$(date +"%Y-%m-%d-%R")

kubectl --context $CONTEXT wait --for=condition=Ready nodes --all
kubectl --context $CONTEXT create namespace kubewarden
kubectl --context $CONTEXT apply -f https://github.com/jetstack/cert-manager/releases/latest/download/cert-manager.yaml
kubectl --context $CONTEXT wait --for=condition=Available deployment --timeout=2m -n cert-manager --all
helm install --kube-context $CONTEXT --wait -n kubewarden --create-namespace kubewarden-crds $CRDS_CHART

versions="controller-readiness-probe"
# This loop is no longer needed (the version is not used anymore); kept for legacy reasons.
for version in $versions; do
    echo ""
    log_file=controller-$version-$DATE.logs
    rm --force $log_file
    for num in {1..100}; do
        echo "Try #$num"
        echo "Try #$num" >> $log_file
        date >> $log_file
        helm uninstall --kube-context $CONTEXT --wait --timeout 1m -n kubewarden kubewarden-controller >> $log_file 2>&1
        helm install --kube-context $CONTEXT --wait -n kubewarden kubewarden-controller $CONTROLLER_CHART >> $log_file 2>&1 || break
    done
done

minikube logs > "minikube-$DATE.logs"
I still cannot find the exact root cause of the problem. However, I found a workaround for it. Well... let me rephrase this... My script does not fail anymore after the following change:
minikube start --feature-gates=EphemeralContainers=true --extra-config=apiserver.v=5 --cni=cilium
In other words, after changing the CNI everything runs fine. @viccuad , can you validate this in your environment as well?
I do not believe this is the best solution, though. We need to dig more into the networking space to, maybe, really fix the problem.
So, I think we can do the following:
@nunix mentioned that the same issue is happening on Windows with Rancher Desktop, right? There is an open issue on Rancher about a similar problem. Could it be related?
[1] I've removed the Ansible steps from the Vagrant file to simplify the deployment.
@flavio @viccuad @ereslibre any thoughts about https://github.com/kubewarden/kubewarden-controller/issues/110#issuecomment-1035524285?
I didn't have the time to check that. I found some issues with @viccuad on this environment when using k9s: Docker containers had connectivity but couldn't resolve names. We changed the resolv.conf on one of the containers to point to 8.8.8.8 and it could resolve fine. At that point containers started to appear as Running.
I don't have any comments about https://github.com/kubewarden/kubewarden-controller/issues/110#issuecomment-1035524285 to be honest.
My opinion is that it's a really hard problem to solve, because we don't know what is wrong and it is not easy to reproduce and track.
There is also the possibility that different problems have the same consequence. E.g., a myriad of problems can lead to a single connection refused error. To simplify: is something wrong on the client? On the server? Somewhere in between? What exactly, where, how...? You always see the same connection refused error, but literally a myriad of reasons can lead to it.
In my opinion there is not enough information and evidence to tackle this problem properly.
I vote to not tackle it, given the impact is extremely small. If the situation changes, the priority will rise and we will be in a better position to debug, identify, and fix it.
Did the resolv.conf change fix the issue of the kube API not being able to find the webhook service?
During my investigation I wrote a script which executes nslookup while reinstalling Kubewarden, and I could see issues in the name resolution. But I'm wondering why adding 8.8.8.8 would fix access to an internal cluster resource.
Did the resolv.conf change fix the issue of kube api not being able to find the webhook service?
We were looking because on his environment something was off, at least when using k3d (I think other envs as well). We were just debugging why images could not be pulled with k9s, not this issue specifically. What I'm trying to say is that on that environment there were some odd things.
I didn't try this reproducible Vagrant environment though, so sorry for the noise.
I think I know what's going on... Kubernetes services, by default, use the iptables proxy mode. This means that when a process tries to access a clusterIP, iptables rules redirect the packets to the right pod. These rules are written by kube-proxy, which watches the control plane for new services and endpoints.
My hypothesis is that kube-proxy is not fast enough to detect the changes and rewrite the iptables rules. Thus, when the kube API server tries to access the service, there is no route to the pod yet, causing the error. This also explains why after a (very short) time we can reach the service.
To validate this idea I retested using the script from https://github.com/kubewarden/kubewarden-controller/issues/110#issuecomment-1035524285 and changed the line where the minikube cluster is started to:
minikube start --feature-gates=EphemeralContainers=true --extra-config=apiserver.v=5 --extra-config 'kube-proxy.mode=userspace'
Changing the proxy mode to userspace makes the clusterIP point to kube-proxy itself, and kube-proxy then proxies the packets to the pod. It works! Again, at least my script is not failing anymore.
Considering that we already have two workarounds for this, I believe we can close this issue. Furthermore, if we want to improve the user experience, we may try to add some mechanism to detect this behavior during the Helm installation, as mentioned in https://github.com/kubewarden/kubewarden-controller/issues/110#issuecomment-1035524285.
@viccuad feel free to reopen if there is something missing.
Ah! Using kube-proxy in userspace mode is not recommended for production because it is slower. I just used it as an example of how to validate my hypothesis.
Thanks a lot @jvanz, this is really helpful! Also, I hit this problem one time on my environment when running the e2e tests several times, so I can also reproduce the issue to some extent :)
Great investigation work. Kudos!
Thanks indeed José for the investigation! Validated the userspace workaround here too, will go with that until I re-imagine my stack.
With the cert-manager helm chart:
Every time I try with those commands, on a fresh k3d cluster, the kubewarden-controller post-install hook with the default policy-server fails:
or
This doesn't happen to me if installing cert-manager via static install:
I suppose it's just timing on my system, but it's incredibly annoying. I looked at the manager ready probes and at the state of the resources after deploying our kubewarden-controller chart, and they seem fine. I don't see why it happens yet.