kubernetes / ingress-nginx

Ingress NGINX Controller for Kubernetes
https://kubernetes.github.io/ingress-nginx/
Apache License 2.0

Internal error occurred: failed calling webhook "validate.nginx.ingress.kubernetes.io" #5401

Closed aduncmj closed 2 years ago

aduncmj commented 4 years ago

Hi all,

When I apply the Ingress configuration file ingress-myapp.yaml with the command kubectl apply -f ingress-myapp.yaml, I get an error. The complete error is as follows:

Error from server (InternalError): error when creating "ingress-myapp.yaml": Internal error occurred: failed calling webhook "validate.nginx.ingress.kubernetes.io": Post https://ingress-nginx-controller-admission.ingress-nginx.svc:443/extensions/v1beta1/ingresses?timeout=30s: context deadline exceeded

This is my ingress:

apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: ingress-myapp
  namespace: default
  annotations: 
    kubernetes.io/ingress.class: "nginx"
spec:
  rules: 
  - host: myapp.magedu.com
    http:
      paths:
      - path: 
        backend: 
          serviceName: myapp
          servicePort: 80

Has anyone encountered this problem?

jhughes2112 commented 3 years ago

I had the same error as the original post. It popped up when applying a single manifest for ingress-nginx that includes everything from the namespace all the way through to the first Ingress declaration. There's a race condition where the ingress-nginx-controller-admission.ingress-nginx.svc pod is not yet ready for traffic. I re-applied the manifest and it worked fine. So at the very least, this error can pop up trivially, as it did in my case.
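
For anyone hitting the same race, a sketch of how to wait for the admission endpoint before applying Ingresses (the label selector matches the upstream manifests; adjust it if your install names things differently):

kubectl wait --namespace ingress-nginx \
  --for=condition=ready pod \
  --selector=app.kubernetes.io/component=controller \
  --timeout=120s
kubectl apply -f ingress-myapp.yaml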

mau21mau commented 3 years ago

Hi,

I have.

The validatingwebhook service is not reachable in my private GKE cluster. I needed to open port 8443 from the master to the pods. On top of that, I then received a certificate error on the endpoint, "x509: certificate signed by unknown authority". To fix this, I needed to include the caBundle from the generated secret in the validatingwebhookconfiguration.
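
For reference, a sketch of those two fixes, assuming a default install where both the webhook and the secret are named ingress-nginx-admission (the firewall rule name, network, master CIDR, and node tag are placeholders):

# Allow the GKE master to reach the webhook port on the nodes
gcloud compute firewall-rules create allow-master-to-webhook \
  --network my-cluster-network \
  --source-ranges 172.16.0.0/28 \
  --target-tags my-node-tag \
  --allow tcp:8443

# Copy the generated CA into the webhook so the certificate verifies
CA=$(kubectl -n ingress-nginx get secret ingress-nginx-admission -o jsonpath='{.data.ca}')
kubectl patch validatingwebhookconfiguration ingress-nginx-admission --type=json \
  -p="[{\"op\": \"add\", \"path\": \"/webhooks/0/clientConfig/caBundle\", \"value\": \"${CA}\"}]"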

A quick fix, if you don't want to do the above and have the webhook fully operational, is to remove the validatingwebhookconfiguration or set the failurePolicy to Ignore.
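
A sketch of the failurePolicy variant (webhook name assumed to be ingress-nginx-admission; note that with Ignore, invalid configs are no longer rejected):

# Fail open instead of deleting the webhook entirely
kubectl patch validatingwebhookconfiguration ingress-nginx-admission --type=json \
  -p='[{"op": "replace", "path": "/webhooks/0/failurePolicy", "value": "Ignore"}]'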

I believe some fixes are needed in the deploy/static/provider/cloud/deploy.yaml as the webhooks will not always work out of the box.

That worked for me. I put a more detailed answer here on StackOverflow

kunchalavikram1427 commented 3 years ago

I got the same error for path-based routing. I just swapped path and pathType to put pathType first for the first rule. If I do the same for the other routes, it does not work... Kindly check my example: https://github.com/kunchalavikram1427/Kubernetes_public/tree/master/ingress-demo/path-based

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: flask-ingress-rules
spec:
  rules:
  - host: 
    http:
      paths:
      - pathType: Prefix
        path: /
        backend:
          service:
            name: flask-hello
            port:
              number: 80
      - path: /vikram
        pathType: Prefix
        backend:
          service:
            name: flask-greet
            port:
              number: 80
      - path: /details
        pathType: Prefix
        backend:
          service:
            name: flask-details
            port:
              number: 80
cforce commented 3 years ago

For me, "kubectl delete validatingwebhookconfigurations public-nginx-ingress-admission" also works around it, but I think this is just a bad workaround. It doesn't get at what the root cause is or how I can solve it durably.

There is another yet closed issue related https://github.com/kubernetes/ingress-nginx/issues/6655

cforce commented 3 years ago

I did more digging and it seems the problem is due to the amount of ingresses we have. We have 219, so I think when it validates it checks existing ones as well causing it to fail intermittently when it cannot check all objects and it has no builtin retries on failure.

We have just 3 pods and have the same issue. I don't think it's related to the number of pods at all.

cforce commented 3 years ago

to open the 8443

Why should anyone need to open a port? I am using Azure Kubernetes, where the exposed ports are managed solely by the nginx deployment, so either the manifest should specify what is needed, or no additional manual steps should be necessary beyond what standard cloud/Kubernetes providers support.

cforce commented 3 years ago

I manually tested the admission webhook endpoint from a pod in a different namespace inside the same cluster.

app@kubernetes-ado-agent-754c584df8-9wdxw:/vsts$ curl -vvv https://public-nginx-ingress-controller-admission.ingress.svc:443/networking/v1beta1/ingresses?timeout=10s
*   Trying 10.0.100.245:443...
* TCP_NODELAY set
* connect to 10.0.100.245 port 443 failed: Connection refused
* Failed to connect to public-nginx-ingress-controller-admission.ingress.svc port 443: Connection refused
* Closing connection 0
curl: (7) Failed to connect to public-nginx-ingress-controller-admission.ingress.svc port 443: Connection refused

It's indeed a TCP/IP-level denial, not an HTTP-layer access issue. The IP is a cluster IP, and the api-server or any other pod in the same cluster should be able to reach the nginx admission service via that cluster IP, as long as the pod itself exposes the port. It seems to do so; this is what the service-public-nginx-ingress-controller-admission YAML says:

spec:
  clusterIP: 10.0.100.245
  ports:
  - name: https-webhook
    port: 443
    protocol: TCP
    targetPort: webhook
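
"Connection refused" on a cluster IP usually means there are no ready endpoints behind the Service; a sketch of how to check that, assuming the namespace from this install and the standard chart labels:

# An empty ENDPOINTS column means the controller pod is not Ready (or not selected)
kubectl -n ingress get endpoints public-nginx-ingress-controller-admission
kubectl -n ingress get pods -l app.kubernetes.io/component=controller
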
jw-s commented 3 years ago

Hey guys, I've also experienced this issue. After doing some debugging, it seems the admission controller just takes too long to respond. The webhook timeout is 10s, and in my case the ingress validation check (which internally constructs the whole to-be nginx config) takes longer than 10s, hence the timeout, or in this case "deadline exceeded". I don't have concrete evidence to back this statement up; I need to do some timing measurements to really find out. My suspicion is that the pod, and thus the container, has too few resources to carry out the required config generation in a timely manner, but again, these are assumptions.

Workaround: increase the timeout of the validatingwebhookconfiguration for the ingress controller.
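
A sketch of that workaround, assuming the default webhook name (30s is the maximum the API currently accepts):

kubectl patch validatingwebhookconfiguration ingress-nginx-admission --type=json \
  -p='[{"op": "replace", "path": "/webhooks/0/timeoutSeconds", "value": 30}]'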

cforce commented 3 years ago

timeout...

What is the exact controller YAML path of the timeout you mention? Meanwhile, I have set "controller.admissionWebhooks.enabled: false", as the webhook seems to behave unpredictably.

jw-s commented 3 years ago

timeout...

What is the exact controller YAML path of the timeout you mention? Meanwhile, I have set "controller.admissionWebhooks.enabled: false", as the webhook seems to behave unpredictably.

In the validatingwebhookconfiguration for the ingress controller

bboychev commented 3 years ago

Probably:

$ grep -H timeout ./ingress-nginx/templates/admission-webhooks/validating-webhook.yaml
./ingress-nginx/templates/admission-webhooks/validating-webhook.yaml:    {{- if .Values.controller.admissionWebhooks.timeoutSeconds }}
./ingress-nginx/templates/admission-webhooks/validating-webhook.yaml:    timeoutSeconds: {{ .Values.controller.admissionWebhooks.timeoutSeconds }}
$

Hint: The timeout value must (currently) be between 1 and 30 seconds :(
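
So with the chart, something like this should set it (the release and namespace names are assumptions):

helm upgrade ingress-nginx ingress-nginx/ingress-nginx \
  --namespace ingress-nginx \
  --reuse-values \
  --set controller.admissionWebhooks.timeoutSeconds=30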

fabiendelpierre commented 3 years ago

Is there guidance on working around this specifically in AKS? I saw mentions of allowing traffic on port 8443 from the Kubernetes control plane nodes to the worker nodes, but that suggestion applied to GKE, which works differently from AKS.

I've tested this with AKS and a virtual network setup with network security group rules allowing traffic from any to any bidirectionally and still hit this issue.

For the time being, I've simply disabled ValidatingAdmissionWebhook altogether, but that seems like the wrong approach.

MattJeanes commented 3 years ago

@fabiendelpierre I also use AKS and after a lot of troubleshooting this is unfortunately the conclusion I came to as well. The issue for me is that we use helm to deploy many sites at the same time (25 or so) and they all hit the validator at the same time and it just breaks.

We had some luck increasing the timeout for the call but ultimately ended up disabling it as it was causing far more deployment risk than an invalid config file might have, which is a shame.

fabiendelpierre commented 3 years ago

@MattJeanes I appreciate the feedback. I plan on keeping it disabled for the time being. Out of curiosity, what timeout value did you use that worked? I tested up to 30 seconds but it made no difference. Our current usage is way way simpler as we're not yet in prod with AKS so we're just deploying the ingress controller and then throwing a very simple hello-world app at it. So we don't have a case like yours where the validation webhook gets hit hard simultaneously by a bunch of things.

ilovemysillybanana commented 3 years ago

I am also having this issue on a fresh Minikube installation. Just thought I'd chime in, since most reports seem to come from GKE.

longwuyuan commented 3 years ago

/remove-kind support

vagdevik commented 3 years ago

I fixed this by using:

kubectl delete -A ValidatingWebhookConfiguration ingress-nginx-admission

danilomo commented 3 years ago

@vagdevik

Thanks, mate!!!!

sanzenwin commented 3 years ago

kubectl delete -A ValidatingWebhookConfiguration ingress-nginx-admission

This just conceals the issue; can anyone provide a better way? I am still facing this issue after disabling the firewall:

ufw status
Status: inactive
ghost commented 3 years ago

How I resolved this issue:

  1. kubectl delete -A ValidatingWebhookConfiguration ingress-nginx-admission
  2. kubectl get job -n ingress-nginx
  3. kubectl delete job ingress-nginx-admission-create ingress-nginx-admission-patch -n ingress-nginx
  4. Re-deploy the webhook stack (ValidatingWebhookConfiguration, ClusterRole, ClusterRoleBinding, Job, Role, RoleBinding)
  5. Wait for the job pods to complete:
     ingress-nginx-admission-create-tkfch 0/1 Completed 0 3m56s
     ingress-nginx-admission-patch-fwc86 0/1 Completed 0 3m56s
  6. Deploy the ingress: kubectl apply -f echo-sever.txt
     ingress.networking.k8s.io/echo created

Note: these steps worked on ingress controller v0.34.1.

mel1nn commented 3 years ago

Hi,

I have.

The validatingwebhook service is not reachable in my private GKE cluster. I needed to open port 8443 from the master to the pods. On top of that, I then received a certificate error on the endpoint, "x509: certificate signed by unknown authority". To fix this, I needed to include the caBundle from the generated secret in the validatingwebhookconfiguration.

A quick fix, if you don't want to do the above and have the webhook fully operational, is to remove the validatingwebhookconfiguration or set the failurePolicy to Ignore.

I believe some fixes are needed in the deploy/static/provider/cloud/deploy.yaml as the webhooks will not always work out of the box.

Agree with that. By setting the ca from the nginx-ingress-controller-ingress-nginx-admission secret in the caBundle field of the ValidatingWebhookConfiguration, it works.

Why is this field not set by default by the nginx-ingress-controller-ingress-nginx-admission-create Job? @aledbf

turgutsaricam commented 3 years ago

If you have this issue with Minikube

What actually worked for me was to remove everything related to the Ingress controller: the ingress-nginx-controller-admission service, the ingress-nginx-controller deployment, and the related Docker images. Some of these steps might not be necessary, I do not know. I just removed everything and let Minikube download the Docker images and create the controller again.

# Remove the Ingress-related service and deployment
kubectl delete svc ingress-nginx-controller-admission
kubectl delete deployment ingress-nginx-controller

# Disable the ingress addon
minikube addons disable ingress

# Remove the Docker images
eval $(minikube -p minikube docker-env)
docker image ls | grep ingress # Find the Ingress-related Docker images
docker image rm <IDs of Ingress-related images>

# Enable the Ingress addon
minikube addons enable ingress

After these steps, Ingress worked as it should. I also restarted Minikube afterwards to make sure the issue did not come back; after the restart, everything still worked as it should.

I am not saying this will definitely work for you. I simply wanted to share what worked for me with you.

strongjz commented 3 years ago

There were some issues fixed with https://github.com/kubernetes/ingress-nginx/pull/7255

Can you try v0.48.1 and report back if there is still an issue?

vladimirkhs commented 3 years ago

There were some issues fixed with #7255

Can you try v0.48.1 and report back if there is still an issue?

Stumbled upon this issue with controller v0.48.1. Solved by rolling back kube-webhook-certgen to v1.2.2 (the problem was on version 1.5.1).

Tibor17 commented 3 years ago

@strongjz After upgrading the addon, the ingress did not fail.

minikube addons disable ingress
minikube addons enable ingress --images="IngressController=ingress-nginx/controller:v0.48.1"

Prior to that, I updated Minikube and k8s too, using minikube-upgrade.sh 1.22.0. I use the latest versions and did NOT roll back kube-webhook-certgen.

#! /bin/sh

# Minikube update script file /root/minikube-upgrade.sh 1.22.0

minikube delete && \
rm -rf /usr/local/bin/minikube && \
curl -LO "https://github.com/kubernetes/minikube/releases/download/v$1/minikube_$1-0_amd64.deb" && \
chmod +x "minikube_$1-0_amd64.deb" && \
dpkg -i "minikube_$1-0_amd64.deb" && \
rm "minikube_$1-0_amd64.deb" && \
minikube config set driver none && \
minikube start --force=true --driver=none &&\

# Enabling addons: ingress, dashboard
minikube addons enable ingress && \
minikube addons enable dashboard && \
minikube addons enable metrics-server && \
# Showing enabled addons
echo '\n\n\033[4;33m Enabled Addons \033[0m' && \
minikube addons list | grep STATUS && minikube addons list | grep enabled && \

# Showing current status of Minikube
echo '\n\n\033[4;33m Current status of Minikube \033[0m' && minikube status
echo '\n\n\033[4;33m Installed version of Minikube \033[0m' && minikube version

Verify the addon:

$ minikube addons list | grep ingress
| ingress                     | minikube | enabled ✅   | unknown (third-party) |
| ingress-dns                 | minikube | disabled     | unknown (third-party) |
$ kubectl get pods -n ingress-nginx
NAME                                        READY   STATUS      RESTARTS   AGE
ingress-nginx-admission-create-gd5gr        0/1     Completed   0          107s
ingress-nginx-admission-patch-vzfc9         0/1     Completed   1          107s
ingress-nginx-controller-65ccdd7598-nqmr5   1/1     Running     0          107s
$ kubectl exec -it ingress-nginx-controller-65ccdd7598-nqmr5  -n ingress-nginx -- /nginx-ingress-controller --version
NGINX Ingress controller
  Release:       v0.48.1
  Build:         30809c066cd027079cbb32dccc8a101d6fbffdcb
  Repository:    https://github.com/kubernetes/ingress-nginx
  nginx version: nginx/1.20.1

Then verify the upgrade by using the test.

iamNoah1 commented 3 years ago

@Tibor17 @vladimirkhs @strongjz can we then consider this issue as obsolete?

IAXES commented 3 years ago

@iamNoah1 Hmmm, maybe not yet: I'm reproducing this lately with KIND (i.e. v1.21 release of kindest). My only workaround at the moment is purging the web hook via kubectl delete -A ValidatingWebhookConfiguration ingress-nginx-admission. I'll post a minimal working example this evening.

willzhang commented 3 years ago

@aduncmj I found this solution https://stackoverflow.com/questions/61365202/nginx-ingress-service-ingress-nginx-controller-admission-not-found

kubectl delete -A ValidatingWebhookConfiguration ingress-nginx-admission

good

beefsack commented 3 years ago

I fixed this by using:

kubectl delete -A ValidatingWebhookConfiguration ingress-nginx-admission

We were able to remove our dependency on this workaround from our development environments over the past couple of weeks, so the issue is resolved for our use case.

iamNoah1 commented 3 years ago

@IAXES can you maybe elaborate a little bit more on your environment? Which version of ingress-nginx did you use? Which version of K8s is it, and what is your use case? How can we reproduce it? Also, did you consider using newer versions of ingress-nginx?

ahern2018 commented 3 years ago

I modified the configuration of the ValidatingWebhookConfiguration, mainly the scope of the apiGroups:

apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  labels:
    helm.sh/chart: ingress-nginx-3.23.0
    app.kubernetes.io/name: ingress-nginx
    app.kubernetes.io/instance: ingress-nginx
    app.kubernetes.io/version: 0.44.0
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/component: admission-webhook
  name: ingress-nginx-admission
webhooks:
  - name: validate.nginx.ingress.kubernetes.io
    matchPolicy: Equivalent
    rules:
      - apiGroups:
          - ""
        apiVersions:
          - v1beta1
        operations:
          - CREATE
          - UPDATE
        resources:
          - ingresses
    failurePolicy: Fail
    sideEffects: None
    admissionReviewVersions:
      - v1
      - v1beta1
    clientConfig:
      service:
        namespace: ingress-nginx
        name: ingress-nginx-controller-admission
        path: /networking/v1beta1/ingresses
matteovivona commented 3 years ago

In my GKE cluster I've manually increased timeoutSeconds to 30.

You can do it via Helm:

controller:
  admissionWebhooks:
    enabled: true
    timeoutSeconds: 45
willzhang commented 3 years ago

I installed with the old version 3.35.0 because my k8s version is v1.18.3:

helm install nginx-ingress ingress-nginx/ingress-nginx \
  --namespace ingress-nginx \
  --create-namespace \
  --version 3.35.0 \
  --set controller.image.registry=willdockerhub \
  --set controller.image.image=ingress-nginx-controller \
  --set controller.image.tag=v0.48.1 \
  --set controller.image.digest="" \
  --set controller.hostNetwork=true \
  --set controller.kind=DaemonSet \
  --set controller.service.type=ClusterIP \
  --set controller.hostPort.enable=true \
  --set controller.hostPort.http=80 \
  --set controller.hostPort.https=443 \
  --set controller.nodeSelector.node=ingress

The error:

[root@disaster-cluster nginx-app]# kubectl -n ns-panda apply -f ingress.yaml 
Error from server (InternalError): error when creating "ingress.yaml": Internal error occurred: failed calling webhook "validate.nginx.ingress.kubernetes.io": Post https://nginx-ingress-ingress-nginx-controller-admission.ingress-nginx.svc:443/networking/v1beta1/ingresses?timeout=10s: context deadline exceeded

This workaround worked: https://github.com/kubernetes/ingress-nginx/issues/5401#issuecomment-662424306

[root@disaster-cluster nginx-app]# kubectl get ValidatingWebhookConfiguration
NAME                                    WEBHOOKS   AGE
nginx-ingress-ingress-nginx-admission   1          3h13m

[root@disaster-cluster nginx-app]#  kubectl delete -A ValidatingWebhookConfiguration nginx-ingress-ingress-nginx-admission
validatingwebhookconfiguration.admissionregistration.k8s.io "nginx-ingress-ingress-nginx-admission" deleted

[root@disaster-cluster nginx-app]#  kubectl apply -f ingress.yaml -n ns-panda
ingress.networking.k8s.io/wordpress created
shakaib-arif commented 3 years ago

Environment Detail:

Output from ingress controller pod

bash-5.1$ /nginx-ingress-controller --version

NGINX Ingress controller
  Release:       v1.0.0
  Build:         041eb167c7bfccb1d1653f194924b0c5fd885e10
  Repository:    https://github.com/kubernetes/ingress-nginx
  nginx version: nginx/1.20.1

- I'm using 2 ingresses with a single controller, and each of them has different domains and sub-domains. They are configured with cert-manager for SSL.

Error Log:
---
- kubectl log:

Error from server (InternalError): error when creating "ingress/filename.yaml": Internal error occurred: failed calling webhook "validate.nginx.ingress.kubernetes.io": Post "https://nginx-ingress-ingress-nginx-controller-admission.development.svc:443/networking/v1/ingresses?timeout=10s": context deadline exceeded

- Ingress log:

E1018 09:15:09.355840 7 server.go:82] "failed to process webhook request" err="rejecting admission review because the request does not contain an Ingress resource but networking.k8s.io/v1beta1, Kind=Ingress with name ingressName in namespace development"
E1018 09:15:09.413602 7 server.go:82] "failed to process webhook request" err="rejecting admission review because the request does not contain an Ingress resource but networking.k8s.io/v1beta1, Kind=Ingress with name ingressName in namespace development"

My Resolution
----
In my AKS cluster, **I have increased the timeout** to `timeoutSeconds: 30`.

Thanks @tehKapa, for your comment it saved my day [#5401 (comment)](https://github.com/kubernetes/ingress-nginx/issues/5401#issuecomment-911308097)

- kubectl log:
`ingress.networking.k8s.io/ingressName configured`

- Ingress log:

I1018 09:26:47.467258 7 main.go:101] "successfully validated configuration, accepting" ingress="ingressName/development"
I1018 09:26:47.477962 7 event.go:282] Event(v1.ObjectReference{Kind:"Ingress", Namespace:"development", Name:"ingressName", UID:"8d8ae8ef-e33c-4a2c-8309-b98550f69e1d", APIVersion:"networking.k8s.io/v1", ResourceVersion:"10524872", FieldPath:""}): type: 'Normal' reason: 'Sync' Scheduled for sync
I1018 09:26:47.478136 7 controller.go:150] "Configuration changes detected, backend reload required"
I1018 09:27:29.193672 7 controller.go:167] "Backend successfully reloaded"

strongjz commented 2 years ago

How many ingress objects are in the cluster? That can cause timeouts if there are large amounts of objects.

iamNoah1 commented 2 years ago

@shakaib-arif friendly reminder for answering the question of @strongjz :)

pieveee commented 2 years ago

Going through this bloated thread and it seems there is no "general" solution yet?

Error from server (InternalError): error when creating "ingress-demo.yaml": Internal error occurred: failed calling webhook "validate.nginx.ingress.kubernetes.io": Post "https://ingress-nginx-controller-admission.ingress-nginx.svc:443/networking/v1/ingresses?timeout=10s": context deadline exceeded

I have the same issue in my bare-metal environment running Debian 11 with UFW. Is deleting the ValidatingWebhookConfiguration considered safe? Moreover, disabling ufw makes no difference.

shakaib-arif commented 2 years ago

How many ingress objects are in the cluster? That can cause timeouts if there are large amounts of objects.

@strongjz - In my cluster, I have 3 DNS configured using 6 ingress objects through a single ingress controller.

xeor commented 2 years ago

I have a cluster where I use renovate to keep versions up to date. I redeploy the cluster almost every day, but in the last couple of days, maybe a week, I have started seeing this error as well: both the mentioned certificate issue and a timeout. Since this started recently, the issue might have gotten worse in the last couple of releases.

I have ~25 ingresses and run the ingress-controller as a daemonset

anuj0701 commented 2 years ago

It seems there are multiple errors for "failed calling webhook "validate.nginx.ingress.kubernetes.io": Post "https://ingress-nginx-controller-admission.ingress-nginx.svc:443/networking/v1/ingresses?timeout=10s":

Like below:

  1. context deadline exceeded
  2. x509: certificate signed by unknown authority
  3. Temporary Redirect
  4. EOF
  5. no endpoints available for service "ingress-nginx-controller-admission" ...and many more.

The one I'm facing as soon as I apply the ingress resource file (rules file) is: Internal error occurred: failed calling webhook "validate.nginx.ingress.kubernetes.io": Post "https://ingress-nginx-controller-admission.ingress-nginx.svc:443/networking/v1/ingresses?timeout=30s": EOF

After which the ingress controller gets restarted, as below:

NAME                                        READY   STATUS             RESTARTS   AGE
ingress-nginx-controller-5cf97b7d74-zvrr6   1/1     Running            6          30m
ingress-nginx-controller-5cf97b7d74-zvrr6   0/1     OOMKilled          6          30m
ingress-nginx-controller-5cf97b7d74-zvrr6   0/1     CrashLoopBackOff   6          30m
ingress-nginx-controller-5cf97b7d74-zvrr6   0/1     Running            7          31m
ingress-nginx-controller-5cf97b7d74-zvrr6   1/1     Running            7          32m
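
Given the OOMKilled status above, one avenue is raising the controller's memory limits, e.g. via the chart's controller.resources values; a sketch (the release/namespace names and the numbers are illustrative, not a recommendation):

helm upgrade ingress-nginx ingress-nginx/ingress-nginx \
  --namespace ingress-nginx \
  --reuse-values \
  --set controller.resources.requests.memory=256Mi \
  --set controller.resources.limits.memory=512Mi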

One possible solution (though I'm not sure) is mentioned here: https://stackoverflow.com/a/69289313/12241977

But I'm not sure it would work with managed Kubernetes services like AWS EKS, as we don't have access to the kube-apiserver.

Also, the "kind: ValidatingWebhookConfiguration" section of https://raw.githubusercontent.com/kubernetes/ingress-nginx/controller-v1.1.0/deploy/static/provider/baremetal/deploy.yaml has the field below:

clientConfig:
  service:
    namespace: ingress-nginx
    name: ingress-nginx-controller-admission
    path: /networking/v1/ingresses

So what does "path: /networking/v1/ingresses" do, and where does it reside? Simply put, where can we find this path?

aldemira commented 2 years ago

I'm having the same issue on a bare-metal Debian 11 installation with flannel 1.15.1, and it turns out this is a problem with flannel itself. Switching from vxlan to host-gw did it for me. Here is the issue I created: https://github.com/flannel-io/flannel/issues/1511

phxism commented 2 years ago

Same issue when deploying with NodePort, following https://kubernetes.github.io/ingress-nginx/deploy/#bare-metal-clusters. I compared the NodePort manifest with the Cloud (LoadBalancer) one and found one small difference: cloud/deploy.yaml has the `- --publish-service=$(POD_NAMESPACE)/ingress-nginx-controller` arg. I added this to the bare-metal yaml, then ran kubectl apply -f ..., and everything works fine!
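
If you would rather patch a live install than edit the manifest, a sketch (the deployment name and container layout follow the upstream bare-metal manifest):

# Append the missing flag to the controller's args
kubectl -n ingress-nginx patch deployment ingress-nginx-controller --type=json \
  -p='[{"op": "add", "path": "/spec/template/spec/containers/0/args/-", "value": "--publish-service=$(POD_NAMESPACE)/ingress-nginx-controller"}]'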

alcidesmig commented 2 years ago

@iamNoah1 Hmmm, maybe not yet: I'm reproducing this lately with KIND (i.e. v1.21 release of kindest). My only workaround at the moment is purging the web hook via kubectl delete -A ValidatingWebhookConfiguration ingress-nginx-admission. I'll post a minimal working example this evening.

For me, this helped solve the problem. My cluster had two ValidatingWebhookConfigurations (due to one wrong installation), and deleting the outdated one solved the issue. Thank you.

$ kubectl get ValidatingWebhookConfiguration -A
NAME                          WEBHOOKS   AGE
ingress-nginx-admission       1          87s
tst-ingress-nginx-admission   1          27d
$ kubectl delete ValidatingWebhookConfiguration tst-ingress-nginx-admission -A
validatingwebhookconfiguration.admissionregistration.k8s.io "tst-ingress-nginx-admission" deleted
xeor commented 2 years ago

It looks like the ValidatingWebhookConfiguration is missing some client config. https://github.com/kubernetes/ingress-nginx/issues/5968#issuecomment-849772666 fixes it without deleting anything, by getting the CA bundle and adding it to /webhooks/0/clientConfig/caBundle.

The function that should do this is at https://github.com/kubernetes/ingress-nginx/blob/54523641a89a2b026180eb1e779152b8e939b11a/images/kube-webhook-certgen/rootfs/pkg/k8s/k8s.go#L110 and there is a comment there saying "Intentionally don't wrap error here to preserve old behavior and be able to log both original error and a message"... Does that mean the patching can fail and it won't really care?

danieldavies99 commented 2 years ago

I encountered this issue after updating from kubernetes v1.21 to kubernetes v1.22

For me, the issue was tied to cert gen; specifically, I was using docker.io/jettech/kube-webhook-certgen:v1.5.1, which relies on the old v1beta1 API.

The fix for me was to replace all instances of docker.io/jettech/kube-webhook-certgen:v1.5.1 in my helm chart with k8s.gcr.io/ingress-nginx/kube-webhook-certgen:v1.1.1@sha256:64d8c73dca984af206adf9d6d7e46aa550362b1d7a01f3a0a91b20cc67868660

related issue: https://github.com/kubernetes/ingress-nginx/issues/7418

(I am running my config locally using Docker Desktop; I also had to purge all my containers, reset my Kubernetes cluster, and then run the helm install command again for the changes to take effect.)

My deployment file now looks like this:

apiVersion: batch/v1
kind: Job
metadata:
  name: ingress-nginx-admission-patch
  annotations:
    helm.sh/hook: post-install,post-upgrade
    helm.sh/hook-delete-policy: before-hook-creation,hook-succeeded
  labels:
    helm.sh/chart: ingress-nginx-4.0.15
    app.kubernetes.io/name: ingress-nginx
    app.kubernetes.io/instance: ingress-nginx
    app.kubernetes.io/version: 1.1.1
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/component: admission-webhook
  namespace: {{ .Values.namespace }}
spec:
  template:
    metadata:
      name: ingress-nginx-admission-patch
      labels:
        helm.sh/chart: ingress-nginx-4.0.15
        app.kubernetes.io/name: ingress-nginx
        app.kubernetes.io/instance: ingress-nginx
        app.kubernetes.io/version: 1.1.1
        app.kubernetes.io/managed-by: Helm
        app.kubernetes.io/component: admission-webhook
    spec:
      containers:
        - name: patch # line below is what I had to change, old value: docker.io/jettech/kube-webhook-certgen:v1.5.1
          image: k8s.gcr.io/ingress-nginx/kube-webhook-certgen:v1.1.1@sha256:64d8c73dca984af206adf9d6d7e46aa550362b1d7a01f3a0a91b20cc67868660
          imagePullPolicy: IfNotPresent
          args:
            - patch
            - --webhook-name=ingress-nginx-admission
            - --namespace=$(POD_NAMESPACE)
            - --patch-mutating=false
            - --secret-name=ingress-nginx-admission
            - --patch-failure-policy=Fail
          env:
            - name: POD_NAMESPACE
              valueFrom:
                fieldRef:
                  fieldPath: metadata.namespace
      restartPolicy: OnFailure
      serviceAccountName: ingress-nginx-admission
      securityContext:
        runAsNonRoot: true
        runAsUser: 2000
strongjz commented 2 years ago

@danieldavies99 Did you do an upgrade and it failed? All the new charts and static deploys should be using the ingress-nginx maintained kube-webhook-certgen image now.

https://github.com/kubernetes/ingress-nginx/blob/main/charts/ingress-nginx/values.yaml#L623

den-is commented 2 years ago

I've got the very same issue on a clean and fresh EKS 1.21 install without any addons, CNI, NetworkPolicies, firewalls, etc. The same nginx-ingress works on my test k3d setup. I've tested a couple of more recent and older versions of the ingress-controller; none worked on EKS. Increasing the request timeout does not help. Removing the ValidatingWebhookConfiguration helps.

But IMHO it's not normal to just delete something to get things working. I can't find the exact root cause of this problem in any of the threads either.

Why are pods in the same namespace not able to communicate with the ingress-controller admission webhook?

curl -v https://ingress-nginx-controller-admission.shared-qc.svc:8443/networking/v1/ingresses
*   Trying 172.20.214.26:8443...
longwuyuan commented 2 years ago

Can you post information, commands, and outputs showing that the security groups or host OS packet filtering are not blocking the required ports?

den-is commented 2 years ago

Can you post information, commands, and outputs showing that the security groups or host OS packet filtering are not blocking the required ports?

I was able to find the cause of my issue. The fresh EKS cluster was built using the latest community-supported terraform-eks module, v18.2. In v18 the maintainers made the security group rules much stricter, allowing only specific k8s port communications.

Allowing all traffic from the masters to the worker nodes made things work as intended. (Well, for this ingress you at least need to allow port 8443.)
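
For reference, a sketch of opening just the webhook port with the AWS CLI instead of allowing all traffic (both security group IDs are placeholders: the worker node SG and the cluster/control-plane SG):

aws ec2 authorize-security-group-ingress \
  --group-id sg-0123456789abcdef0 \
  --protocol tcp \
  --port 8443 \
  --source-group sg-0fedcba9876543210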

longwuyuan commented 2 years ago

Feel free to reopen this, if a bug or a problem is proven, using debug data. /close