kubernetes-sigs / aws-load-balancer-controller

A Kubernetes controller for Elastic Load Balancers
https://kubernetes-sigs.github.io/aws-load-balancer-controller/
Apache License 2.0

aws-load-balancer-webhook-service error for non alb Ingresses #2071

Closed gazal-k closed 2 years ago

gazal-k commented 3 years ago

We are trying to migrate from ingress-nginx to aws-load-balancer-controller. We are starting by just installing the controller chart. The plan is to template our applications to use the new ingress.class alb and then migrate them.

But after installing aws-load-balancer-controller, we are seeing errors on our existing applications like:

cannot patch "app1-ingress" with kind Ingress: Internal error occurred: failed calling webhook "vingress.elbv2.k8s.aws": Post https://aws-load-balancer-webhook-service.kube-system.svc:443/validate-networking-v1beta1-ingress?timeout=10s: x509: certificate signed by unknown authority (possibly because of "crypto/rsa: verification error" while trying to verify candidate authority certificate "aws-load-balancer-controller-ca")

app1-ingress still uses kubernetes.io/ingress.class: nginx. Can we skip the webhook from modifying those?

M00nF1sh commented 3 years ago

@gazal-k It seems the webhook's CA is configured incorrectly. If you installed from helm, did you revert the webhookConfiguration/cert Secret and cause them to drift?

The webhook for Ingress merely validates changes (for the security features we included with IngressClassParams support), and it simply permits any changes to Ingresses using other IngressClasses.

It's fine to delete the ValidatingWebhookConfiguration if this is a single-cluster setup, but ideally you should fix the certificate.
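
For reference, removing it is a one-liner (a sketch; aws-load-balancer-webhook is the default name used by the helm chart, as confirmed later in this thread):

kubectl delete validatingwebhookconfiguration aws-load-balancer-webhook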

gazal-k commented 3 years ago

I'm not entirely sure what may have happened, because it wasn't me personally who installed the chart. @anilkumarpasupuleti do you know if we reverted the webhookConfiguration/cert Secret during installation?

I just checked the installation instructions now. It's not a straight helm install, is it? Do the CRDs have to be installed manually first? Could skipping that cause this issue?
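
For context, the documented route is roughly two steps (a sketch based on the chart's install docs; clusterName and the pre-created service account are placeholders/assumptions):

# CRDs first, then the chart itself
kubectl apply -k "github.com/aws/eks-charts/stable/aws-load-balancer-controller//crds?ref=master"
helm repo add eks https://aws.github.io/eks-charts
helm upgrade --install aws-load-balancer-controller eks/aws-load-balancer-controller \
  -n kube-system \
  --set clusterName=my-cluster \
  --set serviceAccount.create=false \
  --set serviceAccount.name=aws-load-balancer-controller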

gazal-k commented 3 years ago

Is this the Secret we are talking about: https://github.com/aws/eks-charts/blob/master/stable/aws-load-balancer-controller/templates/webhook.yaml#L131-L159 ?

gazal-k commented 3 years ago

We also checked the somewhat related #1909 and verified that the caBundle in the webhooks matches the Secret aws-load-balancer-tls.

gazal-k commented 3 years ago

It's fine to delete the ValidatingWebhookConfiguration if this is a single-cluster setup, but ideally you should fix the certificate.

Is that validatingwebhookconfigurations/aws-load-balancer-webhook? I'm not sure what you mean by "single-cluster setup". We plan on installing aws-load-balancer-controller on each of our clusters; I wasn't aware a single installation could manage multiple clusters (not that we'd need that anyway).

anilkumarpasupuleti commented 3 years ago

This issue does not exist on chart version 1.1.6; this is on k8s v1.18.

kishorj commented 3 years ago

@anilkumarpasupuleti, the validating webhook is included in chart v1.2.0 (app v2.2.0) and later.

gazal-k commented 3 years ago

We got the error on chart version 1.2.0. We tried downgrading and, obviously, with no validating webhook there are no errors.

Anil was just mentioning that we're trying all of this in EKS 1.18. (not to be confused with the chart version)

kishorj commented 3 years ago

@gazal-k could you compare the caBundle configured in your validating webhook aws-load-balancer-webhook and the secret aws-load-balancer-tls?

caBundle configured in webhook

kubectl get validatingwebhookconfigurations.admissionregistration.k8s.io aws-load-balancer-webhook -ojsonpath={.webhooks[0].clientConfig.caBundle}  | base64 -d  | openssl x509 -noout -text

CA certificate from the secret

kubectl get secret -n kube-system  aws-load-balancer-tls -ojsonpath="{.data.ca\.crt}" |base64 -d   |openssl x509 -noout -text

Also verify the webhook certificate is issued by the configured CA

kubectl get secret -n kube-system  aws-load-balancer-tls -ojsonpath="{.data.tls\.crt}" |base64 -d   |openssl x509 -noout -text
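
A quick way to compare the two is to diff the decoded outputs directly; identical CA bundles produce no output (a sketch reusing the same names as the commands above):

diff \
  <(kubectl get validatingwebhookconfigurations aws-load-balancer-webhook -o jsonpath='{.webhooks[0].clientConfig.caBundle}' | base64 -d) \
  <(kubectl get secret -n kube-system aws-load-balancer-tls -o jsonpath='{.data.ca\.crt}' | base64 -d)
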
gazal-k commented 3 years ago

I was able to verify that installing & updating apps using other ingress controllers with helm & helmfile works just fine. It seems a helmfile execution from our CI/CD pipeline is where we are seeing the error. This might be a tooling issue. We'll try updating our pipeline containers with newer versions of helm & helmfile & see if that helps.

Thanks @kishorj & @M00nF1sh

darrenwhighamfd commented 3 years ago

I have also had this issue when using Polaris https://artifacthub.io/packages/helm/fairwinds-stable/polaris on version 4.0.4

Error: UPGRADE FAILED: cannot patch "polaris" with kind Ingress: Internal error occurred: failed calling webhook "vingress.elbv2.k8s.aws": Post "https://aws-load-balancer-webhook-service.kube-system.svc:443/validate-networking-v1beta1-ingress?timeout=10s": x509: certificate signed by unknown authority (possibly because of "crypto/rsa: verification error" while trying to verify candidate authority certificate "aws-load-balancer-controller-ca")

We use cert-manager as detailed in the Polaris documentation and are currently running EKS 1.20.

M00nF1sh commented 3 years ago

@darrenwhighamfd Is this issue still happening in your cluster? Have you checked whether the cert stored in the secret matches the CA configured in the aws-load-balancer-webhook ValidatingWebhookConfiguration?

DmitriyStoyanov commented 3 years ago

Faced the same issue, with 2 ingress controllers installed on one cluster: aws-load-balancer-controller and nginx-ingress controller. We have automated CI that updates to new helm chart versions every week, and after one such upgrade, creating ingresses with the annotation kubernetes.io/ingress.class: "nginx" started failing: the aws-load-balancer validation webhook began handling those requests and returned the certificate error.

Error: UPGRADE FAILED: cannot patch "kube-prometheus-stack-grafana" with kind Ingress: Internal error occurred: failed calling webhook "vingress.elbv2.k8s.aws": Post "https://aws-load-balancer-webhook-service.kube-system.svc:443/validate-networking-v1beta1-ingress?timeout=10s": x509: certificate signed by unknown authority (possibly because of "crypto/rsa: verification error" while trying to verify candidate authority certificate "aws-load-balancer-controller-ca")

Fixed the issue by uninstalling the aws-load-balancer-controller helm chart and installing it again.

Hm, just now I see that the issue still occurs spontaneously when deploying some helm charts with an Ingress. Triggering the upgrade again, it deploys successfully. So the issue still exists. Looks like the webhook validation does not check annotations such as kubernetes.io/ingress.class or ingressClass in the Ingress spec.

darrenwhighamfd commented 3 years ago

@M00nF1sh yes, we still have this issue; currently we have had to revert to 1.1.6. Similar to @DmitriyStoyanov, the issue does not happen every time but spontaneously, roughly one out of every three or so runs. The cert and secret look to match.

gazal-k commented 3 years ago

I guess that's our experience too. Maybe it wasn't the tooling; the issue was just sporadic. We had verified that the certs and secrets match: https://github.com/kubernetes-sigs/aws-load-balancer-controller/issues/2071#issuecomment-862869696

DmitriyStoyanov commented 3 years ago

Ha, found that uninstalling the helm chart does not remove the secrets.

aws-externaldns-token-wkqpc                      kubernetes.io/service-account-token   3      177d
aws-load-balancer-controller-token-spwhd         kubernetes.io/service-account-token   3      177d

So I've tried again

helm uninstall -n kube-system aws-load-balancer-controller
k delete secret -n kube-system aws-load-balancer-controller-token-spwhd
k delete secret -n kube-system aws-externaldns-token-wkqpc

and then installed it again; by the way, it created 3 new secrets:

aws-externaldns-token-49mpg                          kubernetes.io/service-account-token   3      72s
aws-load-balancer-controller-token-rzcvj             kubernetes.io/service-account-token   3      86s
aws-load-balancer-tls                                kubernetes.io/tls                     3      49s

Possibly this will help. After that, deployment of our helm chart with an ingress worked fine. I don't know whether this fixes the whole problem or not.

kishorj commented 3 years ago

@DmitriyStoyanov, @gazal-k, @darrenwhighamfd if you are able to reproduce the issue, could you please create a support ticket with AWS support with cluster ARN?

acim commented 2 years ago

Same problem here:

Error: UPGRADE FAILED: cannot patch "merged" with kind Ingress: Internal error occurred: failed calling webhook "vingress.elbv2.k8s.aws": Post "https://aws-load-balancer-webhook-service.kube-system.svc:443/validate-networking-v1beta1-ingress?timeout=10s": x509: certificate signed by unknown authority (possibly because of "crypto/rsa: verification error" while trying to verify candidate authority certificate "aws-load-balancer-controller-ca")

clayvan commented 2 years ago

So I think I found my issue, and possibly what others here are seeing.

@DmitriyStoyanov was close in saying

Looks like the webhook validation does not check annotations such as kubernetes.io/ingress.class or ingressClass in the Ingress spec.

But maybe the key here is that the Ingress spec uses ingressClassName not ingressClass.

The documentation https://kubernetes-sigs.github.io/aws-load-balancer-controller/v2.2/guide/ingress/ingress_class/#deprecated-kubernetesioingressclass-annotation appears to be specific to the alb annotation itself still being supported, but if you're using a secondary ingress controller (nginx or traefik), the annotation for those will cause the error reported above.

So the "fix" is to update any Ingress objects that are using the annotation style like kubernetes.io/ingress.class: traefik to using the new IngressClass spec like:

apiVersion: networking.k8s.io/v1beta1
kind: Ingress
metadata:
  name: default-backend-traefik
  namespace: ingress
spec:
  ingressClassName: traefik

and also define your IngressClass object.

apiVersion: networking.k8s.io/v1beta1
kind: IngressClass
metadata:
  name: traefik
spec:
  controller: traefik.io/ingress-controller

If I had to guess, the AWS LB controller is not checking the ingress class annotation, and is therefore treating your traefik Ingress objects as ALB class, which causes the error.
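
One related sanity check (a sketch; the is-default-class annotation is standard upstream Kubernetes) is to see whether any IngressClass in the cluster is marked as the default, since a default class is applied to Ingresses that specify neither the annotation nor ingressClassName:

kubectl get ingressclass
kubectl get ingressclass -o jsonpath='{range .items[*]}{.metadata.name}{": "}{.metadata.annotations.ingressclass\.kubernetes\.io/is-default-class}{"\n"}{end}'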

clayvan commented 2 years ago

Well, I take that back. I think it was in fact sporadic like others here are reporting. I don't know what the common pattern is though.

For me, we utilize spinnaker + helm template to generate a single manifest yaml file with all objects in it. This also includes those Ingress objects that are of the traefik/nginx class. If I remove these objects from the manifest, the kubectl apply succeeds just fine.

I'm at the point where I'm wondering if I need to break my helm chart up and have to separately deploy AWS LB controller (probably should have done this from the start).

Can everyone else confirm that they also only see this when applying a manifest file that includes both the AWS LB controller install AND ingress objects for nginx/traefik?

gazal-k commented 2 years ago

We use helmfile and specifically helmfile sync command to manage all applications within a cluster (some nesting of helmfiles). So it's like calling helm upgrade --install on all charts in the cluster when there's a change. The issue was sporadic for us.

We are currently just using the older version of the chart; 1.1.6. Our plan is to use albc for all our Ingresses, so we might upgrade once that is done. This might not be a workaround that everyone can use.

DmitriyStoyanov commented 2 years ago

Interestingly, today I faced this issue while deploying a helm chart with an ingress that has

annotations:
    kubernetes.io/ingress.class: alb
    alb.ingress.kubernetes.io/certificate-arn: arn:aws:acm:xxxxxxx:xxxxxxxx:certificate/xxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxx

and in the logs: Error: UPGRADE FAILED: cannot patch "helm-chart-name" with kind Ingress: Internal error occurred: failed calling webhook "vingress.elbv2.k8s.aws": Post "https://aws-load-balancer-webhook-service.kube-system.svc:443/validate-networking-v1beta1-ingress?timeout=10s": x509: certificate signed by unknown authority (possibly because of "crypto/rsa: verification error" while trying to verify candidate authority certificate "aws-load-balancer-controller-ca")

After a retry it is fixed, as many times before.

So possibly the issue is not only related to the two ingress controllers (nginx and alb) we have in the cluster, but also to the AWS ACM certificate check?

mahaffey commented 2 years ago

We are using an alb ingress with an acm cert. This error is happening periodically with no configuration changes.

log:

2021/09/16 06:56:31 http: TLS handshake error from 10.0.3.214:37640: remote error: tls: bad certificate
{"level":"error","ts":1631775391.605741,"logger":"controller","msg":"Reconciler error","controller":"service","name":"rabbitmq","namespace":"rabbitmq","error":"Internal error occurred: failed calling webhook \"vtargetgroupbinding.elbv2.k8s.aws\": Post https://aws-load-balancer-webhook-service.kube-system.svc:443/validate-elbv2-k8s-aws-v1beta1-targetgroupbinding?timeout=10s: x509: certificate signed by unknown authority (possibly because of \"crypto/rsa: verification error\" while trying to verify candidate authority certificate \"aws-load-balancer-controller-ca\")"}

ingress:

apiVersion: networking.k8s.io/v1beta1
kind: Ingress
metadata:
  annotations:
    alb.ingress.kubernetes.io/backend-protocol: HTTP
    alb.ingress.kubernetes.io/certificate-arn: arn:aws:acm:us-east-2:123456789:certificate/omit
    alb.ingress.kubernetes.io/healthcheck-interval-seconds: "30"
    alb.ingress.kubernetes.io/healthcheck-port: "15672"
    alb.ingress.kubernetes.io/healthcheck-protocol: HTTP
    alb.ingress.kubernetes.io/healthcheck-timeout-seconds: "20"
    alb.ingress.kubernetes.io/inbound-cidrs: 1.1.1.1/32,omit...
    alb.ingress.kubernetes.io/listen-ports: '[{"HTTPS":443},{"HTTP":80}]'
    alb.ingress.kubernetes.io/load-balancer-attributes: idle_timeout.timeout_seconds=300
    alb.ingress.kubernetes.io/scheme: internet-facing
    alb.ingress.kubernetes.io/ssl-policy: ELBSecurityPolicy-FS-1-2-2019-08
    alb.ingress.kubernetes.io/ssl-redirect: "443"
    alb.ingress.kubernetes.io/tags: omit
    alb.ingress.kubernetes.io/target-type: ip
    external-dns.alpha.kubernetes.io/hostname: rabbitmq.company.com
    kubernetes.io/ingress.class: alb
  labels:
    app.kubernetes.io/instance: rabbitmq
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: rabbitmq
    helm.sh/chart: rabbitmq-8.22.1
  name: rabbitmq
  namespace: rabbitmq
spec:
  rules:
  - host: rabbitmq.company.com
    http:
      paths:
      - backend:
          serviceName: rabbitmq-stats
          servicePort: http-stats
        path: /*
mahaffey commented 2 years ago

EDIT: this is not the case, I am still seeing the error. It is intermittent.

Update, I found that when I remove the annotation alb.ingress.kubernetes.io/target-type: ip this error ceases to occur

acim commented 2 years ago

Update, I found that when I remove the annotation alb.ingress.kubernetes.io/target-type: ip this error ceases to occur

I do not have this annotation at all but still facing the problem.

acim commented 2 years ago

I use helmfile and it shows diff of certificates on each run, like described in this issue: https://github.com/aws/eks-charts/issues/347

There is an undocumented value enableCertManager, which is false by default, so certificates get regenerated on each run of helmfile apply. I suspect this may be the reason, because if I run helmfile with the --selector flag and update just one chart, the error seems not to appear. But when I run helmfile over the whole stack, it appears almost always. It may be that the controller is not fast enough to pick up the reconfigured certificates. Just guessing. And to clear things up: I have no non-ALB ingresses, all are ALB.
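
If you want the cert-manager route instead of the chart's self-generated cert, it is just a helm value (a sketch; assumes cert-manager is already installed and my-cluster is a placeholder):

helm upgrade --install aws-load-balancer-controller eks/aws-load-balancer-controller \
  -n kube-system \
  --set clusterName=my-cluster \
  --set enableCertManager=true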

mahaffey commented 2 years ago

Update, I found that when I remove the annotation alb.ingress.kubernetes.io/target-type: ip this error ceases to occur

Update, I found that when I remove the annotation alb.ingress.kubernetes.io/target-type: ip this error ceases to occur

I do not have this annotation at all but still facing the problem.

Hey sorry about the confusion, I am still seeing this error!

gartemiev commented 2 years ago

I'm also having this issue.

@kishorj @M00nF1sh, any ETA on a fix?

gazal-k commented 2 years ago

Would https://github.com/kubernetes-sigs/aws-load-balancer-controller/pull/2264 resolve this issue?

ruckc commented 2 years ago

having it also

booleanbetrayal commented 2 years ago

Ran into this today. Fixed by pruning all related resources and re-installing.

scardena commented 2 years ago

I ran into this today after upgrading the helm deployment from APP version 2.0.1 to 2.2.4. I had to remove the deployment and reinstall.

henrysachs commented 2 years ago

I just ran into this via the CDK HelmChartProvider. The solution was to try it 3 times... As this is done via CloudFormation, the stack just rolled back and I could retry easily, but it feels kinda hacky.

tiwarisuraj9 commented 2 years ago

I face this issue on a regular basis while applying new/updated nginx ingresses configured to use an LB via the AWS load balancer controller. The only solution I found is to delete all resources of the AWS LB controller and reinstall. (I usually do it through argocd, so it's a pretty quick reinstallation)

caleb15 commented 2 years ago

@kishorj I've created a ticket with AWS support. CaseID 9282832861

Erlichooo commented 2 years ago

@kishorj I've created a ticket with AWS support. CaseID 9282832861

Did you get any reply?

caleb15 commented 2 years ago

Yep, they said they were working on fixing it but couldn't share an ETA.

alex-treebeard commented 2 years ago

Hi there, quick data point on this one.

I find that I get this error when I install both the controller and a CRD resource which uses it in the same helm chart, i.e. my chart has the controller as a subchart and a TargetGroupBinding as a template.

Here is my error:

Error: UPGRADE FAILED: failed to create resource: Internal error occurred: failed calling webhook "mtargetgroupbinding.elbv2.k8s.aws": Post "https://aws-load-balancer-webhook-service.xxx.svc:443/mutate-elbv2-k8s-aws-v1beta1-targetgroupbinding?timeout=10s": x509: certificate signed by unknown authority (possibly because of "crypto/rsa: verification error" while trying to verify candidate authority certificate "aws-load-balancer-controller-ca")

If I install the controller first, then install the targetgroupbinding after a delay, it works.
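
A minimal sketch of that ordering, assuming the default deployment name and namespace from the chart (the manifest file name is a placeholder):

# after the controller release is installed, wait for the webhook backend to be ready
kubectl rollout status deployment/aws-load-balancer-controller -n kube-system
# only then apply the manifests that contain TargetGroupBinding / Ingress objects
kubectl apply -f targetgroupbindings.yaml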

gazal-k commented 2 years ago

Reasonably confident this is the same as #2239. https://github.com/kubernetes-sigs/aws-load-balancer-controller/blob/2e181b2cf31d41ce812deb18da629a5b0631144e/helm/aws-load-balancer-controller/values.yaml#L137

We have set keepTLSSecret: true. Our GitOps pipeline that uses helmfile sync has executed a dozen times after the change and we have yet to see this issue.
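
If you go this route, it is worth confirming the value actually landed on the release and that the secret is no longer being regenerated (a sketch; assumes the default release name and kube-system namespace):

helm get values -n kube-system aws-load-balancer-controller | grep -i keeptlssecret
# the secret's creation timestamp should stay constant across upgrades
kubectl get secret -n kube-system aws-load-balancer-tls -o jsonpath='{.metadata.creationTimestamp}'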

stijnbrouwers commented 2 years ago

Reasonably confident this is the same as #2239.

https://github.com/kubernetes-sigs/aws-load-balancer-controller/blob/2e181b2cf31d41ce812deb18da629a5b0631144e/helm/aws-load-balancer-controller/values.yaml#L137

We have set keepTLSSecret: true. Our GitOps pipeline that uses helmfile sync has executed a dozen times after the change and we have yet to see this issue.

Can confirm that this solution also fixed it for me.

karanrn commented 2 years ago

We are installing it through helm via ArgoCD. We set keepTLSSecret: true, yet we still periodically see the issue.

jakepage commented 2 years ago

Reasonably confident this is the same as #2239. https://github.com/kubernetes-sigs/aws-load-balancer-controller/blob/2e181b2cf31d41ce812deb18da629a5b0631144e/helm/aws-load-balancer-controller/values.yaml#L137

We have set keepTLSSecret: true. Our GitOps pipeline that uses helmfile sync has executed a dozen times after the change and we have yet to see this issue.

Can confirm that this solution also fixed it for me.

I tried this, but the issue still keeps periodically popping up too

vkirilenko commented 2 years ago

Today I got the same error on an AWS cluster. I had to remove the namespace and reinstall everything from scratch.

arthurk commented 2 years ago

Still happening. aws-lb-controller version is latest 2.4.1 and keepTLSSecret is true.

sidewinder12s commented 2 years ago

I also ran into this today across multiple clusters. The TLS secret has not changed since deployment as far as I can tell and was not regenerated. Does anyone have any idea what's wrong?

As far as I can tell the TLS secret in the cluster from the helm chart should be valid for 10 years.
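
One thing worth checking here (a sketch using the secret name from earlier in the thread) is the issuer and validity window of the cert the chart generated:

kubectl get secret -n kube-system aws-load-balancer-tls -o jsonpath='{.data.tls\.crt}' \
  | base64 -d | openssl x509 -noout -dates -issuer -subject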

carozuniga commented 1 year ago

Still happening. The controller version is 2.4.3

arthurk commented 1 year ago

I've fixed the problem by generating a cert manually and setting it via webhookTLS. Using the certManager integration also works; both fix the problem for me. I've seen a few other charts that generate TLS certs for their webhooks and all of them have similar problems.
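
A minimal sketch of the manual-cert route (assumes the chart exposes webhookTLS.caCert / webhookTLS.cert / webhookTLS.key values and a default kube-system install; names and validity periods are placeholders):

# self-signed CA plus a cert whose SAN matches the webhook service DNS name
openssl req -x509 -newkey rsa:2048 -nodes -days 3650 \
  -keyout ca.key -out ca.crt -subj "/CN=aws-load-balancer-controller-ca"
openssl req -newkey rsa:2048 -nodes -keyout tls.key -out tls.csr \
  -subj "/CN=aws-load-balancer-webhook-service.kube-system.svc"
openssl x509 -req -in tls.csr -CA ca.crt -CAkey ca.key -CAcreateserial -days 3650 -out tls.crt \
  -extfile <(printf "subjectAltName=DNS:aws-load-balancer-webhook-service.kube-system.svc")
# pass the files to the chart via the webhookTLS values
helm upgrade --install aws-load-balancer-controller eks/aws-load-balancer-controller \
  -n kube-system --set clusterName=my-cluster \
  --set-file webhookTLS.caCert=ca.crt \
  --set-file webhookTLS.cert=tls.crt \
  --set-file webhookTLS.key=tls.key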

nickjj commented 1 year ago

I'm also seeing this on a very regular basis. At least 2-5 times a day.

Environment:

It's happened 164 times and I think it started in August 2022 when I updated the helm chart to 1.4.4. With 1.4.3 or older it only happened once. That means it happened 163 times with a combo of 1.4.4, 1.4.5 and 1.4.6. I'm basing this on looking at the last year's worth of Kubernetes events that match the same error message as this issue.
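
For anyone wanting to do a similar count, a rough sketch (in-cluster events are garbage-collected after about an hour by default, so year-long counts need an external event or log store):

kubectl get events -A -o json \
  | jq '[.items[] | select(.message != null) | select(.message | test("failed calling webhook .vingress"))] | length'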

LockedThread commented 1 year ago

Still happening, EKS, installed via helm version 1.5.3

omidraha commented 1 year ago

I got the same error using kubernetes.helm.sh/v3.Chart in Pulumi.

kubernetes.helm.v3.Chart(
    "lb-dev",
    kubernetes.helm.v3.ChartOpts(
        chart="aws-load-balancer-controller",
        fetch_opts=kubernetes.helm.v3.FetchOpts(
            repo="https://aws.github.io/eks-charts"
        ),
        namespace="aws-lb-controller-dev",
        values={
            "region": "us-west-2",
            "serviceAccount": {
                "name": "aws-lb-controller-serviceaccount",
                "create": False,
            },
            "vpcId": vpc.vpc_id,
            "clusterName": "cluster-dev",
            "podLabels": {
                "stack": stack,
                "app": "aws-lb-controller"
            },
            "autoDiscoverAwsRegion": "true",
            "autoDiscoverAwsVpcID": "true",
            "keepTLSSecret": True,
        },
    ), pulumi.ResourceOptions(
        provider=eks_provider, parent=namespace
    )
)
ingress = kubernetes.networking.v1.Ingress(
    f"app-ingress-dev",
    metadata=kubernetes.meta.v1.ObjectMetaArgs(
        name=f'ingress-dev',
        namespace="aws-lb-controller-dev",
        annotations={
            "kubernetes.io/ingress.class": "alb",
            "alb.ingress.kubernetes.io/target-type": "instance",
            "alb.ingress.kubernetes.io/scheme": "internet-facing"

        },
        labels={'app': 'ingress-dev'},
    ),
    spec=kubernetes.networking.v1.IngressSpecArgs(
        rules=[kubernetes.networking.v1.IngressRuleArgs(
            host='example.com',
            http=kubernetes.networking.v1.HTTPIngressRuleValueArgs(
                paths=[
                    kubernetes.networking.v1.HTTPIngressPathArgs(
                        path="/app1",
                        path_type="Prefix",
                        backend=kubernetes.networking.v1.IngressBackendArgs(
                            service=kubernetes.networking.v1.IngressServiceBackendArgs(
                                name=svc_01.metadata.name,
                                port=kubernetes.networking.v1.ServiceBackendPortArgs(
                                    number=80,
                                ),
                            ),
                        ),
                    ),
                ],
            ),
        )],
    ),
    opts=pulumi.ResourceOptions(provider=eks_provider)
)

Info

kubectl describe ingress  ingress-dev -n aws-lb-controller-dev
Name:             ingress-name-dev
Labels:           app=ingress-name-dev
                  app.kubernetes.io/managed-by=pulumi
Namespace:        aws-lb-controller-dev
Address:          k8s-awslbcon-ingressn-63280fface-152845941.us-west-2.elb.amazonaws.com
Ingress Class:    <none>
Default backend:  <default>
Rules:
  Host         Path  Backends
  ----         ----  --------
  example.com  
               /app1   eks-service-dev-443996a5:80 (10.0.18.89:80)
Annotations:   alb.ingress.kubernetes.io/scheme: internet-facing
               alb.ingress.kubernetes.io/target-type: instance
               kubernetes.io/ingress.class: alb
Events:
  Type     Reason                  Age                 From     Message
  ----     ------                  ----                ----     -------
  Warning  FailedDeployModel       23m (x14 over 24m)  ingress  Failed deploy model due to Internal error occurred: failed calling webhook "mtargetgroupbinding.elbv2.k8s.aws": failed to call webhook: Post "https://aws-load-balancer-webhook-service.aws-lb-controller-dev.svc:443/mutate-elbv2-k8s-aws-v1beta1-targetgroupbinding?timeout=10s": tls: failed to verify certificate: x509: certificate signed by unknown authority (possibly because of "crypto/rsa: verification error" while trying to verify candidate authority certificate "aws-load-balancer-controller-ca")
  Normal   SuccessfullyReconciled  22m (x3 over 3h7m)  ingress  Successfully reconciled