@gazal-k It seems the webhook's CA is configured incorrectly. If you installed from helm, did you revert the webhookConfiguration/cert Secret, causing them to drift?
The webhook for Ingress merely validates changes (for the security features we included with IngressClassParams support), and it will simply permit any changes for Ingresses using other IngressClasses.
It's fine to delete the ValidatingWebhookConfiguration if it's a single cluster use, but ideally you should fix the certificate.
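For reference, the deletion workaround is a one-liner (resource name as mentioned below in this thread; note the chart recreates the webhook configuration on the next install/upgrade):
kubectl delete validatingwebhookconfiguration aws-load-balancer-webhook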
I'm not entirely sure what may have happened, since it wasn't me personally who installed the chart. @anilkumarpasupuleti do you know if we reverted the webhookConfiguration/cert Secret during installation?
I just checked the installation instructions. It's not a straight helm install, is it? Do the CRDs have to be applied manually first? Could skipping that cause this issue?
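The instructions apply the CRDs separately before the helm install, something like (quoting from memory, so double-check the eks-charts README):
kubectl apply -k "github.com/aws/eks-charts/stable/aws-load-balancer-controller/crds?ref=master"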
Is this the Secret we are talking about: https://github.com/aws/eks-charts/blob/master/stable/aws-load-balancer-controller/templates/webhook.yaml#L131-L159?
We also checked the somewhat related #1909 and verified that the caBundle in the webhooks matched the Secret aws-load-balancer-tls.
It's fine to delete the ValidatingWebhookConfiguration if it's a single cluster use, but ideally you should fix the certificate.
Is that validatingwebhookconfigurations/aws-load-balancer-webhook? I'm not sure what you mean by "single cluster use". We plan on installing aws-load-balancer-controller on each of our clusters; I wasn't aware a single installation could manage multiple clusters (not that we'd need that anyway).
This issue does not exist on chart version 1.1.6; this is on k8s v1.18.
@anilkumarpasupuleti, the validating webhook is included in chart v1.2.0 (app v2.2.0) and later.
We got the error on chart version 1.2.0. We tried downgrading, and obviously with no validating webhook there were no errors. Anil was just mentioning that we're trying all of this on EKS 1.18 (not to be confused with the chart version).
@gazal-k could you compare the caBundle configured in your validating webhook aws-load-balancer-webhook and the secret aws-load-balancer-tls?
caBundle configured in the webhook:
kubectl get validatingwebhookconfigurations.admissionregistration.k8s.io aws-load-balancer-webhook -ojsonpath={.webhooks[0].clientConfig.caBundle} | base64 -d | openssl x509 -noout -text
CA certificate from the secret:
kubectl get secret -n kube-system aws-load-balancer-tls -ojsonpath="{.data.ca\.crt}" |base64 -d |openssl x509 -noout -text
Also verify the webhook certificate is issued by the configured CA:
kubectl get secret -n kube-system aws-load-balancer-tls -ojsonpath="{.data.tls\.crt}" |base64 -d |openssl x509 -noout -text
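To compare the two quickly, diffing fingerprints instead of the full text also works; any mismatch means the webhook's caBundle has drifted from the secret:
# CA fingerprint from the webhook configuration
kubectl get validatingwebhookconfigurations aws-load-balancer-webhook -o jsonpath='{.webhooks[0].clientConfig.caBundle}' | base64 -d | openssl x509 -noout -fingerprint
# CA fingerprint from the secret
kubectl get secret -n kube-system aws-load-balancer-tls -o jsonpath='{.data.ca\.crt}' | base64 -d | openssl x509 -noout -fingerprint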
I was able to verify installing & updating apps using other ingress controllers via helm & helmfile just fine. It seems a helmfile execution from our CI/CD pipeline is where we see the error. This might be a tooling issue. We'll try updating our pipeline containers with newer versions of helm & helmfile & see if that helps.
Thanks @kishorj & @M00nF1sh
I have also had this issue when using Polaris (https://artifacthub.io/packages/helm/fairwinds-stable/polaris) on version 4.0.4:
Error: UPGRADE FAILED: cannot patch "polaris" with kind Ingress: Internal error occurred: failed calling webhook "vingress.elbv2.k8s.aws": Post "https://aws-load-balancer-webhook-service.kube-system.svc:443/validate-networking-v1beta1-ingress?timeout=10s": x509: certificate signed by unknown authority (possibly because of "crypto/rsa: verification error" while trying to verify candidate authority certificate "aws-load-balancer-controller-ca")
We use cert-manager as detailed in the Polaris documentation and are currently running EKS 1.20.
@darrenwhighamfd Is this issue still happening in your cluster? Have you checked whether the cert stored in the secret matches the CA configured in the aws-load-balancer-webhook ValidatingWebhookConfiguration?
Faced the same issue with 2 ingress controllers installed on one cluster: aws-load-balancer-controller and nginx-ingress-controller. We have automated CI that updates to new helm versions every week, and after some upgrade, during creation of ingresses with the annotation kubernetes.io/ingress.class: "nginx", we hit this issue: the aws-load-balancer validation webhook started handling such requests and failing with a certificate error.
Error: UPGRADE FAILED: cannot patch "kube-prometheus-stack-grafana" with kind Ingress: Internal error occurred: failed calling webhook "vingress.elbv2.k8s.aws": Post "https://aws-load-balancer-webhook-service.kube-system.svc:443/validate-networking-v1beta1-ingress?timeout=10s": x509: certificate signed by unknown authority (possibly because of "crypto/rsa: verification error" while trying to verify candidate authority certificate "aws-load-balancer-controller-ca")
Fixed the issue by uninstalling the aws-load-balancer-controller helm chart and installing it back again.
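Roughly, with the release name and namespace used elsewhere in this thread (clusterName is whatever your cluster is called):
helm uninstall aws-load-balancer-controller -n kube-system
helm install aws-load-balancer-controller eks/aws-load-balancer-controller -n kube-system --set clusterName=my-cluster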
Hm, just now I see that the issue still spontaneously occurs when deploying some helm charts with an Ingress. Triggering the upgrade again deploys successfully, so the issue still exists. It looks like the webhook validation does not check annotations such as kubernetes.io/ingress.class or ingressClass in the Ingress spec.
@M00nF1sh yes, we still have this issue; currently we have had to revert to 1.1.6. Similar to @DmitriyStoyanov, the issue did not happen every time, but spontaneously, about one out of every three runs. The cert and secret look to match.
I guess that's our experience too. Maybe it wasn't the tooling; the issue was just sporadic. We had verified that the certs and secrets match; https://github.com/kubernetes-sigs/aws-load-balancer-controller/issues/2071#issuecomment-862869696
Ha, found that uninstalling the helm chart does not remove the secrets:
aws-externaldns-token-wkqpc kubernetes.io/service-account-token 3 177d
aws-load-balancer-controller-token-spwhd kubernetes.io/service-account-token 3 177d
So I've tried again:
helm uninstall -n kube-system aws-load-balancer-controller
k delete secret -n kube-system aws-load-balancer-controller-token-spwhd
k delete secret -n kube-system aws-externaldns-token-wkqpc
and then installed it again; by the way, it created 3 new secrets:
aws-externaldns-token-49mpg kubernetes.io/service-account-token 3 72s
aws-load-balancer-controller-token-rzcvj kubernetes.io/service-account-token 3 86s
aws-load-balancer-tls kubernetes.io/tls 3 49s
Possibly this will help. After that, deployment of our helm chart with an ingress worked fine. I don't know if this fixes the whole problem or not.
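A quick check that the stale secrets are really gone before reinstalling:
kubectl get secrets -n kube-system | grep -E 'aws-load-balancer|aws-externaldns'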
@DmitriyStoyanov, @gazal-k, @darrenwhighamfd if you are able to reproduce the issue, could you please create a support ticket with AWS support, including your cluster ARN?
Same problem here:
Error: UPGRADE FAILED: cannot patch "merged" with kind Ingress: Internal error occurred: failed calling webhook "vingress.elbv2.k8s.aws": Post "https://aws-load-balancer-webhook-service.kube-system.svc:443/validate-networking-v1beta1-ingress?timeout=10s": x509: certificate signed by unknown authority (possibly because of "crypto/rsa: verification error" while trying to verify candidate authority certificate "aws-load-balancer-controller-ca")
So I think I found my issue, and possibly what others here are seeing.
@DmitriyStoyanov was close in saying:
It looks like the webhook validation does not check annotations such as kubernetes.io/ingress.class or ingressClass in the Ingress spec.
But maybe the key here is that the Ingress spec uses ingressClassName, not ingressClass.
The documentation (https://kubernetes-sigs.github.io/aws-load-balancer-controller/v2.2/guide/ingress/ingress_class/#deprecated-kubernetesioingressclass-annotation) looks to be specific to the annotation for alb itself still being supported, but if you're using a secondary ingress controller (nginx or traefik), the annotation for those will cause the error reported above.
So the "fix" is to update any Ingress objects that are using the annotation style like kubernetes.io/ingress.class: traefik
to using the new IngressClass spec like:
apiVersion: networking.k8s.io/v1beta1
kind: Ingress
metadata:
  name: default-backend-traefik
  namespace: ingress
spec:
  ingressClassName: traefik
and also define your IngressClass object:
apiVersion: networking.k8s.io/v1beta1
kind: IngressClass
metadata:
  name: traefik
spec:
  controller: traefik.io/ingress-controller
If I had to guess, the AWS LB controller is not checking the annotations for ingress class, and is therefore defaulting your traefik Ingress objects to the ALB class, causing the error.
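A quick way to spot-check which Ingresses still rely on the deprecated annotation versus spec.ingressClassName (a sketch; the backslash-escaping of the dots in the annotation key is the fiddly part):
kubectl get ingress -A -o jsonpath='{range .items[*]}{.metadata.namespace}{"\t"}{.metadata.name}{"\t"}{.spec.ingressClassName}{"\t"}{.metadata.annotations.kubernetes\.io/ingress\.class}{"\n"}{end}'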
Well, I take that back. I think it was in fact sporadic, like others here are reporting. I don't know what the common pattern is, though.
For me, we utilize Spinnaker + helm template to generate a single manifest yaml file with all objects in it. This also includes the Ingress objects of the traefik/nginx class. If I remove those objects from the manifest, the kubectl apply succeeds just fine.
I'm at the point where I'm wondering if I need to break my helm chart up and have to separately deploy AWS LB controller (probably should have done this from the start).
Can everyone else confirm that they also only see this when applying a manifest file that includes both the AWS LB controller install AND ingress objects for nginx/traefik?
We use helmfile, specifically the helmfile sync command, to manage all applications within a cluster (with some nesting of helmfiles). So it's like calling helm upgrade --install on all charts in the cluster whenever there's a change. The issue was sporadic for us.
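For anyone unfamiliar, that workflow is roughly:
helmfile sync                                                # upgrade --install every release in the helmfile
helmfile --selector name=aws-load-balancer-controller sync   # or target a single release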
We are currently just using the older version of the chart, 1.1.6. Our plan is to use albc for all our Ingresses, so we might upgrade once that is done. This might not be a workaround that everyone can use.
Interesting: today I faced this issue while deploying a helm chart with an ingress with
annotations:
  kubernetes.io/ingress.class: alb
  alb.ingress.kubernetes.io/certificate-arn: arn:aws:acm:xxxxxxx:xxxxxxxx:certificate/xxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxx
and in the logs:
Error: UPGRADE FAILED: cannot patch "helm-chart-name" with kind Ingress: Internal error occurred: failed calling webhook "vingress.elbv2.k8s.aws": Post "https://aws-load-balancer-webhook-service.kube-system.svc:443/validate-networking-v1beta1-ingress?timeout=10s": x509: certificate signed by unknown authority (possibly because of "crypto/rsa: verification error" while trying to verify candidate authority certificate "aws-load-balancer-controller-ca")
After a retry it worked, as many times before. So possibly the issue is not only related to the two ingress controllers (nginx and alb) we have in the cluster, but also to the AWS ACM certificate check?
We are using an alb ingress with an ACM cert. This error is happening periodically with no configuration changes. Log:
2021/09/16 06:56:31 http: TLS handshake error from 10.0.3.214:37640: remote error: tls: bad certificate
{"level":"error","ts":1631775391.605741,"logger":"controller","msg":"Reconciler error","controller":"service","name":"rabbitmq","namespace":"rabbitmq","error":"Internal error occurred: failed calling webhook \"vtargetgroupbinding.elbv2.k8s.aws\": Post https://aws-load-balancer-webhook-service.kube-system.svc:443/validate-elbv2-k8s-aws-v1beta1-targetgroupbinding?timeout=10s: x509: certificate signed by unknown authority (possibly because of \"crypto/rsa: verification error\" while trying to verify candidate authority certificate \"aws-load-balancer-controller-ca\")"}
ingress:
apiVersion: networking.k8s.io/v1beta1
kind: Ingress
metadata:
  annotations:
    alb.ingress.kubernetes.io/backend-protocol: HTTP
    alb.ingress.kubernetes.io/certificate-arn: arn:aws:acm:us-east-2:123456789:certificate/omit
    alb.ingress.kubernetes.io/healthcheck-interval-seconds: "30"
    alb.ingress.kubernetes.io/healthcheck-port: "15672"
    alb.ingress.kubernetes.io/healthcheck-protocol: HTTP
    alb.ingress.kubernetes.io/healthcheck-timeout-seconds: "20"
    alb.ingress.kubernetes.io/inbound-cidrs: 1.1.1.1/32,omit...
    alb.ingress.kubernetes.io/listen-ports: '[{"HTTPS":443},{"HTTP":80}]'
    alb.ingress.kubernetes.io/load-balancer-attributes: idle_timeout.timeout_seconds=300
    alb.ingress.kubernetes.io/scheme: internet-facing
    alb.ingress.kubernetes.io/ssl-policy: ELBSecurityPolicy-FS-1-2-2019-08
    alb.ingress.kubernetes.io/ssl-redirect: "443"
    alb.ingress.kubernetes.io/tags: omit
    alb.ingress.kubernetes.io/target-type: ip
    external-dns.alpha.kubernetes.io/hostname: rabbitmq.company.com
    kubernetes.io/ingress.class: alb
  labels:
    app.kubernetes.io/instance: rabbitmq
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: rabbitmq
    helm.sh/chart: rabbitmq-8.22.1
  name: rabbitmq
  namespace: rabbitmq
spec:
  rules:
  - host: rabbitmq.company.com
    http:
      paths:
      - backend:
          serviceName: rabbitmq-stats
          servicePort: http-stats
        path: /*
EDIT: this is not the case, I am still seeing the error. It is intermittent.
Update: I found that when I remove the annotation alb.ingress.kubernetes.io/target-type: ip, this error ceases to occur.
Update: I found that when I remove the annotation alb.ingress.kubernetes.io/target-type: ip, this error ceases to occur.
I do not have this annotation at all but am still facing the problem.
I use helmfile and it shows a diff of the certificates on each run, as described in this issue: https://github.com/aws/eks-charts/issues/347
There is an undocumented value, enableCertManager, which is false by default, so certificates get regenerated on each run of helmfile apply. I suspect this may be the reason, because if I run helmfile with the --selector flag and update just one chart, the error seems not to appear, but when I run helmfile over the whole stack, it appears almost always. It could be that the controller is not fast enough to pick up the reconfigured certificates. Just guessing. And to be clear, I have no non-ALB ingresses; all are ALB.
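For reference, switching the chart to cert-manager-issued certificates, so helm stops regenerating them on every run, would look roughly like this (assumes cert-manager is already installed in the cluster):
helm upgrade aws-load-balancer-controller eks/aws-load-balancer-controller -n kube-system --reuse-values --set enableCertManager=true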
Hey sorry about the confusion, I am still seeing this error!
I'm also having this issue.
@kishorj @M00nF1sh, any ETA on a fix?
Would https://github.com/kubernetes-sigs/aws-load-balancer-controller/pull/2264 resolve this issue?
Having it also.
Ran into this today. Fixed by pruning all related resources and re-installing.
I ran into this today after upgrading the helm deployment from app version 2.0.1 to 2.2.4. I had to remove the deployment and reinstall.
I just ran into this via the CDK HelmChartProvider. The solution was to try it 3 times... As this is done via CloudFormation, the stack just rolled back and I could retry easily, but it feels kinda hacky.
I face this issue on a regular basis while applying new or updated nginx ingresses configured to use an LB via the AWS load balancer controller. The only solution I've found is to delete all AWS LB controller resources and reinstall. (I usually do it through ArgoCD, so reinstallation is pretty quick.)
@kishorj I've created a ticket with AWS support. CaseID 9282832861
Did you have any reply?
Yep, they said they were working on fixing it but couldn't share an ETA.
Hi there, quick data point on this one.
I find that I get this error when I install both the controller and a custom resource which uses it in the same helm chart, i.e. my chart has the controller as a subchart and a TargetGroupBinding as a template.
Here is my error:
Error: UPGRADE FAILED: failed to create resource: Internal error occurred: failed calling webhook "mtargetgroupbinding.elbv2.k8s.aws": Post "https://aws-load-balancer-webhook-service.xxx.svc:443/mutate-elbv2-k8s-aws-v1beta1-targetgroupbinding?timeout=10s": x509: certificate signed by unknown authority (possibly because of "crypto/rsa: verification error" while trying to verify candidate authority certificate "aws-load-balancer-controller-ca")
If I install the controller first, then install the TargetGroupBinding after a delay, it works.
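A sketch of that sequencing with plain helm/kubectl; --wait blocks until the controller pods, and hence the webhook endpoints, are ready (the manifest name here is hypothetical):
helm upgrade --install aws-load-balancer-controller eks/aws-load-balancer-controller -n kube-system --set clusterName=my-cluster --wait
kubectl -n kube-system rollout status deployment/aws-load-balancer-controller
kubectl apply -f my-targetgroupbinding.yaml   # hypothetical manifest containing the TargetGroupBinding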
Reasonably confident this is the same as #2239. https://github.com/kubernetes-sigs/aws-load-balancer-controller/blob/2e181b2cf31d41ce812deb18da629a5b0631144e/helm/aws-load-balancer-controller/values.yaml#L137
We have set keepTLSSecret: true. Our GitOps pipeline that uses helmfile sync has executed a dozen times since the change and we have yet to see this issue.
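With plain helm, that setting is just:
helm upgrade aws-load-balancer-controller eks/aws-load-balancer-controller -n kube-system --reuse-values --set keepTLSSecret=true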
Can confirm that setting keepTLSSecret: true also fixed it for me.
We are installing it through helm via ArgoCD. We set keepTLSSecret: true, yet we periodically see the issue.
I tried this, but the issue still keeps popping up periodically too.
Today I got the same error on an AWS cluster. I had to remove the namespace and reinstall everything from scratch.
Still happening. The aws-lb-controller version is the latest, 2.4.1, and keepTLSSecret is true.
I also ran into this today across multiple clusters. The TLS secret has not changed since deployment as far as I can tell and was not regenerated. Does anyone have any idea what's wrong? As far as I can tell, the TLS secret created by the helm chart should be valid for 10 years.
Still happening. The controller version is 2.4.3
I've fixed the problem by generating a cert manually and setting it via webhookTLS. Using the certManager integration also works; both fix the problem for me. I've seen a few other charts that generate TLS certs for their webhooks this way, and all of them have similar problems.
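A rough sketch of that manual route, assuming the chart's webhookTLS.caCert/cert/key values (check your chart version) and OpenSSL 1.1.1+ for -addext; since the cert is self-signed, the CA bundle and the cert are the same file:
openssl req -x509 -newkey rsa:2048 -nodes -days 3650 -keyout tls.key -out tls.crt -subj "/CN=aws-load-balancer-webhook-service.kube-system.svc" -addext "subjectAltName=DNS:aws-load-balancer-webhook-service.kube-system.svc"
helm upgrade aws-load-balancer-controller eks/aws-load-balancer-controller -n kube-system --reuse-values --set-file webhookTLS.caCert=tls.crt --set-file webhookTLS.cert=tls.crt --set-file webhookTLS.key=tls.key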
I'm also seeing this on a very regular basis. At least 2-5 times a day.
It's happened 164 times, and I think it started in August 2022 when I updated the helm chart to 1.4.4. With 1.4.3 or older it happened only once; that means it happened 163 times with some combination of 1.4.4, 1.4.5, and 1.4.6. I'm basing this on the last year's worth of Kubernetes events that match the same error message as this issue.
Still happening on EKS, installed via helm chart version 1.5.3.
I got the same error using kubernetes.helm.sh/v3.Chart in Pulumi:
kubernetes.helm.v3.Chart(
    "lb-dev",
    kubernetes.helm.v3.ChartOpts(
        chart="aws-load-balancer-controller",
        fetch_opts=kubernetes.helm.v3.FetchOpts(
            repo="https://aws.github.io/eks-charts"
        ),
        namespace="aws-lb-controller-dev",
        values={
            "region": "us-west-2",
            "serviceAccount": {
                "name": "aws-lb-controller-serviceaccount",
                "create": False,
            },
            "vpcId": vpc.vpc_id,
            "clusterName": "cluster-dev",
            "podLabels": {
                "stack": stack,
                "app": "aws-lb-controller"
            },
            "autoDiscoverAwsRegion": "true",
            "autoDiscoverAwsVpcID": "true",
            "keepTLSSecret": True,
        },
    ), pulumi.ResourceOptions(
        provider=eks_provider, parent=namespace
    )
)

ingress = kubernetes.networking.v1.Ingress(
    f"app-ingress-dev",
    metadata=kubernetes.meta.v1.ObjectMetaArgs(
        name=f'ingress-dev',
        namespace="aws-lb-controller-dev",
        annotations={
            "kubernetes.io/ingress.class": "alb",
            "alb.ingress.kubernetes.io/target-type": "instance",
            "alb.ingress.kubernetes.io/scheme": "internet-facing"
        },
        labels={'app': 'ingress-dev'},
    ),
    spec=kubernetes.networking.v1.IngressSpecArgs(
        rules=[kubernetes.networking.v1.IngressRuleArgs(
            host='example.com',
            http=kubernetes.networking.v1.HTTPIngressRuleValueArgs(
                paths=[
                    kubernetes.networking.v1.HTTPIngressPathArgs(
                        path="/app1",
                        path_type="Prefix",
                        backend=kubernetes.networking.v1.IngressBackendArgs(
                            service=kubernetes.networking.v1.IngressServiceBackendArgs(
                                name=svc_01.metadata.name,
                                port=kubernetes.networking.v1.ServiceBackendPortArgs(
                                    number=80,
                                ),
                            ),
                        ),
                    ),
                ],
            ),
        )],
    ),
    opts=pulumi.ResourceOptions(provider=eks_provider)
)
Info
kubectl describe ingress ingress-dev -n aws-lb-controller-dev
Name: ingress-name-dev
Labels: app=ingress-name-dev
app.kubernetes.io/managed-by=pulumi
Namespace: aws-lb-controller-dev
Address: k8s-awslbcon-ingressn-63280fface-152845941.us-west-2.elb.amazonaws.com
Ingress Class: <none>
Default backend: <default>
Rules:
Host Path Backends
---- ---- --------
example.com
/app1 eks-service-dev-443996a5:80 (10.0.18.89:80)
Annotations: alb.ingress.kubernetes.io/scheme: internet-facing
alb.ingress.kubernetes.io/target-type: instance
kubernetes.io/ingress.class: alb
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedDeployModel 23m (x14 over 24m) ingress Failed deploy model due to Internal error occurred: failed calling webhook "mtargetgroupbinding.elbv2.k8s.aws": failed to call webhook: Post "https://aws-load-balancer-webhook-service.aws-lb-controller-dev.svc:443/mutate-elbv2-k8s-aws-v1beta1-targetgroupbinding?timeout=10s": tls: failed to verify certificate: x509: certificate signed by unknown authority (possibly because of "crypto/rsa: verification error" while trying to verify candidate authority certificate "aws-load-balancer-controller-ca")
Normal SuccessfullyReconciled 22m (x3 over 3h7m) ingress Successfully reconciled
We are trying to migrate from ingress-nginx to aws-load-balancer-controller. We are starting by just installing the controller chart; the plan is to template our applications to use the new ingress class alb and then migrate them. But after installing aws-load-balancer-controller, we are seeing errors on our existing applications like:
cannot patch "app1-ingress" with kind Ingress: Internal error occurred: failed calling webhook "vingress.elbv2.k8s.aws": Post https://aws-load-balancer-webhook-service.kube-system.svc:443/validate-networking-v1beta1-ingress?timeout=10s: x509: certificate signed by unknown authority (possibly because of "crypto/rsa: verification error" while trying to verify candidate authority certificate "aws-load-balancer-controller-ca")
app1-ingress still uses kubernetes.io/ingress.class: nginx. Can we stop the webhook from intercepting those?