Closed thanhma closed 1 year ago
@thanhma, do you use auto-discovery? If so, does the controller resume once you import a new certificate for the domain in question?
@kishorj
Yes, I use certificate auto-discovery. But if the certificate renewal takes time, or is not able to renew, it will affect reconciliation of other ingresses in the group.
This also fails if the certificate ARN that is set on the ingress does not exist.
How to test:
alb.ingress.kubernetes.io/certificate-arn: .....
{"level":"error","ts":1637353847.5984075,"logger":"controller","msg":"Reconciler error","controller":"ingress","name":"blox-presto","namespace":"prd2150","error":"CertificateNotFound: Certificate 'arn:aws:acm:us-east-1:-REDACTED-:certificate/4REDACTEDdc' not found\n\tstatus code: 400, request id: XXXXX"}
And I think it is not related to ACM certificates: if any malformed ingress gets created in the cluster, the controller simply halts.
This seems to be a regression; the alb-ingress-controller suffered from this and it got fixed.
Let me know if I can help with testing etc., @M00nF1sh. tnx!
@jescarri I don't think it's a regression, since the controller's behavior has always been to stop reconciliation if the Ingress configuration is invalid. I'm assuming you are using IngressGroup. The current behavior is that if a single Ingress within an IngressGroup contains invalid configuration, the entire IngressGroup stops reconciling, but other IngressGroups are not impacted (it would be a regression if other IngressGroups were impacted).
We do have plans to optimize this in the future, see https://github.com/kubernetes-sigs/aws-load-balancer-controller/issues/2349
hrrm we are not using ingress-groups yet, so I guess all ingresses in the cluster are part of the default ingressGroup?
@jescarri No, by default ingresses don't belong to any IngressGroup. So in your case a single non-existent cert prevents other ingresses from reconciling? Let me do a test to confirm it.
@M00nF1sh yep, and not only the certs: it is any invalid ALB setting, like subnets, certs, timeouts etc.
@jescarri I just tested it and cannot reproduce the issue. Are you on the Kubernetes Slack? If so, you can find me there as @M00nF1sh and we can live debug.
hey @M00nF1sh sure, let me set up something, I will ping you there.
tnx!
hey @M00nF1sh we are still experiencing this; it's a weird situation that takes time to develop.
We have seen it happen when these conditions occur:
Hope this helps!
hey @M00nF1sh I've made a new discovery: if the certificate exists but has failed renewal, the controller says the certificate does not exist/is not found, instead of just continuing the work.
Hi,
We've run into a similar issue as well.
We currently have 2 ingresses. Ingress-1 has the following annotations:
alb.ingress.kubernetes.io/backend-protocol-version: GRPC
alb.ingress.kubernetes.io/listen-ports: '[{"HTTPS":443}]'
alb.ingress.kubernetes.io/target-type: ip
Ingress-2 has the following annotations:
alb.ingress.kubernetes.io/target-type: ip
Certificate for Ingress-1 is not uploaded to ACM.
Steps:
Create Ingress-1. The Ingress object is created, but no load balancer is assigned. We see a bunch of errors in the AWS Load Balancer Controller logs, which are expected since the cert is not found:
{"level":"error","ts":1648593364.4037778,"logger":"controller-runtime.manager.controller.ingress","msg":"Reconciler error","name":"corp","namespace":"","error":"ingress: prometheus-operator/thanos-query-grpc: none certificate found for host: thanos-grpc.dev-test8-labawsuse.lab.ppops.net"}
{"level":"error","ts":1648593364.5672228,"logger":"controller-runtime.manager.controller.ingress","msg":"Reconciler error","name":"corp","namespace":"","error":"ingress: prometheus-operator/thanos-query-grpc: none certificate found for host: thanos-grpc.dev-test8-labawsuse.lab.ppops.net"}
{"level":"error","ts":1648593364.8876278,"logger":"controller-runtime.manager.controller.ingress","msg":"Reconciler error","name":"corp","namespace":"","error":"ingress: prometheus-operator/thanos-query-grpc: none certificate found for host: thanos-grpc.dev-test8-labawsuse.lab.ppops.net"}
Create Ingress-2 (same ingressclass/group). The Ingress object is created, but no load balancer is assigned even though there is no cert dependency.
So basically any ingresses (same ingressclass/group) created after a problematic ingress get blocked. We can't even delete older working ingresses that were created before the problematic ingress unless we delete the problematic ingress.
This is a blocker for us. Can this issue be looked into soon?
Ideally the ALB Controller should ignore ingresses with issues (logging appropriate errors) and continue to reconcile other add/update/delete ingress actions.
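To make the difference between the reported behavior and the requested one concrete, here is a toy model in Python. It is only a sketch: the real controller is written in Go, and every name below (`Ingress`, `build_rules`, `reconcile_group_strict`, `reconcile_group_lenient`) is hypothetical and not taken from its code base. The "strict" function mirrors what is described in this thread (one invalid member aborts the whole group), while the "lenient" one skips the invalid member, records the error, and keeps going.

```python
from dataclasses import dataclass


class InvalidIngressError(Exception):
    """Raised when a (toy) ingress cannot be turned into ALB rules."""


@dataclass
class Ingress:
    # Hypothetical stand-in for an Ingress object; cert_found models
    # whether certificate discovery succeeded for its host.
    name: str
    cert_found: bool = True


def build_rules(ingress: Ingress) -> str:
    """Pretend to build ALB rules; fail if no certificate was found."""
    if not ingress.cert_found:
        raise InvalidIngressError(f"{ingress.name}: no certificate found")
    return f"rules-for-{ingress.name}"


def reconcile_group_strict(ingresses):
    """Behavior reported in this thread: the first invalid member
    raises, so nothing in the group gets reconciled."""
    return [build_rules(ing) for ing in ingresses]


def reconcile_group_lenient(ingresses):
    """Requested behavior: skip invalid members, report them, and
    reconcile the rest of the group."""
    applied, skipped = [], []
    for ing in ingresses:
        try:
            applied.append(build_rules(ing))
        except InvalidIngressError as err:
            skipped.append(str(err))
    return applied, skipped


group = [Ingress("ingress-1", cert_found=False), Ingress("ingress-2")]
applied, skipped = reconcile_group_lenient(group)
print("applied:", applied)
print("skipped:", skipped)
```

In the lenient variant the healthy `ingress-2` is still reconciled and the failure of `ingress-1` is surfaced separately, which is the trade-off being asked for here; whether partially applying a group is safe for a shared ALB is a design question this sketch does not answer.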
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
After a period of inactivity, lifecycle/stale is applied.
After further inactivity since lifecycle/stale was applied, lifecycle/rotten is applied.
After further inactivity since lifecycle/rotten was applied, the issue is closed.
You can:
/remove-lifecycle stale
/lifecycle rotten
/close
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/remove-lifecycle stale
/lifecycle stale
/lifecycle rotten
/close not-planned
@k8s-triage-robot: Closing this issue, marking it as "Not Planned".
Is there any way to manually delete the old config to get rid of the error? I had to use a new ALB group just because one is stuck in an error state and there is no way to delete it.
We're seeing this too in our clusters running v2.4.6
... what happened is someone deleted the cert referenced by an ingress, and the reconcile loop for the entire group then stops at that error, leaving other healthy ingresses un-reconciled.
Of course the work-around is to not delete the referenced cert until the ingress is deleted.
This is nuts. One mistaken deployment and the entire cluster is fubar'd (no deploy or undeploy of anything that contains an Ingress). You can't even delete the broken Ingress object without extra actions. You also cannot detect this problem without doing a human review of the aws-load-balancer-controller logs.
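Since the only signal today is in the controller logs, a small script can at least take over the manual review. The sketch below assumes the controller emits one JSON object per line with `level`, `msg`, `name`, `namespace` and `error` fields, as in the log lines quoted earlier in this thread; the function name `summarize_reconciler_errors` is made up for illustration.

```python
import json
from collections import Counter

# The first two lines are copied from this thread; the last two are
# made-up filler to show that non-error and non-JSON lines are ignored.
SAMPLE_LINES = [
    '{"level":"error","ts":1648593364.4037778,"logger":"controller-runtime.manager.controller.ingress","msg":"Reconciler error","name":"corp","namespace":"","error":"ingress: prometheus-operator/thanos-query-grpc: none certificate found for host: thanos-grpc.dev-test8-labawsuse.lab.ppops.net"}',
    '{"level":"error","ts":1648593364.5672228,"logger":"controller-runtime.manager.controller.ingress","msg":"Reconciler error","name":"corp","namespace":"","error":"ingress: prometheus-operator/thanos-query-grpc: none certificate found for host: thanos-grpc.dev-test8-labawsuse.lab.ppops.net"}',
    '{"level":"info","ts":1648593365.0,"msg":"hypothetical info line"}',
    'this line is not JSON',
]


def summarize_reconciler_errors(lines):
    """Count "Reconciler error" entries per (namespace, name, error)."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # not a JSON log line, skip it
        if not isinstance(entry, dict):
            continue
        if entry.get("level") != "error" or entry.get("msg") != "Reconciler error":
            continue
        key = (entry.get("namespace", ""), entry.get("name", ""), entry.get("error", ""))
        counts[key] += 1
    return counts


for (namespace, name, error), n in summarize_reconciler_errors(SAMPLE_LINES).most_common():
    print(f"{n}x namespace={namespace!r} name={name!r}: {error}")
```

To run it against a live cluster, the same function can be fed `sys.stdin` with the output of `kubectl logs` for the controller piped in; the namespace and deployment name to pass to `kubectl logs` depend on how the controller was installed. A group that keeps reappearing with the same error is a likely candidate for the stuck state described above.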
FYI, once you have a stuck Ingress object (one without a valid cert), here's how you get rid of it:
kubectl patch -n <namespace> ingress <ingress-name> -p '{"metadata":{"finalizers":null}}' --type=merge
kubectl delete -n <namespace> ingress <ingress-name>
... and if your namespace is failing to delete due to this problem, you can list the resources hanging on namespace delete with this (slow):
kubectl api-resources --verbs=list --namespaced -o name \
| xargs -n 1 kubectl get -n <namespace> --show-kind --ignore-not-found
... which will show an ingress object if the above is your problem.
Can we maybe get a config option (at controller level or ingress level) that causes it to "ignore ingress if certificate is missing"? ... but I would argue that should be the default behavior.
Is your feature request related to a problem? I have been using an ingress group to group about 30 ingresses in a single ALB. Each ingress has its own SSL certificate that was imported into ACM.
When a certificate has expired and, for some reason, we are unable to renew it, the ALB Controller starts failing to reconcile the whole ingress group with logs like this:
This stops us from creating new ingresses, or even deleting old ones, since the ALB will not be updated. We must manually delete the ingress with the expired certificate, or renew the certificate.
Describe the solution you'd like The ALB Controller can ignore an ingress with a certificate failure and continue to reconcile other add/update/delete ingress actions.
Describe alternatives you've considered None