kubernetes-sigs / aws-load-balancer-controller

A Kubernetes controller for Elastic Load Balancers
https://kubernetes-sigs.github.io/aws-load-balancer-controller/
Apache License 2.0

Ingress group stuck with one ingress certificate error #2241

Closed thanhma closed 1 year ago

thanhma commented 3 years ago

Is your feature request related to a problem? I have been using an ingress group to group about 30 ingresses under a single ALB. Each ingress has its own SSL certificate that was imported into ACM.
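
(For reference, a minimal sketch of one such grouped ingress; the names, namespace and host below are hypothetical placeholders, and the certificate is expected to be auto-discovered from ACM by matching the rule host:)

    apiVersion: networking.k8s.io/v1
    kind: Ingress
    metadata:
      name: app-1
      namespace: group-1
      annotations:
        # all ~30 ingresses share this group, and therefore one ALB
        alb.ingress.kubernetes.io/group.name: cluster-group-1
        alb.ingress.kubernetes.io/scheme: internet-facing
        alb.ingress.kubernetes.io/listen-ports: '[{"HTTPS":443}]'
        alb.ingress.kubernetes.io/target-type: ip
    spec:
      ingressClassName: alb
      rules:
        - host: app-1.example.net   # the cert for this host is looked up in ACM
          http:
            paths:
              - path: /
                pathType: Prefix
                backend:
                  service:
                    name: app-1
                    port:
                      number: 80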

When a certificate has expired and, for some reason, we are unable to renew it, the ALB Controller starts failing to reconcile the whole ingress group, with logs like this:

{"level":"error","ts":1632213733.1759012,"logger":"controller","msg":"Reconciler error","controller":"ingress","name":"cluster-group-1","namespace":"","error":"ingress: group-1/expired-cert-ingress: none certificate found for host: expired-domain.net"}

This prevents us from creating new ingresses, or even deleting old ones, since the ALB will not be updated. We must manually delete the ingress with the expired certificate, or renew the certificate.

Describe the solution you'd like: The ALB Controller could ignore an ingress with a certificate failure and continue to reconcile the other add/update/delete ingress actions.

Describe alternatives you've considered: None

kishorj commented 3 years ago

@thanhma, do you use auto-discovery? If so, does the controller resume once you import new certificate for the domain under question?

thanhma commented 3 years ago

@kishorj

Yes, I use certificate auto-discovery. But if the certificate renewal takes time, or is not able to renew, it will affect reconciliation of other ingresses in the group.

jescarri commented 2 years ago

This also fails if the certificate ARN that is set on the ingress does not exist.

How to test:

alb.ingress.kubernetes.io/certificate-arn: .....
{"level":"error","ts":1637353847.5984075,"logger":"controller","msg":"Reconciler error","controller":"ingress","name":"blox-presto","namespace":"prd2150","error":"CertificateNotFound: Certificate 'arn:aws:acm:us-east-1:-REDACTED-:certificate/4REDACTEDdc' not found\n\tstatus code: 400, request id: XXXXX"}
jescarri commented 2 years ago

And I think it is not related to ACM certificates; if any malformed ingress gets created in the cluster, the controller's operation just halts.

This seems to be a regression; the old alb-ingress-controller suffered from this and it got fixed.

Let me know if I can help with testing etc., @M00nF1sh. Thanks!

M00nF1sh commented 2 years ago

@jescarri I don't think it's a regression, since the controller's behavior has always been to stop reconciliation if an Ingress configuration is invalid. I'm assuming you are using IngressGroup; the current behavior is that if a single Ingress within an IngressGroup contains an invalid configuration, the entire IngressGroup stops reconciling, but other IngressGroups are unaffected (it would be a regression if other IngressGroups were impacted).

We do have plans to optimize this in the future, see https://github.com/kubernetes-sigs/aws-load-balancer-controller/issues/2349
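
(A hedged note for readers: given the behavior described above, one way to limit the blast radius today is to split ingresses across explicit groups via the group.name annotation, since only the group containing the invalid Ingress stops reconciling. The group names below are hypothetical:)

    # On team A's ingresses — a broken cert here stalls only team A's ALB:
    alb.ingress.kubernetes.io/group.name: team-a
    # On team B's ingresses — reconciled independently, on a separate ALB:
    alb.ingress.kubernetes.io/group.name: team-b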

jescarri commented 2 years ago

Hmm, we are not using ingress groups yet, so I guess all ingresses in the cluster are part of the default IngressGroup?

M00nF1sh commented 2 years ago

@jescarri No, by default ingresses don't belong to any IngressGroup. So in your case a single non-existent cert prevents other ingresses from reconciling? Let me do a test to confirm it.

jescarri commented 2 years ago

@M00nF1sh Yep, and not only the certs; it's any invalid ALB setting, like subnets or certs, timeouts, etc.

M00nF1sh commented 2 years ago

@jescarri I just tested it and cannot reproduce the issue. Are you on the Kubernetes Slack? If so, you can find me at @M00nF1sh and we can live-debug there.

jescarri commented 2 years ago

hey @M00nF1sh sure, let me set up something, I will ping you there.

tnx!

jescarri commented 2 years ago

Hey @M00nF1sh, we are still experiencing this; it's a weird situation that takes time to develop.

We have seen it happen when these conditions occur:

Hope this helps!

jescarri commented 2 years ago

Hey @M00nF1sh, I've made a new discovery: if the certificate exists but its renewal has failed, the controller says the certificate does not exist / is not found, instead of just continuing with its work.

vasu-git commented 2 years ago

Hi,

We've run into a similar issue as well.

We currently have 2 ingresses. Ingress-1 has the following annotations:

   alb.ingress.kubernetes.io/backend-protocol-version: GRPC
   alb.ingress.kubernetes.io/listen-ports: '[{"HTTPS":443}]'
   alb.ingress.kubernetes.io/target-type: ip

Ingress-2 has the following annotation: alb.ingress.kubernetes.io/target-type: ip

Certificate for Ingress-1 is not uploaded to ACM.
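
(For context, a hedged sketch of what Ingress-1 presumably looks like; the ingress name, namespace and host are taken from the error logs below, while the backing service name and port are placeholders:)

    apiVersion: networking.k8s.io/v1
    kind: Ingress
    metadata:
      name: thanos-query-grpc
      namespace: prometheus-operator
      annotations:
        alb.ingress.kubernetes.io/backend-protocol-version: GRPC
        alb.ingress.kubernetes.io/listen-ports: '[{"HTTPS":443}]'
        alb.ingress.kubernetes.io/target-type: ip
    spec:
      ingressClassName: alb
      rules:
        - host: thanos-grpc.dev-test8-labawsuse.lab.ppops.net   # no matching cert in ACM
          http:
            paths:
              - path: /
                pathType: Prefix
                backend:
                  service:
                    name: thanos-query   # placeholder service name
                    port:
                      number: 10901      # placeholder port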

Steps:

  1. Create Ingress-1. The Ingress object is created but the load balancer is not assigned. We see a bunch of errors in the aws-load-balancer-controller logs, which are expected since the cert is not found:

    {"level":"error","ts":1648593364.4037778,"logger":"controller-runtime.manager.controller.ingress","msg":"Reconciler error","name":"corp","namespace":"","error":"ingress: prometheus-operator/thanos-query-grpc: none certificate found for host: thanos-grpc.dev-test8-labawsuse.lab.ppops.net"}
    {"level":"error","ts":1648593364.5672228,"logger":"controller-runtime.manager.controller.ingress","msg":"Reconciler error","name":"corp","namespace":"","error":"ingress: prometheus-operator/thanos-query-grpc: none certificate found for host: thanos-grpc.dev-test8-labawsuse.lab.ppops.net"}
    {"level":"error","ts":1648593364.8876278,"logger":"controller-runtime.manager.controller.ingress","msg":"Reconciler error","name":"corp","namespace":"","error":"ingress: prometheus-operator/thanos-query-grpc: none certificate found for host: thanos-grpc.dev-test8-labawsuse.lab.ppops.net"}
  2. Create Ingress-2 (same ingressclass/group). The Ingress object is created, but the load balancer is not assigned, even though there is no cert dependency.

So basically any ingresses (same ingressclass/group) created after a problematic ingress get blocked. We can't even delete older, working ingresses that were created before the problematic ingress unless we delete the problematic ingress.

This is a blocker for us. Can this issue be looked into soon?

Ideally, the ALB Controller should "ignore / log appropriate errors" for ingresses with issues and continue to reconcile the other add/update/delete ingress actions.

k8s-triage-robot commented 2 years ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

M00nF1sh commented 2 years ago

/remove-lifecycle stale

k8s-triage-robot commented 2 years ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot commented 1 year ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot commented 1 year ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-ci-robot commented 1 year ago

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to [this](https://github.com/kubernetes-sigs/aws-load-balancer-controller/issues/2241#issuecomment-1328302886):

> The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
>
> This bot triages issues according to the following rules:
> - After 90d of inactivity, `lifecycle/stale` is applied
> - After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
> - After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed
>
> You can:
> - Reopen this issue with `/reopen`
> - Mark this issue as fresh with `/remove-lifecycle rotten`
> - Offer to help out with [Issue Triage][1]
>
> Please send feedback to sig-contributor-experience at [kubernetes/community](https://github.com/kubernetes/community).
>
> /close not-planned
>
> [1]: https://www.kubernetes.dev/docs/guide/issue-triage/

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.
mk2134226 commented 1 year ago

Is there any way to manually delete the old config to get rid of the error? I had to use a new ALB group just because one ingress is stuck in an error state and there is no way to delete it.

thelabdude commented 1 year ago

We're seeing this too in our clusters running v2.4.6 ... what happened is someone deleted the cert referenced by an ingress, and the reconcile loop for the entire group then stops at that error, leaving the other healthy ingresses un-reconciled.

Of course the work-around is to not delete the referenced cert until the ingress is deleted.

ryankenney-dev commented 2 months ago

This is nuts. One mistaken deployment and the entire cluster is fubar'd (no deploy or undeploy of anything that contains an Ingress). You can't even delete the broken Ingress object without extra actions. You also cannot detect this problem without doing a human review of the aws-load-balancer-controller logs.
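
(For anyone trying to detect this, a quick check — assuming the controller runs as the usual aws-load-balancer-controller deployment in kube-system; adjust the namespace/name to your install:)

    # surface reconcile failures from the controller logs
    kubectl logs -n kube-system deployment/aws-load-balancer-controller | grep -i 'Reconciler error'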

FYI, once you have a stuck Ingress object (one without a valid cert), here's how you get rid of it:

kubectl patch -n <namespace> ingress <ingress-name> -p '{"metadata":{"finalizers":null}}' --type=merge
kubectl delete -n <namespace> ingress <ingress-name>

... and if your namespace is failing to delete due to this problem, you can list the resources hanging on namespace delete with this (slow):

kubectl api-resources --verbs=list --namespaced -o name \
    | xargs -n 1 kubectl get -n <namespace> --show-kind --ignore-not-found

... which will show an ingress object if the above is your problem.

Can we maybe get a config option (at the controller level or ingress level) that causes it to "ignore an ingress if its certificate is missing"? ... though I would argue that should be the default behavior.