kubernetes-sigs / aws-load-balancer-controller

A Kubernetes controller for Elastic Load Balancers
https://kubernetes-sigs.github.io/aws-load-balancer-controller/
Apache License 2.0

ALB ingress reconcile stops processing at first failure #1332

Closed chancez closed 3 years ago

chancez commented 4 years ago

If you create an ingress resource that references a mix of existing and non-existent services, the controller stops reconciling after the first failed lookup of a non-existent service, leaving the ingress partially configured.

This means that if you have an ingress with a dozen or so rules and the first rule references a service that doesn't exist, you end up with a load balancer that has no listeners and no rules.
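
For illustration, a minimal sketch of an Ingress with that shape (it uses the networking.k8s.io/v1 schema; example-web-service is a hypothetical name, and example-compute-service is assumed not to exist yet, as in the logs below):

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: example-gateway
  namespace: example-dev
  annotations:
    kubernetes.io/ingress.class: alb
spec:
  rules:
    - http:
        paths:
          # The first rule references a service that does not exist yet;
          # reconciliation fails here and nothing below gets configured.
          - path: /compute
            pathType: Prefix
            backend:
              service:
                name: example-compute-service
                port:
                  number: 8000
          # This rule references a service that does exist, but it never
          # makes it onto the load balancer.
          - path: /web
            pathType: Prefix
            backend:
              service:
                name: example-web-service
                port:
                  number: 8000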

Ideally, it would skip over the missing service, route that rule to the default backend instead, and continue configuring the rest of the listeners and rules for the load balancer.

As it stands, a misconfiguration or a backend service being deleted (for any reason) can cause an ingress to ignore all further updates to its rules, which can lead to unexpected results. When a single ingress fronts multiple services, this is very undesirable.

For example, the following is what I'm seeing for one of my ingresses:

2020-07-27T09:54:29-07:00 I0727 16:54:29.387701       1 security_group.go:36] example-dev/example-gateway: creating securityGroup 665168b2-exampledev-examplegateway-1f77:managed LoadBalancer securityGroup by ALB Ingress Controller
2020-07-27T09:54:29-07:00 I0727 16:54:29.523949       1 tags.go:69] example-dev/example-gateway: modifying tags {  ingress.k8s.aws/resource: "ManagedLBSecurityGroup",  kubernetes.io/cluster-name: "dev-2447-us-west-2",  kubernetes.io/namespace: "example-dev",  kubernetes.io/ingress-name: "example-gateway",  ingress.k8s.aws/cluster: "dev-2447-us-west-2",  ingress.k8s.aws/stack: "example-dev/example-gateway"} on sg-07609fc42cbb5751c
2020-07-27T09:54:29-07:00 I0727 16:54:29.654996       1 security_group.go:75] example-dev/example-gateway: granting inbound permissions to securityGroup sg-07609fc42cbb5751c: [{    FromPort: 443,    IpProtocol: "tcp",    IpRanges: [{        CidrIp: "0.0.0.0/0",        Description: "Allow ingress on port 443 from 0.0.0.0/0"      }],    ToPort: 443  }]
2020-07-27T09:54:29-07:00 I0727 16:54:29.839468       1 loadbalancer.go:194] example-dev/example-gateway: creating LoadBalancer 665168b2-exampledev-examplegateway-1f77
2020-07-27T09:54:30-07:00 I0727 16:54:30.404545       1 loadbalancer.go:211] example-dev/example-gateway: LoadBalancer 665168b2-exampledev-examplegateway-1f77 created, ARN: arn:aws:elasticloadbalancing:us-west-2:117923182973:loadbalancer/app/665168b2-exampledev-examplegateway-1f77/b2c7649aa487c76a
2020-07-27T09:54:31-07:00 I0727 16:54:31.085131       1 targetgroup.go:119] example-dev/example-gateway: creating target group 665168b2-b1ea9f3f0e0c8391a09
2020-07-27T09:54:31-07:00 I0727 16:54:31.248633       1 targetgroup.go:138] example-dev/example-gateway: target group 665168b2-b1ea9f3f0e0c8391a09 created: arn:aws:elasticloadbalancing:us-west-2:117923182973:targetgroup/665168b2-b1ea9f3f0e0c8391a09/5975c410a2334bc4
2020-07-27T09:54:31-07:00 I0727 16:54:31.270235       1 tags.go:43] example-dev/example-gateway: modifying tags {  kubernetes.io/service-port: "8000",  ingress.k8s.aws/resource: "example-dev/example-gateway-example-compute-service:8000",  kubernetes.io/cluster/dev-2447-us-west-2: "owned",  kubernetes.io/namespace: "example-dev",  kubernetes.io/ingress-name: "example-gateway",  ingress.k8s.aws/cluster: "dev-2447-us-west-2",  ingress.k8s.aws/stack: "example-dev/example-gateway",  kubernetes.io/service-name: "example-compute-service"} on arn:aws:elasticloadbalancing:us-west-2:117923182973:targetgroup/665168b2-b1ea9f3f0e0c8391a09/5975c410a2334bc4
2020-07-27T09:54:31-07:00 E0727 16:54:31.319585       1 controller.go:217] kubebuilder/controller "msg"="Reconciler error" "error"="failed to reconcile targetGroups due to failed to reconcile targetGroup targets due to Unable to find the example-dev/example-compute-service service: no object matching key \"example-dev/example-compute-service\" in local store"  "controller"="alb-ingress-controller" "request"={"Namespace":"example-dev","Name":"example-gateway"}
2020-07-27T09:54:32-07:00 E0727 16:54:32.579188       1 controller.go:217] kubebuilder/controller "msg"="Reconciler error" "error"="failed to reconcile targetGroups due to failed to reconcile targetGroup targets due to Unable to find the example-dev/example-compute-service service: no object matching key \"example-dev/example-compute-service\" in local store"  "controller"="alb-ingress-controller" "request"={"Namespace":"example-dev","Name":"example-gateway"}
chancez commented 4 years ago

Also, in these circumstances, failures are not surfaced as events (this is the only event; there is nothing about listeners, rules, or the missing service backend):

Events:
  Type    Reason  Age    From                    Message
  ----    ------  ----   ----                    -------
  Normal  CREATE  5m37s  alb-ingress-controller  LoadBalancer 665168b2-exampledev-examplegateway-1f77 created, ARN: arn:aws:elasticloadbalancing:us-west-2:117923182973:loadbalancer/app/665168b2-exampledev-examplegateway-1f77/b2c7649aa487c76a
chancez commented 4 years ago

Seems to be the same as https://github.com/kubernetes-sigs/aws-alb-ingress-controller/issues/1304

M00nF1sh commented 4 years ago

Hi, I think we should add events to the Ingress object about such errors. However, I don't think we should ignore rules when the service is missing. The rules can have dependencies, and the Ingress should work as a whole. For example:

/auth   -> service-missing
/*  -> service-exists

If we ignore the first /auth rule, it changes the application behavior completely: /auth would be routed to service-exists.

chancez commented 4 years ago

@M00nF1sh It doesn't have to work that way, though. I'm not saying to ignore the rule, but to ignore the missing service. You can still define the routing rule in the ALB; you just point it at something like a default backend, or even at a non-existent nodePort.
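
For reference, a minimal sketch of what that could look like on the Ingress side (v1 schema, hypothetical service name); the idea is that the controller could point a rule whose service is missing at something like this instead of aborting the whole reconcile:

spec:
  defaultBackend:
    service:
      name: default-http-backend   # hypothetical catch-all service
      port:
        number: 80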

larosek commented 4 years ago

We encountered a similar issue today that had the same end result but for a different reason. We had a bad certificate configuration: alb.ingress.kubernetes.io/certificate-arn: ${ssl_certificate_alb},${secondary_certificate}

It could not reconcile properly because ${secondary_certificate} was a bad variable and the certificate was not deployed in the region where this cluster was running. Once the controller hits this error, it stops and does not apply any of the rules.

I0807 02:08:35.666873       1 listener.go:185] example-namespace/example-app: adding certificate arn:aws:acm:us-east-1:blabla:certificate/guid to listener arn:aws:elasticloadbalancing:ap-southeast-2:blabla:listener/app/myawesomeapp
E0807 02:08:35.749797       1 controller.go:217] kubebuilder/controller "msg"="Reconciler error" "error"="failed to reconcile listeners due to failed to reconcile extra certificates on listener arn:aws:elasticloadbalancing:ap-southeast-2:blabla:listener/app/myawesomeapp: CertificateNotFound: Certificate 'arn:aws:acm:us-east-1:blabla:certificate/guid' not found\n\tstatus code: 400, request id: guid"  "controller"="alb-ingress-controller" "request"={"Namespace":"example-namespace","Name":"example-app"}

You could have potato.awesomecorp.com -> /potato and banana.awesomecorp.com -> /banana. If one of the certificates is invalid, you essentially break everything, even though the two are most likely unrelated.
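
To make the failure mode concrete, the annotation in question takes a comma-separated list of certificate ARNs (account IDs and certificate IDs below are placeholders); if any single ARN cannot be resolved in the load balancer's region, the listener reconcile fails and none of the host rules are applied:

metadata:
  annotations:
    # The first ARN resolves; the second points at a certificate that only
    # exists in us-east-1 while the ALB lives in ap-southeast-2.
    alb.ingress.kubernetes.io/certificate-arn: arn:aws:acm:ap-southeast-2:111111111111:certificate/valid-cert-id,arn:aws:acm:us-east-1:111111111111:certificate/missing-cert-id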

fejta-bot commented 3 years ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle stale

chancez commented 3 years ago

/remove-lifecycle stale

fejta-bot commented 3 years ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle stale

fejta-bot commented 3 years ago

Stale issues rot after 30d of inactivity. Mark the issue as fresh with /remove-lifecycle rotten. Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community. /lifecycle rotten

fejta-bot commented 3 years ago

Rotten issues close after 30d of inactivity. Reopen the issue with /reopen. Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-contributor-experience at kubernetes/community. /close

k8s-ci-robot commented 3 years ago

@fejta-bot: Closing this issue.

In response to [this](https://github.com/kubernetes-sigs/aws-load-balancer-controller/issues/1332#issuecomment-813089441):

> Rotten issues close after 30d of inactivity.
> Reopen the issue with `/reopen`.
> Mark the issue as fresh with `/remove-lifecycle rotten`.
>
> Send feedback to sig-contributor-experience at [kubernetes/community](https://github.com/kubernetes/community).
> /close

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.

EdwinPhilip commented 2 years ago

/reopen

k8s-ci-robot commented 2 years ago

@EdwinPhilip: You can't reopen an issue/PR unless you authored it or you are a collaborator.

In response to [this](https://github.com/kubernetes-sigs/aws-load-balancer-controller/issues/1332#issuecomment-1091639124):

> /reopen

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.

yiuc commented 1 year ago

Looks like the problem still exists. Should I assume this is expected behavior?