kubernetes-sigs / aws-load-balancer-controller

A Kubernetes controller for Elastic Load Balancers
https://kubernetes-sigs.github.io/aws-load-balancer-controller/
Apache License 2.0

Sporadic Fixed Error 404s due to Ingress Controller Using ModifyRule API #3816

Open dieherna opened 3 weeks ago

dieherna commented 3 weeks ago

Describe the bug

We observe sporadic fixed-response 404 errors on an EKS cluster hosting multiple container services. The services are fronted by a single ALB using host-based routing. Every service has a unique FQDN; there are no overlapping domains. The services are configured on the ALB by the ingress controller using ingress groups. When we deploy a new service version, we run it in parallel with the existing version and delete the old version a few days later. When the old version is deleted, the ALB responds with fixed 404 responses for a few seconds before recovering. All changes to the ALB are made exclusively through the ingress controller, never manually.

[ROOT CAUSE] Troubleshooting with the AWS ALB service team concluded that the issue is caused by a race condition. The service team told us that the race condition can be avoided if the ingress controller uses the SetRulePriorities API instead of recreating that functionality through many ModifyRule API calls. To summarize the events that lead to the race condition:

Scenario: The customer has 30 services deployed, each with a unique FQDN. A single ALB is managed by the ingress controller through ingress groups, with a single ALB listener holding 30 rules (one per service).

  1. A new service version (service-a-v2.xyz.com) is deployed and runs in parallel with the existing service (service-a-v1.xyz.com). service-a-v1.xyz.com has an ALB rule priority of 3.
  2. service-a-v1.xyz.com is deleted. The ingress controller sends a DeleteRule API call to the ALB. The remaining ALB rules have priorities 1-2 and 4-30.
    1. Implicitly (under the hood), the ingress controller sends 27 ModifyRule API requests to the ALB to keep the rule priorities sequential (1-29). Each ModifyRule call replaces the FQDN of the rule above with the FQDN of the rule below, so every FQDN ends up shifted up by one priority.
    2. The ALB implements changes in batches. When the first API call is received, the ALB waits 10 seconds to batch any subsequent changes before deploying them to the data plane. This can lead to a race condition in which some of the ModifyRule calls arrive after the first window closes and are implemented in the next window. For a short period, a random rule does not exist on the ALB; for example, service-z.xyz.com responds with fixed 404s.
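The sequence above can be sketched as a toy model. This is not the controller's code: it assumes a listener is just a mapping from priority to host, that each ModifyRule call rewrites one slot in place with the host from the slot below it, and that one call misses the first batching window. Removal of the trailing duplicate rule is not modelled. All function names, hosts, and rule ARNs are hypothetical. The last function shows the shape of a single SetRulePriorities request that would renumber the same rules without rewriting any conditions.

```python
# Toy model of one ALB listener: priority -> host header. All names are hypothetical.

def build_listener(count):
    """Return a listener with `count` host-based rules at priorities 1..count."""
    return {p: f"service-{p}.xyz.com" for p in range(1, count + 1)}


def plan_modify_rules(rules, deleted_priority):
    """One in-place rewrite per slot: slot p takes the host currently in slot p + 1."""
    return [(p, rules[p + 1]) for p in range(deleted_priority, max(rules))]


def apply_window(rules, calls):
    """Apply the calls that landed in one batching window and return the new listener."""
    updated = dict(rules)
    updated.update(calls)
    return updated


def missing_hosts(rules, wanted):
    """Hosts that should be routable but match no rule (the ALB would answer 404)."""
    return sorted(set(wanted) - set(rules.values()))


def plan_set_rule_priorities(rule_arns, deleted_priority):
    """Single request body in the shape of ELBv2 SetRulePriorities."""
    return {
        "RulePriorities": [
            {"RuleArn": arn, "Priority": p - 1}
            for p, arn in sorted(rule_arns.items())
            if p > deleted_priority
        ]
    }


listener = build_listener(30)
deleted = 3
wanted = [host for p, host in listener.items() if p != deleted]

calls = plan_modify_rules(listener, deleted)
print(len(calls))  # 27

# Assumption: the first call misses the window and lands in the next one.
first_window = apply_window(listener, calls[1:])
print(missing_hosts(first_window, wanted))  # ['service-4.xyz.com']

second_window = apply_window(first_window, calls[:1])
print(missing_hosts(second_window, wanted))  # []

# Hypothetical rule ARNs, one per priority, to show the single-request alternative.
arns = {p: f"rule-arn-{p}" for p in listener}
request = plan_set_rule_priorities(arns, deleted)
print(len(request["RulePriorities"]))  # 27 entries, but one API call
```

In this model a host only disappears when the calls applied in a window are not a contiguous prefix of the plan, which is one possible reading of the batching behaviour described by the service team.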

Steps to reproduce

  1. Create multiple services with ingress.group, e.g. 3 services.
  2. Delete a service, preferably the one deployed on the ALB with the highest rule priority.
    1. By default, rules are ordered lexically.
    2. I suspect that the higher the deleted rule's priority (that is, the lower its priority number), the more likely the issue is to trigger, because more ModifyRule API calls are generated.
  3. Watch all the API calls being made via CloudTrail.
  4. Watch the HTTPCode_ELB_4XX_Count ALB CloudWatch metric for any hits.
    1. The issue is sporadic and is not seen every time; it may require many attempts.
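For steps 3 and 4, a minimal sketch of the two queries might look like the following. It only builds the request parameters for CloudWatch GetMetricStatistics and CloudTrail LookupEvents; the helper names, the load balancer dimension value, and the 15-minute window are assumptions for illustration. Note that HTTPCode_ELB_4XX_Count counts every 4XX generated by the load balancer, not only 404s.

```python
from datetime import datetime, timedelta, timezone


def elb_4xx_query(load_balancer_dimension, minutes=15):
    """Keyword arguments for CloudWatch GetMetricStatistics on HTTPCode_ELB_4XX_Count."""
    end = datetime.now(timezone.utc)
    return {
        "Namespace": "AWS/ApplicationELB",
        "MetricName": "HTTPCode_ELB_4XX_Count",
        "Dimensions": [{"Name": "LoadBalancer", "Value": load_balancer_dimension}],
        "StartTime": end - timedelta(minutes=minutes),
        "EndTime": end,
        "Period": 60,  # one datapoint per minute
        "Statistics": ["Sum"],
    }


def rule_event_query(event_name, minutes=15):
    """Keyword arguments for CloudTrail LookupEvents filtered by one API name."""
    end = datetime.now(timezone.utc)
    return {
        "LookupAttributes": [
            {"AttributeKey": "EventName", "AttributeValue": event_name}
        ],
        "StartTime": end - timedelta(minutes=minutes),
        "EndTime": end,
    }


# Hypothetical ALB dimension value in the "app/<name>/<id>" form.
metric_kwargs = elb_4xx_query("app/my-alb/0123456789abcdef")
trail_kwargs = rule_event_query("ModifyRule")

# With boto3 installed and credentials configured, the calls would look like:
#   boto3.client("cloudwatch").get_metric_statistics(**metric_kwargs)
#   boto3.client("cloudtrail").lookup_events(**trail_kwargs)
```

Running the CloudTrail query once for ModifyRule and once for DeleteRule around the time a service is removed should show how many calls the controller issued for that deletion.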

Expected outcome

Removing a service should never impact a different service. No other service should start receiving fixed 404 responses from the ALB because its FQDN is, for a very short period, not programmed on the ALB.

Environment

Additional Context: We have looked at setting alb.ingress.kubernetes.io/group.order; the behaviour is the same as described under root cause. group.order does not honour the exact number specified. Instead, the controller assigns sequential rule priority numbers that follow the relative order of the values specified.
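The group.order behaviour described here can be illustrated with a small sketch. It is a simplified model, not the controller's implementation: it assumes one rule per Ingress, an unset group.order treated as 0, and ties broken by the lexical order of namespace/name. The function name and the Ingress names are hypothetical.

```python
def assign_priorities(ingresses):
    """Rank Ingresses by (group.order, namespace/name) and number them 1..n.

    `ingresses` maps "namespace/name" to its group.order value, or None if unset.
    """
    ranked = sorted(ingresses.items(), key=lambda item: (item[1] or 0, item[0]))
    return {name: rank for rank, (name, _) in enumerate(ranked, start=1)}


group = {"prod/service-a": 10, "prod/service-b": 500, "prod/service-c": 20}
before = assign_priorities(group)
print(before)
# {'prod/service-a': 1, 'prod/service-c': 2, 'prod/service-b': 3}

# Removing the first Ingress renumbers every rule behind it.
del group["prod/service-a"]
after = assign_priorities(group)
print(after)
# {'prod/service-c': 1, 'prod/service-b': 2}
```

Under these assumptions the group.order values 10, 20, and 500 only decide the ordering; the priorities written to the listener are still 1, 2, 3, so deleting one Ingress shifts all later rules regardless of the gaps left between the order values.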

shraddhabang commented 3 weeks ago

Thank you for bringing this to our attention. I think we can definitely make some improvements to rule management here so that an update to one rule does not affect the others.