kubernetes-sigs / aws-load-balancer-controller

A Kubernetes controller for Elastic Load Balancers
https://kubernetes-sigs.github.io/aws-load-balancer-controller/
Apache License 2.0

AWS Load Balancer Controller keeps trying to reconcile the ALB with a non-existent WAF without alb.ingress.kubernetes.io/wafv2-acl-arn annotation on ingress resource #3482

Closed maiconrocha closed 5 months ago

maiconrocha commented 11 months ago

Describe the bug

AWS Load Balancer Controller keeps trying to reconcile the WAF ACL association for the ALB provisioned by the Ingress, even though the WAFv2 resources and WAFv2 web ACLs have been deleted and the annotation alb.ingress.kubernetes.io/wafv2-acl-arn has been removed from the Ingress resource.

Steps to reproduce

I am a CSE from AWS Support. Below I explain the problem the customer was having and the steps I used to try to reproduce it, so far without success.

The customer explained that the WAFv2 resources and WAFv2 web ACLs were deleted from their AWS accounts weeks ago, and that the Ingress resources in their EKS cluster previously had the following annotation:

# alb.ingress.kubernetes.io/wafv2-acl-arn: arn:aws:wafv2:us-east-1:XXXXXXX:regional/webacl/xxx-waf-xxxx/xxxxxxx-xxxx-xxxxx-xxxx-xxxxxxxxxx

After commenting out the annotation on their Ingress resources, the customer noticed the following error on the Ingress, which kept recurring:


Events:
  Type     Reason             Age                     From     Message
  ----     ------             ----                    ----     -------
  Warning  FailedDeployModel  7m39s (x665 over 6d9h)  ingress  Failed deploy model due to failed to create WAFv2 webACL association on LoadBalancer: WAFNonexistentItemException: AWS WAF couldn’t perform the operation because your resource doesn’t exist.

During our call we were not able to find the root cause, but we were able to find a mitigation, which was to upgrade AWS Load Balancer Controller to the latest version available: 2.6.2. Since v2.5.3, AWS Load Balancer Controller supports disabling reconciliation of the WAF ACL association for the ALB provisioned by the Ingress. Once it is disabled, the controller takes no action on the WAF add-ons of the provisioned ALBs.

We performed the following steps to upgrade AWS Load Balancer Controller to the latest version and add the flags that disable reconciliation of the WAF ACL association:

Step 1: Update AWSLoadBalancerControllerIAMPolicy to the latest version available: https://raw.githubusercontent.com/kubernetes-sigs/aws-load-balancer-controller/v2.6.2/docs/install/iam_policy.json

Step 2: Upgrade AWS Load Balancer Controller using the helm upgrade command:

helm upgrade aws-load-balancer-controller eks/aws-load-balancer-controller \
  -n kube-system \
  --set clusterName=my-cluster \
  --set serviceAccount.create=false \
  --set serviceAccount.name=aws-load-balancer-controller \
  --set region=region-code \
  --set vpcId=vpc-xxxxxxxx

Step 3: Add the flags below to the AWS Load Balancer Controller Deployment under spec.template.spec.containers.args:

  - --enable-waf=false
  - --enable-wafv2=false
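A minimal sketch of Step 3 as a JSON Patch, for anyone who prefers not to edit the Deployment by hand. It assumes the controller is the first container (index 0) in the Deployment and that the Deployment name and namespace match the Helm install above; `waf_disable_patch` is a hypothetical helper name, not part of any tool.

```shell
#!/usr/bin/env bash
# Print a JSON Patch that appends both WAF flags to the args of container 0.
# Assumption: the controller container is the first one in the pod spec.
waf_disable_patch() {
  cat <<'EOF'
[
  {"op": "add", "path": "/spec/template/spec/containers/0/args/-", "value": "--enable-waf=false"},
  {"op": "add", "path": "/spec/template/spec/containers/0/args/-", "value": "--enable-wafv2=false"}
]
EOF
}

# Usage (needs cluster access; names assume the Helm install shown above):
#   kubectl -n kube-system patch deployment aws-load-balancer-controller \
#     --type=json -p "$(waf_disable_patch)"
```

A manual patch like this is likely to be overwritten by the next helm upgrade. The chart appears to expose `enableWaf` and `enableWafv2` values for the same purpose, but verify that against the values.yaml of the chart version in use.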

Once the flags had been added, we confirmed that the Ingress resources showed that reconciliation succeeded:

ingress Successfully reconciled

However, when we tested removing the flags, the problem started again:

ingress Failed deploy model due to failed to create WAFv2 webACL association on LoadBalancer: WAFNonexistentItemException: AWS WAF couldn’t perform the operation because your resource doesn’t exist.

This happens even though the alb.ingress.kubernetes.io/wafv2-acl-arn annotation is no longer on the Ingress.
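One way to double-check that no Ingress in the cluster still carries a WAF annotation is to scan the output of kubectl for the annotation keys. This is only a sketch: `find_waf_annotations` is a hypothetical helper name, and `waf-acl-id` is included on the assumption that the WAF Classic annotation might also be in use. It does not by itself explain the behavior reported here.

```shell
#!/usr/bin/env bash
# Read Ingress YAML on stdin and print every line that sets a WAF-related
# ALB annotation. Prints nothing (and still exits 0) when none is found.
find_waf_annotations() {
  grep -E 'alb\.ingress\.kubernetes\.io/(wafv2-acl-arn|waf-acl-id)' || true
}

# Usage (needs cluster access):
#   kubectl get ingress --all-namespaces -o yaml | find_waf_annotations
```

Empty output means that none of the Ingress objects returned by the API server has either annotation set.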

Steps I tried to reproduce the issue:

Step 1: Install AWS Load Balancer Controller version 2.6.2

Step 2: Install a minimal Ingress with the annotation alb.ingress.kubernetes.io/wafv2-acl-arn pointing to a non-existent WAF

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: minimal-ingress
  namespace: fargate
  annotations:
    nginx.ingress.kubernetes.io/rewrite-target: /
    alb.ingress.kubernetes.io/wafv2-acl-arn: arn:aws:wafv2:ap-southeast-2:XXXXXXXXXX:regional/webacl/fake-waf/XXXXX-XXXXX-XXX-XXXX-XXXXX
spec:
  ingressClassName: alb
  rules:
  - http:
      paths:
      - path: /testpath
        pathType: Prefix
        backend:
          service:
            name: test
            port:
              number: 80

And I could see the error:

Warning FailedDeployModel 15s ingress Failed deploy model due to failed to create WAFv2 webACL association on LoadBalancer: WAFNonexistentItemException: AWS WAF couldn’t perform the operation because your resource doesn’t exist.

Step 3: Then I removed the annotation from the Ingress

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: minimal-ingress
  namespace: fargate
  annotations:
    nginx.ingress.kubernetes.io/rewrite-target: /
spec:
  ingressClassName: alb
  rules:
  - http:
      paths:
      - path: /testpath
        pathType: Prefix
        backend:
          service:
            name: test
            port:
              number: 80

And I could see the reconciliation succeed:

Normal SuccessfullyReconciled 3s (x2 over 6m21s) ingress Successfully reconciled

The customer runs the Ingress on Fargate. I tried the above in both a Fargate namespace and the default namespace and got the same results. The problem is not specific to a particular EKS cluster or AWS environment, given that the customer is facing it across their dev, stage and prod EKS clusters, which run in different AWS accounts.

I would appreciate it if anyone from the AWS Load Balancer Controller team could suggest why, in the customer's environment, the controller keeps trying to reconcile the ALB with a non-existent WAF even though the alb.ingress.kubernetes.io/wafv2-acl-arn annotation is no longer on the Ingress resource.

Expected outcome

Given that the annotation alb.ingress.kubernetes.io/wafv2-acl-arn is no longer on the Ingress resource, we would expect reconciliation to succeed without having to set the --enable-waf=false and --enable-wafv2=false flags in the AWS Load Balancer Controller Deployment under spec.template.spec.containers.args.

Environment

oliviassss commented 11 months ago

@maiconrocha, would you be able to share the logs from when the customer hit this issue on the older version? You can send them to me internally if you are not able to share them on GitHub. Thanks.

maiconrocha commented 11 months ago

Thanks @oliviassss. I don't have the full logs, so I asked the customer to monitor this thread and provide any further logs if required. Are you interested in the AWS Load Balancer Controller logs or something else? Please note that the customer also faces the issue on the latest version, 2.6.2, when the flags are removed:

ingress Failed deploy model due to failed to create WAFv2 webACL association on LoadBalancer: WAFNonexistentItemException: AWS WAF couldn’t perform the operation because your resource doesn’t exist.

Once the flags had been re-added, we confirmed that the Ingress resources showed that reconciliation succeeded:

ingress Successfully reconciled

oliviassss commented 11 months ago

@maiconrocha, I'll try to reproduce it. Yes, the controller logs will also help us investigate further. Thanks.
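For anyone collecting the requested controller logs, a small sketch that narrows them to WAF-related lines may help. It assumes the default install from the Helm command earlier in the thread (namespace kube-system, Deployment aws-load-balancer-controller); `filter_waf_logs` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Keep only log lines that mention WAF, case-insensitively.
# Exits 0 even when nothing matches, so it is safe inside pipelines.
filter_waf_logs() {
  grep -i 'waf' || true
}

# Usage (needs cluster access):
#   kubectl -n kube-system logs deployment/aws-load-balancer-controller --since=24h \
#     | filter_waf_logs
```

With more than one replica, `kubectl logs deployment/...` reads from a single pod, so it may be necessary to repeat the command per pod to capture the one that is actually reconciling.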

k8s-triage-robot commented 8 months ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot commented 6 months ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot commented 5 months ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

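This bot triages issues according to the following rules:

- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:

- Reopen this issue with `/reopen`
- Mark this issue as fresh with `/remove-lifecycle rotten`
- Offer to help out with Issue Triage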

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-ci-robot commented 5 months ago

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to [this](https://github.com/kubernetes-sigs/aws-load-balancer-controller/issues/3482#issuecomment-2081730327):

>The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
>
>This bot triages issues according to the following rules:
>- After 90d of inactivity, `lifecycle/stale` is applied
>- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
>- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed
>
>You can:
>- Reopen this issue with `/reopen`
>- Mark this issue as fresh with `/remove-lifecycle rotten`
>- Offer to help out with [Issue Triage][1]
>
>Please send feedback to sig-contributor-experience at [kubernetes/community](https://github.com/kubernetes/community).
>
>/close not-planned
>
>[1]: https://www.kubernetes.dev/docs/guide/issue-triage/

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.