kubernetes-sigs / aws-load-balancer-controller

A Kubernetes controller for Elastic Load Balancers
https://kubernetes-sigs.github.io/aws-load-balancer-controller/
Apache License 2.0

Error getting v2 to work after upgrading eks cluster to 1.20 #2208

Closed hetpats closed 2 years ago

hetpats commented 2 years ago

Describe the bug
Error from server (InternalError): error when creating "ingress.yaml": Internal error occurred: failed calling webhook "vingress.elbv2.k8s.aws": Post "https://aws-load-balancer-webhook-service.kube-system.svc:443/validate-networking-v1beta1-ingress?timeout=10s": context deadline exceeded

Environment:
Client Version: v1.19.6-eks-49a6c0
Server Version: v1.20.7-eks-d88609
aws-load-balancer-controller versions tried: 2.2.1 through 2.2.4
The IAM policy was created per the documentation and the CRDs were in place before deploying. Both YAML and Helm deployments were tried, with the same error.

URLs:
https://github.com/kubernetes-sigs/aws-load-balancer-controller
https://kubernetes-sigs.github.io/aws-load-balancer-controller/v2.2/deploy/installation/

We upgraded our EKS cluster from 1.17 to 1.20 and are in the process of upgrading the Kubernetes components on the cluster. We were using aws-load-balancer-controller version 1.1.9 and wanted to upgrade it to 2.2.4, so we followed the installation doc here: https://kubernetes-sigs.github.io/aws-load-balancer-controller/v2.2/deploy/installation/

Steps followed (see the sketch below):
1) Uninstalled the old ALB controller
2) Installed the CRDs and cert-manager
3) Installed the new ALB controller
4) Tried to create an ingress
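For reference, a rough sketch of steps 2 and 3 with kubectl and Helm, assuming the eks-charts repo and the default release/service-account names from the installation doc (the cluster name is a placeholder; cert-manager install omitted):

  # install the ALB controller CRDs
  kubectl apply -k "github.com/aws/eks-charts/stable/aws-load-balancer-controller//crds?ref=master"
  # install the v2 controller via Helm
  helm repo add eks https://aws.github.io/eks-charts
  helm upgrade -i aws-load-balancer-controller eks/aws-load-balancer-controller \
    -n kube-system \
    --set clusterName=<your-cluster-name> \
    --set serviceAccount.create=false \
    --set serviceAccount.name=aws-load-balancer-controller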

However, we now cannot create any ingress objects and get the error shown below:

Error from server (InternalError): error when creating "ingress.yaml": Internal error occurred: failed calling webhook "vingress.elbv2.k8s.aws": Post "https://aws-load-balancer-webhook-service.kube-system.svc:443/validate-networking-v1beta1-ingress?timeout=10s": context deadline exceeded
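As a first check (a sketch, using the webhook service and configuration names from the error message and the controller's usual labels, which may differ in a customized install), confirm the controller pods are running and the webhook service actually has endpoints:

  kubectl -n kube-system get pods -l app.kubernetes.io/name=aws-load-balancer-controller
  kubectl -n kube-system get endpoints aws-load-balancer-webhook-service
  kubectl get validatingwebhookconfigurations aws-load-balancer-webhook -o yaml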

The events on our existing ingresses on the cluster show this error:

Warning FailedAddFinalizer 12m (x21 over 74m) ingress Failed add finalizer due to Internal error occurred: failed calling webhook "vingress.elbv2.k8s.aws": Post "https://aws-load-balancer-webhook-service.kube-system.svc:443/validate-networking-v1beta1-ingress?timeout=10s": context deadline exceeded
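The event above shows up when describing the affected ingress, e.g. (namespace and name taken from the controller logs further below):

  kubectl -n monitoring describe ingress monitoring-ingress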

Logs from the ALB controller (line truncated in the original capture):
  {"level":"debug","ts":1630609624.7311962,"logger":"controller-runtime.manager.events","msg":"Warning","object":{"kind":"Ingress","namespace":"monitoring","name":"monitoring-ingress","uid":"ca008baa-bfcc-4a5a-bf37-2f2ee821bb24","apiVersion":"networking.k8s.io/

We verified that the certs in the validatingwebhookconfigurations.admissionregistration.k8s.io aws-load-balancer-webhook configuration and the aws-load-balancer-tls secret match, and that the cert is issued by Issuer: CN=aws-load-balancer-controller-ca. The ALB controller pods are in a Running state and the service is up. The security group is open for 443 and 4443 between the API server and the worker nodes. We verified that the certificate is valid and present in the secret. Both YAML and Helm installation were tried.
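A sketch of the cert check described above, assuming the default secret and webhook configuration names; it compares the serving cert stored in the secret with the CA bundle the API server uses to trust the webhook:

  # issuer and validity of the serving cert in the secret
  kubectl -n kube-system get secret aws-load-balancer-tls -o jsonpath='{.data.tls\.crt}' \
    | base64 -d | openssl x509 -noout -issuer -dates
  # CA registered on the validating webhook
  kubectl get validatingwebhookconfigurations aws-load-balancer-webhook \
    -o jsonpath='{.webhooks[0].clientConfig.caBundle}' \
    | base64 -d | openssl x509 -noout -subject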

Steps to reproduce:
Upgrade the cluster from 1.17 to 1.20 following the AWS documentation, then follow https://kubernetes-sigs.github.io/aws-load-balancer-controller/v2.2/deploy/installation/:
1) Uninstall the old ALB controller
2) Install the CRDs and cert-manager
3) Install the new ALB controller

Expected outcome: we should be able to deploy ingress objects without the validation webhook error.

Environment: UAT

Additional context: logs from the ALB controller are as follows (some lines truncated in the original capture):
  Controller","reconcilerGroup":"elbv2.k8s.aws","reconcilerKind":"TargetGroupBinding","controller":"targetGroupBinding"}
  {"level":"info","ts":1630609564.6970057,"logger":"controller","msg":"Starting Controller","controller":"ingress"}
  {"level":"info","ts":1630609564.6970384,"logger":"controller","msg":"Starting workers","controller":"ingress","worker count":3}
  {"level":"info","ts":1630609564.6970139,"logger":"controller","msg":"Starting workers","reconcilerGroup":"elbv2.k8s.aws","reconcilerKind":"TargetGroupBinding","controller":"targetGroupBinding","worker count":3}
  {"level":"error","ts":1630609584.7039437,"logger":"controller","msg":"Reconciler error","controller":"ingress","name":"monitoring-ingress","namespace":"monitoring","error":"Internal error occurred: failed calling webhook \"vingress.elbv2.k8s.aws\": Post \"htt
  {"level":"debug","ts":1630609584.704016,"logger":"controller-runtime.manager.events","msg":"Warning","object":{"kind":"Ingress","namespace":"monitoring","name":"monitoring-ingress","uid":"ca008baa-bfcc-4a5a-bf37-2f2ee821bb24","apiVersion":"networking.k8s.io/v
  {"level":"error","ts":1630609604.7151241,"logger":"controller","msg":"Reconciler error","controller":"ingress","name":"monitoring-ingress","namespace":"monitoring","error":"Internal error occurred: failed calling webhook \"vingress.elbv2.k8s.aws\": Post \"htt
  {"level":"debug","ts":1630609604.7151897,"logger":"controller-runtime.manager.events","msg":"Warning","object":{"kind":"Ingress","namespace":"monitoring","name":"monitoring-ingress","uid":"ca008baa-bfcc-4a5a-bf37-2f2ee821bb24","apiVersion":"networking.k8s.io/
  {"level":"error","ts":1630609624.73112,"logger":"controller","msg":"Reconciler error","controller":"ingress","name":"monitoring-ingress","namespace":"monitoring","error":"Internal error occurred: failed calling webhook \"vingress.elbv2.k8s.aws\": Post \"https
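The logs above can be collected with something like the following, assuming the default deployment name from the Helm chart:

  kubectl -n kube-system logs deployment/aws-load-balancer-controller --since=1h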

Logs from the API control plane:
  W0902 21:15:22.663603 1 dispatcher.go:134] Failed calling webhook, failing closed vingress.elbv2.k8s.aws: failed calling webhook "vingress.elbv2.k8s.aws": Post "https://aws-load-balancer-webhook-service.kube-system.svc:443/validate-networking-v1beta1-ingress?timeout=10s": context deadline exceeded
  I0902 21:15:22.665800 1 dialer.go:35] [96df57175088ac5: 10.212.128.231:9443] Dialing...
  I0902 21:15:32.665756 1 trace.go:205] Trace[287967607]: "Call validating webhook" configuration:aws-load-balancer-webhook,webhook:vingress.elbv2.k8s.aws,resource:networking.k8s.io/v1beta1, Resource=ingresses,subresource:,operation:UPDATE,UID:615e463a-ba2d-4280-add2-ef7ef30f6c97 (02-Sep-2021 21:15:22.665) (total time: 10000ms): Trace[287967607]: [10.000136234s] [10.000136234s] END
  I0902 21:15:32.665755 1 dialer.go:37] [96df57175088ac5: 10.212.128.231:9443] Dialed in 9.999951589s.
  W0902 21:15:32.665772 1 dispatcher.go:134] Failed calling webhook, failing closed vingress.elbv2.k8s.aws: failed calling webhook "vingress.elbv2.k8s.aws": Post "https://aws-load-balancer-webhook-service.kube-system.svc:443/validate-networking-v1beta1-ingress?timeout=10s": context deadline exceeded
  I0902 21:15:32.665817 1 trace.go:205] Trace[1405009613]: "GuaranteedUpdate etcd3" type:*networking.Ingress (02-Sep-2021 21:15:12.662) (total time: 20003ms): Trace[1405009613]: [20.003294535s] [20.003294535s] END
  I0902 21:15:32.666082 1 trace.go:205] Trace[316703916]: "Patch" url:/apis/networking.k8s.io/v1beta1/namespaces/monitoring/ingresses/monitoring-ingress,user-agent:controller/v0.0.0 (linux/amd64) kubernetes/$Format/leader-election,client:10.212.143.161 (02-Sep-2021 21:15:12.662) (total time: 20003ms): Trace[316703916]: ---"About to apply patch" 10001ms (21:15:00.664) Trace[316703916]: [20.003643114s] [20.003643114s] END
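The dialer lines show the API server timing out while connecting to 10.212.128.231 on port 9443, which should be one of the controller pods. A quick sketch to confirm which pod owns that IP (IP taken from the logs above):

  kubectl -n kube-system get pods -o wide | grep 10.212.128.231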
kishorj commented 2 years ago

@hetpats, Could you verify the following?

hoodieho commented 2 years ago

@hetpats I've had the same issue. Check whether the workers' security group allows connections from the cluster on the required port (9443 by default, or whichever port is configured in the webhook endpoint).
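A sketch of that check with the AWS CLI, using placeholder IDs for the cluster name and the worker-node security group:

  # security group the EKS control plane uses to reach the nodes
  aws eks describe-cluster --name <your-cluster-name> \
    --query "cluster.resourcesVpcConfig.clusterSecurityGroupId" --output text
  # inbound rules currently on the worker-node security group
  aws ec2 describe-security-groups --group-ids <worker-node-sg-id> \
    --query "SecurityGroups[0].IpPermissions"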

hetpats commented 2 years ago

@hetpats I've had the same issue. Check whether the workers' security group allows connections from the cluster on the required port (9443 by default, or whichever port is configured in the webhook endpoint).

@hoodieho That did it. We found the deny in the VPC flow logs, added a rule allowing 9443 from the cluster to the worker security group, and it worked! Thanks.
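For anyone hitting the same thing, a sketch of the rule that resolved it here, with placeholder security group IDs:

  aws ec2 authorize-security-group-ingress \
    --group-id <worker-node-sg-id> \
    --protocol tcp --port 9443 \
    --source-group <cluster-security-group-id>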