kubernetes-sigs / cluster-api-provider-aws

Kubernetes Cluster API Provider AWS provides consistent deployment and day 2 operations of "self-managed" and EKS Kubernetes clusters on AWS.
http://cluster-api-aws.sigs.k8s.io/
Apache License 2.0
626 stars 542 forks

Duplicate target groups and listeners created in the ELBv2 reconcile loop #5015

Closed by r4f4 2 weeks ago

r4f4 commented 2 weeks ago

/kind bug

What steps did you take and what happened:

With https://github.com/kubernetes-sigs/cluster-api-provider-aws/pull/5004, target groups and listeners are now reconciled in their own loop. The problem is that every time the reconcile loop runs, this check is never true, because desiredSpec contains newly generated target group (TG) names on every iteration and therefore never matches what was created in the previous one.
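To illustrate the failure mode, here is a minimal Go sketch. `TargetGroupSpec`, `desiredSpec`, and `equalIgnoringName` are hypothetical names made up for this example; they are not the provider's actual types or its actual check. The two name suffixes are the ones that appear in the log excerpt in this issue.

```go
package main

import "fmt"

// TargetGroupSpec is a hypothetical, simplified stand-in for a target group
// spec. Only the fields needed for the illustration are kept.
type TargetGroupSpec struct {
	Name     string
	Port     int64
	Protocol string
}

// desiredSpec builds the desired target group. The suffix stands for the
// value that is regenerated on every reconcile.
func desiredSpec(suffix string) TargetGroupSpec {
	return TargetGroupSpec{
		Name:     "apiserver-target-" + suffix,
		Port:     6443,
		Protocol: "TCP",
	}
}

// equalIgnoringName compares only the fields that stay stable across
// reconciles (an assumption about which fields matter).
func equalIgnoringName(a, b TargetGroupSpec) bool {
	return a.Port == b.Port && a.Protocol == b.Protocol
}

func main() {
	existing := desiredSpec("wvkc5") // created by a previous reconcile
	desired := desiredSpec("ncpwh")  // regenerated by the current reconcile

	// A whole-struct comparison fails only because the names differ.
	fmt.Println("whole spec equal:", existing == desired)
	// Ignoring the generated name, the two specs describe the same thing.
	fmt.Println("equal ignoring name:", equalIgnoringName(existing, desired))
}
```

Any comparison that includes the generated name will report a difference on every pass, which matches the repeated "creating target group" entries in the logs.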

There is a chance the e2e tests do not create v2 ELBs, so this code path was never exercised.

What did you expect to happen:

The reconcile loop should be able to identify when it is done, and it should not try to create duplicate target groups and listeners.
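A rough sketch of that expected behavior, assuming listeners can be matched by port (the `DuplicateListener` error in the logs suggests a port is unique per load balancer). `ListenerSpec`, `fakeLB`, and `reconcileListeners` are hypothetical names and an in-memory stand-in, not the provider's code:

```go
package main

import "fmt"

// ListenerSpec is a hypothetical, simplified desired listener.
type ListenerSpec struct {
	Port     int64
	Protocol string
}

// fakeLB is an in-memory stand-in for the load balancer state held by AWS.
type fakeLB struct {
	listeners map[int64]ListenerSpec // existing listeners, keyed by port
	creates   int                    // number of create calls issued
}

// reconcileListeners creates only the listeners that do not exist yet, so
// running it again with the same desired state issues no further creates.
func reconcileListeners(lb *fakeLB, desired []ListenerSpec) {
	for _, d := range desired {
		if _, ok := lb.listeners[d.Port]; ok {
			continue // already present on this port, nothing to do
		}
		lb.listeners[d.Port] = d
		lb.creates++
	}
}

func main() {
	lb := &fakeLB{listeners: map[int64]ListenerSpec{}}
	desired := []ListenerSpec{
		{Port: 6443, Protocol: "TCP"},
		{Port: 22623, Protocol: "TCP"},
	}

	reconcileListeners(lb, desired) // first pass creates both listeners
	reconcileListeners(lb, desired) // second pass is a no-op

	fmt.Println("create calls:", lb.creates) // prints "create calls: 2"
}
```

Matching on a stable key such as the port, rather than on a generated name, is what lets the second pass conclude there is nothing left to create.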

Anything else you would like to add:

Here is an excerpt of the logs from a failed run in OpenShift:

time="2024-06-12T07:44:37Z" level=debug msg="I0612 07:44:37.306366   13667 loadbalancer.go:60] \"Reconciling load balancers\" controller=\"awscluster\" controllerGroup=\"infrastructure.cluster.x-k8s.io\" controllerKind=\"AWSCluster\" AWSCluster=\"openshift-cluster-api-guests/ci-op-l4w992fp-ff55c-th7fz\" namespace=\"openshift-cluster-api-guests\" name=\"ci-op-l4w992fp-ff55c-th7fz\" reconcileID=\"834fa2e6-886b-45f1-b0b0-6d4a71b2c43d\" cluster=\"openshift-cluster-api-guests/ci-op-l4w992fp-ff55c-th7fz\""
[...]
time="2024-06-12T07:44:38Z" level=debug msg="I0612 07:44:38.740417   13667 loadbalancer.go:1673] \"creating target group\" controller=\"awscluster\" controllerGroup=\"infrastructure.cluster.x-k8s.io\" controllerKind=\"AWSCluster\" AWSCluster=\"openshift-cluster-api-guests/ci-op-l4w992fp-ff55c-th7fz\" namespace=\"openshift-cluster-api-guests\" name=\"ci-op-l4w992fp-ff55c-th7fz\" reconcileID=\"834fa2e6-886b-45f1-b0b0-6d4a71b2c43d\" cluster=\"openshift-cluster-api-guests/ci-op-l4w992fp-ff55c-th7fz\" group=<"
time="2024-06-12T07:44:38Z" level=debug msg="\t{"...
time="2024-06-12T07:44:38Z" level=debug msg="\t}"
time="2024-06-12T07:44:38Z" level=debug msg=" > listener={\"protocol\":\"TCP\",\"port\":6443,\"targetGroup\":{\"name\":\"apiserver-target-wvkc5\",\"port\":6443,\"protocol\":\"TCP\",\"vpcId\":\"vpc-0376cc250de43b527\",\"targetGroupHealthCheck\":{\"protocol\":\"HTTPS\",\"path\":\"/readyz\",\"port\":\"6443\",\"intervalSeconds\":10,\"timeoutSeconds\":10,\"thresholdCount\":2,\"unhealthyThresholdCount\":2}}}"
time="2024-06-12T07:44:47Z" level=debug msg="I0612 07:44:47.211986   13667 loadbalancer.go:1673] \"creating target group\" controller=\"awscluster\" controllerGroup=\"infrastructure.cluster.x-k8s.io\" controllerKind=\"AWSCluster\" AWSCluster=\"openshift-cluster-api-guests/ci-op-l4w992fp-ff55c-th7fz\" namespace=\"openshift-cluster-api-guests\" name=\"ci-op-l4w992fp-ff55c-th7fz\" reconcileID=\"834fa2e6-886b-45f1-b0b0-6d4a71b2c43d\" cluster=\"openshift-cluster-api-guests/ci-op-l4w992fp-ff55c-th7fz\" group=<"
time="2024-06-12T07:44:47Z" level=debug msg="\t{"...
time="2024-06-12T07:44:47Z" level=debug msg=" > listener={\"protocol\":\"TCP\",\"port\":22623,\"targetGroup\":{\"name\":\"additional-listener-thtx2\",\"port\":22623,\"protocol\":\"TCP\",\"vpcId\":\"vpc-0376cc250de43b527\",\"targetGroupHealthCheck\":{\"protocol\":\"HTTPS\",\"path\":\"/healthz\",\"port\":\"22623\",\"intervalSeconds\":10,\"timeoutSeconds\":10,\"thresholdCount\":2,\"unhealthyThresholdCount\":2}}}"
[...]
time="2024-06-12T07:44:52Z" level=debug msg="I0612 07:44:52.239773   13667 loadbalancer.go:60] \"Reconciling load balancers\" controller=\"awscluster\" controllerGroup=\"infrastructure.cluster.x-k8s.io\" controllerKind=\"AWSCluster\" AWSCluster=\"openshift-cluster-api-guests/ci-op-l4w992fp-ff55c-th7fz\" namespace=\"openshift-cluster-api-guests\" name=\"ci-op-l4w992fp-ff55c-th7fz\" reconcileID=\"c98857e8-2e46-48a9-a380-2316f3d5431d\" cluster=\"openshift-cluster-api-guests/ci-op-l4w992fp-ff55c-th7fz\""
time="2024-06-12T07:44:52Z" level=debug msg="I0612 07:44:52.932852   13667 loadbalancer.go:1673] \"creating target group\" controller=\"awscluster\" controllerGroup=\"infrastructure.cluster.x-k8s.io\" controllerKind=\"AWSCluster\" AWSCluster=\"openshift-cluster-api-guests/ci-op-l4w992fp-ff55c-th7fz\" namespace=\"openshift-cluster-api-guests\" name=\"ci-op-l4w992fp-ff55c-th7fz\" reconcileID=\"c98857e8-2e46-48a9-a380-2316f3d5431d\" cluster=\"openshift-cluster-api-guests/ci-op-l4w992fp-ff55c-th7fz\" group=<"
time="2024-06-12T07:44:52Z" level=debug msg="\t{"...
time="2024-06-12T07:44:52Z" level=debug msg=" > listener={\"protocol\":\"TCP\",\"port\":6443,\"targetGroup\":{\"name\":\"apiserver-target-ncpwh\",\"port\":6443,\"protocol\":\"TCP\",\"vpcId\":\"vpc-0376cc250de43b527\",\"targetGroupHealthCheck\":{\"protocol\":\"HTTPS\",\"path\":\"/readyz\",\"port\":\"6443\",\"intervalSeconds\":10,\"timeoutSeconds\":10,\"thresholdCount\":2,\"unhealthyThresholdCount\":2}}}"
time="2024-06-12T07:44:54Z" level=debug msg="I0612 07:44:54.481113   13667 loadbalancer.go:1673] \"creating target group\" controller=\"awscluster\" controllerGroup=\"infrastructure.cluster.x-k8s.io\" controllerKind=\"AWSCluster\" AWSCluster=\"openshift-cluster-api-guests/ci-op-l4w992fp-ff55c-th7fz\" namespace=\"openshift-cluster-api-guests\" name=\"ci-op-l4w992fp-ff55c-th7fz\" reconcileID=\"c98857e8-2e46-48a9-a380-2316f3d5431d\" cluster=\"openshift-cluster-api-guests/ci-op-l4w992fp-ff55c-th7fz\" group=<"
time="2024-06-12T07:44:54Z" level=debug msg="\t{"...
time="2024-06-12T07:44:54Z" level=debug msg=" > listener={\"protocol\":\"TCP\",\"port\":6443,\"targetGroup\":{\"name\":\"apiserver-target-bqtkq\",\"port\":6443,\"protocol\":\"TCP\",\"vpcId\":\"vpc-0376cc250de43b527\",\"targetGroupHealthCheck\":{\"protocol\":\"HTTPS\",\"path\":\"/readyz\",\"port\":\"6443\",\"intervalSeconds\":10,\"timeoutSeconds\":10,\"thresholdCount\":2,\"unhealthyThresholdCount\":2}}}"
time="2024-06-12T07:44:55Z" level=debug msg="E0612 07:44:55.159414   13667 awscluster_controller.go:280] \"failed to reconcile load balancer\" err=<"
time="2024-06-12T07:44:55Z" level=debug msg="\t[failed to create target groups/listeners for load balancer \"ci-op-l4w992fp-ff55c-th7fz-int\": failed to create listener: DuplicateListener: A listener already exists on this port for this load balancer 'arn:aws:elasticloadbalancing:us-west-1:XXXXXXXXXXXX:loadbalancer/net/ci-op-l4w992fp-ff55c-th7fz-int/c1f690c5b3971d60'"

Environment:

nrb commented 2 weeks ago

/triage accepted

mtulio commented 2 weeks ago

Hey @nrb @damdo, there is a hypothetical leak situation that we (@r4f4 and I) caught while reviewing https://github.com/kubernetes-sigs/cluster-api-provider-aws/pull/5017. I will describe it here to document it:

It is unrelated to PR #5017, but it may be related to this issue, where target groups and listeners are now reconciled in their own loop ( https://github.com/kubernetes-sigs/cluster-api-provider-aws/pull/5004 ).

I was thinking about how to reproduce the leak with a server-side failure on the CreateListener API call, but I can't see many options with gomock, and AWS FIS does not support actions against the ELB service API.
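One possible way to exercise this without gomock or AWS FIS is a small hand-written fake that injects the failure. The sketch below is only an illustration under assumptions: the thread does not spell out the exact leak, so it assumes the sequence "create target group, then create listener", where a failed listener call leaves the target group behind. `elbAPI`, `fakeELB`, `errInjected`, and `createTargetGroupAndListener` are hypothetical names, and the interface is deliberately not the AWS SDK's API.

```go
package main

import (
	"errors"
	"fmt"
)

// errInjected simulates a server-side failure on the listener call.
var errInjected = errors.New("injected server-side failure")

// elbAPI is a hypothetical, narrowed interface for this sketch only.
type elbAPI interface {
	CreateTargetGroup(name string) (string, error)
	CreateListener(targetGroupARN string, port int64) error
}

// fakeELB records created target groups and returns listenerErr from
// CreateListener when it is set.
type fakeELB struct {
	targetGroups []string
	listenerErr  error
}

func (f *fakeELB) CreateTargetGroup(name string) (string, error) {
	arn := "arn:fake:targetgroup/" + name // made-up identifier
	f.targetGroups = append(f.targetGroups, arn)
	return arn, nil
}

func (f *fakeELB) CreateListener(targetGroupARN string, port int64) error {
	return f.listenerErr
}

// createTargetGroupAndListener mirrors the assumed two-step sequence with no
// cleanup on failure, which is what makes the leak observable.
func createTargetGroupAndListener(api elbAPI, name string, port int64) error {
	arn, err := api.CreateTargetGroup(name)
	if err != nil {
		return fmt.Errorf("failed to create target group: %w", err)
	}
	if err := api.CreateListener(arn, port); err != nil {
		return fmt.Errorf("failed to create listener: %w", err)
	}
	return nil
}

func main() {
	fake := &fakeELB{listenerErr: errInjected}
	err := createTargetGroupAndListener(fake, "apiserver-target-wvkc5", 6443)
	fmt.Println("error:", err)
	fmt.Println("target groups left behind:", len(fake.targetGroups))
}
```

A test built this way can assert on the number of target groups remaining after the injected failure, which gives a reproducible check for the suspected leak.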

Let me know if it belongs here or if you would prefer that I open a new issue. Thanks

damdo commented 2 weeks ago

Hey @mtulio, if we don't plan to fix it via #5017, let's create a new issue and track it there. Thanks!