Load Balancer creation is not re-entrant, ELB Listeners (+ target group) failure is never retried

JoelSpeed commented 4 weeks ago

/kind bug

What steps did you take and what happened:

We were creating a new cluster and the AWSCluster resource reconciliation hit an error. In particular, we hit a rate limit which meant that a CreateTargetGroup call failed.

The AWSCluster is then requeued, and the LB is reconciled again.

Because the LB already exists, the describeLB call returns the load balancer and we fall into the update logic for the managed load balancer.

However, this logic does not include any management of the target groups or listeners, which leaves the load balancer in a pretty useless state with no listeners.

What did you expect to happen:

When load balancer creation fails partway, the next reconcile should complete any incomplete steps. In particular, it should check that the target group and listener specs are accurate.

Anything else you would like to add: [Miscellaneous information that will assist in solving the issue.]

Environment:

Cluster-api-provider-aws version: 2.5.0
Kubernetes version: (use kubectl version): 1.30
OS (e.g. from /etc/os-release):

nrb commented 4 weeks ago

/triage accepted

nrb commented 4 weeks ago

This behavior needs to be fixed regardless, but I'm curious - did the account have a particularly low rate limit? It seems unusual to me that we'd hit a rate limit during cluster creation.

JoelSpeed commented 4 weeks ago

The account has a higher than normal rate limit, but the account also creates/destroys 1000s of clusters a week as part of CI pipelines, so small changes to the numbers of API calls amplify pretty quickly in this environment.

nrb commented 4 weeks ago

/priority critical-urgent
/assign

kubernetes-sigs / cluster-api-provider-aws

Load Balancer creation is not re-entrant, ELB Listeners (+ target group) failure is never retried #5002