kubernetes-sigs / aws-load-balancer-controller

A Kubernetes controller for Elastic Load Balancers
https://kubernetes-sigs.github.io/aws-load-balancer-controller/
Apache License 2.0
3.88k stars 1.44k forks source link

Unexpected Load Balancer Deletion with delete_protection=true When Using IngressGroup in AWS Load Balancer Controller #3817

Open sergeylanzman opened 3 weeks ago

sergeylanzman commented 3 weeks ago

Describe the bug We are using the AWS Load Balancer Controller with a single Ingress for multiple requests. We decided to separate these into multiple Ingresses using IngressGroup. All our load balancers are marked with the annotation delete_protection=true. However, after adding the groupname annotation to the Ingress, the controller unexpectedly deleted the load balancer and created a new one. This behavior is unexpected since delete_protection=true was set.

If I attempt to delete the Ingress, the load balancer is not deleted, but when changing annotations, the controller attempts to delete the load balancer, resulting in an OperationNotPermittedException error. After this error, the controller disables delete protection and deletes the load balancer again.

This issue is critical as it affects production environments by causing downtime due to the need to update DNS, re-register all targets, and more. The issue seems to originate from the code in load_balancer_synthesizer.go. Specifically, this section appears to be responsible for the unexpected behavior, and it might be necessary to remove or modify this logic for prevent downtime and issue from customers.

Steps to reproduce

Expected outcome When delete_protection=true is set, the controller should not delete the load balancer regardless of changes to annotations. Additionally, the controller should not disable delete_protection to delete the load balancer.

Environment AWS Load Balancer controller version: v2.8.2 Kubernetes version: 1.28 Using EKS: yes 1.28 Additional Context Adding an IngressGroupName should not cause downtime. However, currently, to add an IngressGroupName, it seems the only option is to create a new Ingress and load balancer, then switch traffic, which is complicated and introduces downtime. It might be beneficial to consider a way to set a default IngressGroupName or improve the process to avoid downtime.

Relevant Issues

Issue #1: https://github.com/kubernetes-sigs/aws-load-balancer-controller/issues/2271 Issue #2: https://github.com/kubernetes-sigs/aws-load-balancer-controller/issues/3034

shraddhabang commented 3 weeks ago

Hello @sergeylanzman , as documented here, the ALB for an IngressGroup is found by searching for an AWS tag ingress.k8s.aws/stack tag with the name of the IngressGroup as its value. So when you convert your ingress to use ingressGroup the required tag is not found and hence it creates a new ALB with required tag and deletes the old one. If you want to use the existing ALB only to avoid the downtime, you can just change this ingress.k8s.aws/stack to you new group name and apply the annotation change of group.name. This way the old ALB will not be deleted and there will be no downtime. I tried this on my setup can you give it a try?

sergeylanzman commented 3 weeks ago

@shraddhabang I understand why the ALB was recreated due to the missing ingress.k8s.aws/stack tag when converting to an IngressGroup. However, from my point of view, relying on manual tag updates in a production environment feels risky and somewhat hacky, especially as it's not clearly documented.

I believe that if a resource is flagged with delete_protection=true, the controller—or any other tool—should strictly adhere to that flag and not delete the resource unless the flag is manually unset. This behavior is consistent across other AWS services with delete_protection (e.g., RDS, DynamoDB, Shield, etc.). I am not aware of any AWS tools (Terraform, AWS CLI, AWS CDK, AWS SDK, etc.) that bypass this protection without manual intervention, including other AWS controllers from aws-controllers-k8s.

In my opinion, the current behavior makes the delete_protection flag effectively unusable in scenarios where dynamic changes are required. It would be more reliable and easy if the controller could automatically update the necessary tags or, at the very least, respect the delete_protection flag more strictly to prevent unintentional deletions.