Open somdoron opened 1 month ago
Thanks for bringing this to our attention -- we will look deeper into this issue
@somdoron, we had a fix on retrying failed SG rules during TGB reconciliation, and maybe we can do a similar enhancement on the SG IDs https://github.com/kubernetes-sigs/aws-load-balancer-controller/pull/3296 But it will be a bit tricky, if the TGB continues reconciliation with ignoring invalid SG IDs, it may raise some security concerns.
thanks @oliviassss. I think any TGB with invalid SG ID need to be skipped from from the list of valid ports. So I don't think that would create a security concern, because the worst case scenario, that specific TGB won't have access, instead of all of the TGB.
Describe the bug We accidentally configured a TargetGroupBinding with an invalid security group ID. The controller proceeded to configure the target group via the AWS Load Balancer (ALB) without failing or checking whether the security group existed.
Subsequently, the AWS Load Balancer controller attempted to authorize the rule but failed silently, without any log messages. The only indication of the issue was the controller repeatedly trying to create the rule. We identified the failure by reviewing AWS CloudTrail.
A day later, one of our nodes was replaced by Karpenter, which caused the port range to change. The controller successfully issued the revoke command and immediately tried to recreate the rule. However, it failed due to the invalid security group ID. Once again, the failure was silent, with no log messages recorded. We discovered this issue through AWS CloudTrail.
At that point, all load balancers configured by the controller were not functioning.
Steps to reproduce
Expected outcome
Environment
Additional Context: Probably related issue: https://github.com/kubernetes-sigs/aws-load-balancer-controller/issues/3639