kubernetes-sigs / aws-load-balancer-controller

A Kubernetes controller for Elastic Load Balancers
https://kubernetes-sigs.github.io/aws-load-balancer-controller/
Apache License 2.0
3.88k stars 1.44k forks source link

TargetGroupBinding doesn't fail when used with invalid security group #3809

Open somdoron opened 1 month ago

somdoron commented 1 month ago

Describe the bug We accidentally configured a TargetGroupBinding with an invalid security group ID. The controller proceeded to configure the target group via the AWS Load Balancer (ALB) without failing or checking whether the security group existed.

Subsequently, the AWS Load Balancer controller attempted to authorize the rule but failed silently, without any log messages. The only indication of the issue was the controller repeatedly trying to create the rule. We identified the failure by reviewing AWS CloudTrail.

A day later, one of our nodes was replaced by Karpenter, which caused the port range to change. The controller successfully issued the revoke command and immediately tried to recreate the rule. However, it failed due to the invalid security group ID. Once again, the failure was silent, with no log messages recorded. We discovered this issue through AWS CloudTrail.

At that point, all load balancers configured by the controller were not functioning.

Steps to reproduce

  1. Create a TargetGroupBinding with a valid security group ID and port 80.
  2. Create a TargetGroupBinding with a valid security group ID and port 9000.
  3. Create a TargetGroupBinding with an invalid security group ID and port 9003.
  4. Delete the TargetGroupBinding with port 9000.
  5. The first valid TargetGroupBinding on port 80 is now unavailable.

Expected outcome

  1. TargetGroupBinding should fail if the security group ID is invalid, and a log message should be recorded.
  2. An error-level log message should be recorded when revoking or authorizing a security group rule fails.
  3. The revoke and authorize operations should first perform a dry run, and only if successful should they proceed to apply the changes.

Environment

Additional Context: Probably related issue: https://github.com/kubernetes-sigs/aws-load-balancer-controller/issues/3639

ChuksGrinage commented 1 month ago

Thanks for bringing this to our attention -- we will look deeper into this issue

oliviassss commented 1 month ago

@somdoron, we had a fix on retrying failed SG rules during TGB reconciliation, and maybe we can do a similar enhancement on the SG IDs https://github.com/kubernetes-sigs/aws-load-balancer-controller/pull/3296 But it will be a bit tricky, if the TGB continues reconciliation with ignoring invalid SG IDs, it may raise some security concerns.

somdoron commented 1 month ago

thanks @oliviassss. I think any TGB with invalid SG ID need to be skipped from from the list of valid ports. So I don't think that would create a security concern, because the worst case scenario, that specific TGB won't have access, instead of all of the TGB.