kubernetes-sigs / aws-load-balancer-controller

A Kubernetes controller for Elastic Load Balancers
https://kubernetes-sigs.github.io/aws-load-balancer-controller/
Apache License 2.0

Non-Transactional RevokeSGIngress Call Risks Full Cluster Connectivity Loss #3822

Open asaf400 opened 2 months ago

asaf400 commented 2 months ago

Describe the bug

https://github.com/kubernetes-sigs/aws-load-balancer-controller/blob/e1d32f4f83eebc2d7bc61f600dc5e22770102e3e/pkg/networking/security_group_reconciler.go#L104-L114

During reconciliation of the security group ingress rules, specifically in the reconcileIngressWithSGInfo function, the controller can successfully revoke the cluster traffic rule via RevokeSGIngress, but the subsequent AuthorizeSGIngress call (issued milliseconds later**) can fail if the cluster includes a manually created TargetGroupBinding with a misconfiguration, for example:

...
spec:
  networking:
    ingress:
      - from:
          - securityGroup:
              groupID: sg-defacedeface123456
...

When the user-specified groupID does not exist, the parameter roll-up (a single batched call for all permissionsToGrant) causes the entire AuthorizeSGIngress call to fail. The cluster is then left without any ingress rule on the backend traffic security group, and all Load Balancers fail their health checks and/or cannot reach pods (VPC CNI).
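To make the ordering concrete, here is a simplified sketch of the revoke-then-authorize flow as I understand it from the linked code, written directly against the AWS SDK for Go v2; the function name and parameters are stand-ins, not the controller's actual code:

```go
package sketch

import (
	"context"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/service/ec2"
	"github.com/aws/aws-sdk-go-v2/service/ec2/types"
)

// Simplified illustration of the failure mode described above.
// NOT the controller's actual code; names are stand-ins.
func reconcileIngressSketch(ctx context.Context, client *ec2.Client, sgID string,
	permissionsToRevoke, permissionsToGrant []types.IpPermission) error {

	// Step 1: revoke the existing backend-traffic rules. This call succeeds.
	if len(permissionsToRevoke) > 0 {
		_, err := client.RevokeSecurityGroupIngress(ctx, &ec2.RevokeSecurityGroupIngressInput{
			GroupId:       aws.String(sgID),
			IpPermissions: permissionsToRevoke,
		})
		if err != nil {
			return err
		}
	}

	// Step 2: authorize the new rules in one batched call. If any single entry
	// references a non-existent group (e.g. sg-defacedeface123456 from the
	// misconfigured TargetGroupBinding above), the whole batch fails, and
	// nothing restores the rules that were already revoked in step 1.
	if len(permissionsToGrant) > 0 {
		_, err := client.AuthorizeSecurityGroupIngress(ctx, &ec2.AuthorizeSecurityGroupIngressInput{
			GroupId:       aws.String(sgID),
			IpPermissions: permissionsToGrant,
		})
		if err != nil {
			return err // cluster left with no backend ingress rule
		}
	}
	return nil
}
```

With this ordering, any failure of the authorize step (or anything between the two calls) leaves the security group with fewer rules than it had before the reconcile started.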

Expected outcome

Even when the controller is not running with --disable-restricted-sg-rules={{ .Values.disableRestrictedSecurityGroupRules }}, this error state should be short-circuited before a cluster networking disaster occurs.

This can be achieved by any or all of these options:

- The controller should use the DryRun parameter on the AuthorizeSGIngress API call before revoking the rule (API-Doc).
- The entire logic flow of the reconcileIngressWithSGInfo function should be 'transactional', meaning that upon a failure within the context, the controller should have a recovery method to restore the state that existed before the error.

Immediate resolution

For immediate resolution of the connectivity issue, restarting the LBC pods is possible and worked in my case. I didn't dive any deeper into the code, but from the cause and effect it seems the controller's bootstrapping successfully restores the security group ingress rule, so ingress works again (temporarily), even though the invalid TGB is still present and has not been deleted.

** Additionally, it should be noted that as a result of this logic flow there is a window, maybe a few milliseconds, maybe seconds, during which the existing rules have already been revoked but the new rules are not yet authorized. Even when the subsequent authorize succeeds, it is a separate API call issued after the revoke (unless the SDK itself batches them), and the AWS EC2 API is not nanosecond-fast, so the cluster remains disconnected for that window.

Environment

Additional context

Cluster connectivity is not affected by the creation of the faulty TargetGroupBinding; moreover, the binding itself is functional and works as expected (as long as its ports are already authorized, reusing rules from other cluster services in other namespaces).

The cluster (ingress) only gets disconnected from AWS networking when a reconcile cycle is triggered by the controller. In my case, three days passed between the TargetGroupBinding creation and the complete cluster downtime. The reconcile was triggered when the pod with the highest port in the cluster (alertmanager, port 9093) was restarted and briefly had no healthy target; the reconcile then attempted to change the ingress rule from port range 80-9093 to 80-8088.

Edit: Typo + Added 'Immediate resolution section'

asaf400 commented 2 months ago

ping @M00nF1sh @kishorj @oliviassss - If you are able, remove the user @xiaobai-marker, this is clearly phishing 🎣

zac-nixon commented 1 month ago

Hi!

Yes, I agree that this is a bug. I've consulted with the EC2 folks that own these SG modification APIs and they currently do not provide the needed APIs to implement a transactional experience. They agreed to take on a feature request, with no tentative timeline for implementation :(.

The suggestions you provide are good but not really feasible to implement:

The controller should use the DryRun parameter on the AuthorizeSGIngress API call before revoking the rule (API-Doc).

This approach fails when we first need to remove the existing rules before adding new ones, due to the limit on the number of rules per SG. We could add a more complex flow of add first, and on a rule-limit failure remove and then add again, but that doesn't solve the underlying problem.
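For reference, that add-first flow would look roughly like the sketch below (authorize/revoke are hypothetical helpers wrapping the EC2 calls, and RulesPerSecurityGroupLimitExceeded is my assumption for the rule-limit error code); as noted, it only moves the unprotected window around:

```go
package sketch

import (
	"context"
	"errors"

	"github.com/aws/aws-sdk-go-v2/service/ec2/types"
	"github.com/aws/smithy-go"
)

// Rough sketch only. authorize/revoke are hypothetical helpers wrapping
// AuthorizeSecurityGroupIngress / RevokeSecurityGroupIngress.
func authorizeFirstSketch(ctx context.Context, sgID string,
	toGrant, toRevoke []types.IpPermission,
	authorize, revoke func(context.Context, string, []types.IpPermission) error) error {

	err := authorize(ctx, sgID, toGrant)
	if err == nil {
		// New rules are in place; it is now safe to drop the old ones.
		return revoke(ctx, sgID, toRevoke)
	}

	var apiErr smithy.APIError
	if errors.As(err, &apiErr) && apiErr.ErrorCode() == "RulesPerSecurityGroupLimitExceeded" {
		// No headroom under the per-SG rule limit: fall back to remove-then-add,
		// which reopens the same unprotected window described in the issue.
		if err := revoke(ctx, sgID, toRevoke); err != nil {
			return err
		}
		return authorize(ctx, sgID, toGrant)
	}
	return err
}
```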

The entire logic flow of the reconcileIngressWithSGInfo function should be 'transactional', meaning that upon a failure within the context, the controller should have a recovery method to restore the state that existed before the error.

We don't currently have a good stateful mechanism to track this.
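To illustrate why in-memory rollback alone isn't enough: a best-effort compensating action could re-authorize whatever was just revoked when the grant fails, but the revoked set only lives in the reconciler's memory, so a crash or restart between the two calls still loses it. A rough sketch, again with hypothetical authorize/revoke helpers:

```go
package sketch

import (
	"context"
	"errors"

	"github.com/aws/aws-sdk-go-v2/service/ec2/types"
)

// Illustrative only: best-effort rollback of a failed grant by re-authorizing
// the permissions revoked earlier in the same reconcile. The revoked set lives
// only in memory, so a controller crash mid-reconcile still leaves the cluster
// without its backend ingress rule.
func reconcileWithRollbackSketch(ctx context.Context, sgID string,
	toRevoke, toGrant []types.IpPermission,
	authorize, revoke func(context.Context, string, []types.IpPermission) error) error {

	if err := revoke(ctx, sgID, toRevoke); err != nil {
		return err
	}
	if err := authorize(ctx, sgID, toGrant); err != nil {
		// Compensating action: try to put back what was just removed.
		if rbErr := authorize(ctx, sgID, toRevoke); rbErr != nil {
			return errors.Join(err, rbErr)
		}
		return err
	}
	return nil
}
```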

I think a good short-term solution is to keep a 'validated' SG cache, where we first ensure that the referenced SG exists before trying to use it in rules.
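Something along these lines is what I have in mind; a minimal sketch only, calling DescribeSecurityGroups directly, with names that are not the controller's actual resolver or cache:

```go
package sketch

import (
	"context"
	"fmt"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/service/ec2"
)

// Illustrative sketch of a "validated SG" pre-check: before building the
// authorize batch, confirm every referenced security group exists and cache
// positive results, so a typo'd groupID in a TargetGroupBinding fails the
// reconcile before any existing rule has been revoked.
type sgValidator struct {
	ec2Client *ec2.Client
	known     map[string]bool // groupID -> confirmed to exist
}

func (v *sgValidator) validate(ctx context.Context, groupIDs []string) error {
	var unknown []string
	for _, id := range groupIDs {
		if !v.known[id] {
			unknown = append(unknown, id)
		}
	}
	if len(unknown) == 0 {
		return nil
	}
	out, err := v.ec2Client.DescribeSecurityGroups(ctx, &ec2.DescribeSecurityGroupsInput{
		GroupIds: unknown,
	})
	if err != nil {
		// e.g. InvalidGroup.NotFound when a referenced group does not exist.
		return fmt.Errorf("referenced security group(s) failed validation: %w", err)
	}
	for _, sg := range out.SecurityGroups {
		v.known[aws.ToString(sg.GroupId)] = true
	}
	return nil
}
```

A real version would presumably also need cache invalidation (a group seen earlier can be deleted later), but the pre-check alone would already fail this reconcile before any existing rule is revoked.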