kubernetes-sigs / aws-load-balancer-controller

A Kubernetes controller for Elastic Load Balancers
https://kubernetes-sigs.github.io/aws-load-balancer-controller/
Apache License 2.0

The way the LBC handles reconciling security groups seems dangerous #3639

Open 2rs2ts opened 7 months ago

2rs2ts commented 7 months ago

Describe the bug

The way that the LBC reconciles SG rules is to revoke old rules, then add new ones, which I think is liable to cause downtime in a variety of scenarios. See here for the implementation: https://github.com/kubernetes-sigs/aws-load-balancer-controller/blob/c2515fe5a096af43a9bb81ba14374ac9f84e5704/pkg/networking/security_group_reconciler.go#L104-L114

I believe this is obviously problematic, and I feel like it must work this way only because of some technical constraint or an oversight, but either way, it seems dangerous for production systems that bear a high amount of load. For example, let's say we're using EKS and we add a new LoadBalancer Service that happens to get assigned a NodePort higher than any used to date. When using the managed shared backend security group setting (which is the default, I believe), the LBC calls RevokeSecurityGroupIngress on the security group that all the EKS nodes use, removing the rule whose source is the shared backend security group and whose port range runs from the lowest provisioned NodePort to the previously highest one. It then calls AuthorizeSecurityGroupIngress to add a rule just like it, but with the new highest NodePort as the upper bound. This change occurs in around a second (based on container logs and CloudTrail entries), but during that second there is no rule that allows ingress traffic from your NLBs to your nodes. (Although, I would note that sometimes CloudTrail says the AuthorizeSecurityGroupIngress call comes in first, but the container logs never do, so maybe there's some eventual consistency in CloudTrail that means I shouldn't rely on it.)
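For concreteness, here's a rough sketch of the ordering I'm describing. This is not the controller's actual code; it calls aws-sdk-go-v2 directly, and the security group IDs and port numbers are made up for illustration:

```go
// Hypothetical sketch of the revoke-then-authorize ordering described above.
// Not the controller's code; IDs and port ranges are invented for illustration.
package main

import (
	"context"
	"log"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/ec2"
	"github.com/aws/aws-sdk-go-v2/service/ec2/types"
)

// ipPermission builds a TCP ingress permission sourced from another security group.
func ipPermission(sourceSG string, fromPort, toPort int32) types.IpPermission {
	return types.IpPermission{
		IpProtocol: aws.String("tcp"),
		FromPort:   aws.Int32(fromPort),
		ToPort:     aws.Int32(toPort),
		UserIdGroupPairs: []types.UserIdGroupPair{
			{GroupId: aws.String(sourceSG)},
		},
	}
}

func main() {
	cfg, err := config.LoadDefaultConfig(context.TODO())
	if err != nil {
		log.Fatal(err)
	}
	client := ec2.NewFromConfig(cfg)

	nodeSG := "sg-nodes-example"        // SG attached to the worker nodes (hypothetical)
	backendSG := "sg-shared-backend-ex" // shared backend SG used as the rule source (hypothetical)

	oldRule := ipPermission(backendSG, 30000, 32100) // range before the new NodePort
	newRule := ipPermission(backendSG, 30000, 32500) // range including the new NodePort

	// Step 1: the old rule is revoked first...
	if _, err := client.RevokeSecurityGroupIngress(context.TODO(), &ec2.RevokeSecurityGroupIngressInput{
		GroupId:       aws.String(nodeSG),
		IpPermissions: []types.IpPermission{oldRule},
	}); err != nil {
		log.Fatal(err)
	}

	// Step 2: ...and only then is the widened rule authorized. Between these two
	// calls there is no rule allowing traffic from the NLBs to the nodes, and if
	// this call never succeeds, nothing puts the revoked rule back.
	if _, err := client.AuthorizeSecurityGroupIngress(context.TODO(), &ec2.AuthorizeSecurityGroupIngressInput{
		GroupId:       aws.String(nodeSG),
		IpPermissions: []types.IpPermission{newRule},
	}); err != nil {
		log.Fatal(err)
	}
}
```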

I understand that since SGs are stateful, existing client connections won't be cut off, but I was surprised to read in the logs and in the source that it works this way. I expected this would break new connections or cause at least some sort of mayhem, but in my (limited) testing of around ~120 req/s against an existing NLB while I was spinning up a new one (which triggered this SG rule reconciliation), I didn't notice any downtime. I asked around on the k8s Slack about it, and people surmised that there might be some request batching that prevents downtime for such short bursts, but no one knew what the contingency plan would be if the AuthorizeSecurityGroupIngress API call failed. So I asked AWS Support, but their answer wasn't encouraging. I'm still going to work with them on this, but they made it sound like no batching occurs, so aside from the inherent eventual consistency of AWS systems and the statefulness of connections under SGs, there's nothing that should keep traffic "floating" during what should actually be a blip of some sort. However, they didn't really address my other concerns, so it's possible they misunderstood me entirely and that there actually is some sort of batching.

Even then, server-side batching doesn't address all the problems here. For example, what if the LBC crashes (due to an OOM or some container runtime issue) after the RevokeSecurityGroupIngress call successfully goes through, but before the AuthorizeSecurityGroupIngress call is made? Or what if there is a networking issue for just long enough at that brief moment? What if some slowdown in the LBC or in the AWS API makes the whole operation take longer, raising the probability of connection failures? Another case that AWS Support brought up, which I hadn't thought of (they seemed to be trying to assure me it would not be a problem, but it just added to my list of concerns), is the authorize call getting rate limited. And this is to say nothing of situations like someone misconfiguring their LBC at some point to remove AuthorizeSecurityGroupIngress from its IAM permissions, or some bad actor doing that to them. You can't protect against every footgun in the world, but this seems like it's designed to be one.

So, I have come to raise my concerns here as well. I believe this is a dangerous way of doing things: all it takes is a single failed AuthorizeSecurityGroupIngress call, and suddenly your nodes will stop accepting traffic from all the NLBs whose backend SG rules are managed by the LBC. I do believe this can all be side-stepped by passing --disable-restricted-sg-rules=true when starting the LBC, though I'm not keen on that because it seems like a smell: there are good reasons to restrict the SG port range, both as a matter of good practice regardless of the deployment environment, and as a matter of making the component safer for clusters where host-networking containers expose ports that are only meant to be reachable from within the cluster, or any other novel arrangement one is likely to end up with when doing Kubernetes the Hard Way instead of using EKS.

Steps to reproduce

If you wanted to provoke the problem presented here, you could fiddle with your IAM permissions, or use an outbound proxy that watches the LBC's logs to detect when it is about to make an AuthorizeSecurityGroupIngress call and block that call, or, if you want to be a bad customer, trigger API rate limits by automating some system somewhere so that AuthorizeSecurityGroupIngress starts getting RequestLimitExceeded errors. I guess you could also try to cause OOMs or kill the container in the middle of the reconciliation process, but those seem harder to accomplish precisely.

Anyway, once you have introduced this chaos, simply create a new Service and the LBC will delete all your ingress rules and fail to replace them, and now all your NLBs should be in a broken state.

Expected outcome

If possible, AuthorizeSecurityGroupIngress should happen first. I feel like that should always be safe to do, since I don't think rule overlap triggers the rule duplication logic in the backend SG rule reconciliation. Say your rule is going to go from allowing ports 31000-32000 to allowing 31000-32500: just create the new rule first, then delete the old one. If necessary, you could use rule descriptions to help differentiate the rules. This should work, right?
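To illustrate, here's a hypothetical sketch of that reversed ordering, again written against aws-sdk-go-v2 with made-up IDs, and assuming (as suggested above) that overlapping but non-identical rules are treated as distinct rather than duplicates:

```go
// Hypothetical sketch of an authorize-first ordering. Not the controller's
// code; security group IDs are invented, ports follow the example above.
package main

import (
	"context"
	"log"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/ec2"
	"github.com/aws/aws-sdk-go-v2/service/ec2/types"
)

// rule builds a TCP ingress permission sourced from another security group.
func rule(sourceSG string, from, to int32) types.IpPermission {
	return types.IpPermission{
		IpProtocol: aws.String("tcp"),
		FromPort:   aws.Int32(from),
		ToPort:     aws.Int32(to),
		UserIdGroupPairs: []types.UserIdGroupPair{{GroupId: aws.String(sourceSG)}},
	}
}

func main() {
	cfg, err := config.LoadDefaultConfig(context.TODO())
	if err != nil {
		log.Fatal(err)
	}
	client := ec2.NewFromConfig(cfg)

	nodeSG, backendSG := "sg-nodes-example", "sg-shared-backend-ex" // hypothetical IDs

	// Step 1: authorize the widened rule first, so ingress coverage never drops.
	if _, err := client.AuthorizeSecurityGroupIngress(context.TODO(), &ec2.AuthorizeSecurityGroupIngressInput{
		GroupId:       aws.String(nodeSG),
		IpPermissions: []types.IpPermission{rule(backendSG, 31000, 32500)},
	}); err != nil {
		log.Fatal(err) // the old, narrower rule is still intact if this fails
	}

	// Step 2: only after the new rule exists, revoke the old, narrower one.
	if _, err := client.RevokeSecurityGroupIngress(context.TODO(), &ec2.RevokeSecurityGroupIngressInput{
		GroupId:       aws.String(nodeSG),
		IpPermissions: []types.IpPermission{rule(backendSG, 31000, 32000)},
	}); err != nil {
		log.Fatal(err) // worst case: a stale extra rule remains, not a missing one
	}
}
```

The worst case with this ordering would be a leftover, slightly-too-wide rule rather than nodes that reject all NLB traffic.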

But if I'm wrong about that, then I'm not sure how to truly fix this without changing how things work on the backend. Obviously, introducing server-side batching without API changes would ensure that the happy-path scenario (in which AuthorizeSecurityGroupIngress never fails to be called after a RevokeSecurityGroupIngress) never causes a blip of downtime. But introducing batching as a first-class feature of the API, such that the whole operation is atomic from both the client's and the server's perspective, would be best.

Finally, if those aren't possible, then I would think the best course of action is to remove this feature and replace it with something that lets the user simply specify the port ranges to allow in the managed rules. That would be safest, IMO.

Environment

Additional Context:

Once again, here's the link to the k8s slack thread where I asked about this, hoping to get a maintainer or SIG member to respond: https://kubernetes.slack.com/archives/C0LRMHZ1T/p1712080079771249

Also, it should be noted that I am used to the olden days when NLBs didn't have SGs; using SGs on NLBs is new to me as of trying to get the LBC working in our new EKS clusters, so I may be making assumptions that apply to CLBs and EC2 instances but, for some reason, not to NLBs. Sorry for being behind the times. If my concerns are based on a fundamental misunderstanding, then please, by all means, correct me, and I'll be happy to hear that there's no problem here at all.

Thanks in advance for your time and attention.

M00nF1sh commented 7 months ago

@2rs2ts Thanks for reporting this, and I acknowledge that in theory this could happen (like the cases you provided). However, in reality it may not have happened (we don't have tickets about this), because those calls are made within milliseconds and there is a propagation delay for EC2's SG changes.

I agree that we can make this better by granting the new permissions first. I will find time to test this to make sure there isn't a case where rules cannot be added first.

/kind bug

2rs2ts commented 7 months ago

Thanks for the reply, @M00nF1sh! I appreciate that you agree the worst-case scenario could be really bad, and that you'll investigate whether these worst (or at least somewhat worse) cases can be handled better.

When you say "within milliseconds", my experience is to the contrary: it's usually around a second, based both on the timestamps in the LBC's logs and on CloudTrail events. As for the propagation delay you mention, yes, I've experienced that, but it's usually on the order of a couple of seconds: I will kubectl apply something, tab over to the web console, click the refresh button, and see that the change has already propagated in the time it took me to do all that. I figured this propagation delay might be where some requests get batched together, but the AWS Support representative I spoke to told me that no batching occurs on the backend. So something doesn't add up 😅

It's true that if this were a problem on the "happy path," everyone would be complaining about it, so part of my investigation has been trying to figure out exactly why that hasn't happened yet. Our investigations thus far indicate that there is definitely more than a 0% chance of the "happy path" causing downtime unless batching is happening on the backend, so, at the risk of questioning the word of the AWS Support representative... are you able to comment on whether there is batching going on in the backend? To be clear, by "batching" I mean that similar API calls for revoking and authorizing ingress rules get grouped together into a single delta that is applied atomically, and that this process is part of the reason for the propagation delay.

RaviVadera commented 5 months ago

We were just hit by this. We are using a mix of ingresses and targetGroupBindings in our stack. When refactoring some ingresses, all the targetGroupBindings stopped working and were unreachable. After a few long (very long) days of digging, we discovered that we had been lucky up until now: the ports for all the bindings fortunately happened to fall within the ranges of two different ingresses we had. Then came the horrifying realization that we had been missing the port configuration on all the bindings ever since we first put them in.

The documentation was not quite clear on the security group handling; it did mention that networking should be handled externally, but for a couple of days we were under the wrong impression that the ALB controller would be handling those port ranges, as in our previous stack.

2rs2ts commented 5 months ago

@RaviVadera Sorry to hear that that happened to you. I feel a bit vindicated in my paranoia. Our team went with setting --disable-restricted-sg-rules to true as documented here. Hopefully you've done the same already, and if not, my comment will help you.

k8s-triage-robot commented 2 months ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot commented 1 month ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

- Mark this issue as fresh with /remove-lifecycle rotten
- Close this issue with /close
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

2rs2ts commented 2 weeks ago

/remove-lifecycle rotten