Improve service's node selection for load balancers

jkemp101 commented 7 years ago

There have been a number of issues/PRs to adjust the service controller's behavior with respect to the nodes it selects to add to a LoadBalancer. Many of these issues have focused on keeping the masters out of the load balancers which was an important first step. However, there are other use cases that are still not met with the current functionality.

Current issues:

Clusters with varying node sizes will route traffic through a node with less resources (i.e. network bandwidth, CPU, and/or memory) to instances that have more resources. This can cause significant impact to user facing applications. The only workaround is to run multiple clusters to handle differing workloads.
Even clusters with the same node types could have varying workloads. It would not make sense to route customer facing web traffic through worker nodes that might have large batch processing jobs that consume all of the node's resources. Again this could cause an undesirable impact to higher priority data flows.
Finally from a security standpoint it is good practice to isolate traffic for a set of pods away from nodes that aren't schedulable for these pods in the first place.

Functional requirements:

Services should only add nodes to load balancers that are schedulable by the pods the service fronts. Conversely, traffic should never be routed through nodes that the service's pods are not able to be scheduled on.
- If a deployment has one replica with a node selector that matches three nodes the service would route the traffic through all three nodes even though there is only going to be a single node with the pod. There is no need to try to sync the load balancer's nodes to the actual placements of pods.
Node maintenance should have similar impact to services and pods
- When a node is cordoned it should not be included in load balancers for new services
- When a node is drained it should remove that node from all existing load balancers

Approach:

Add a "load balancer node selector" to services like we have for deployments to determine the nodes that should be used for load balancers. It will be up to the user to make sure the service's node selector makes sense compared to the pods it fronts
Automatically select nodes based on the service's pod node selectors. It would be the union of all nodes the pods can be scheduled on.

jkemp101 commented 7 years ago

Trying to consolidate discussions on how to improve what nodes are used by load balancers. Above is my first cut of how it could work. Would love to flesh this out and get it as a priority. I have to implement some significant workarounds to deal with these issues. I would think this would impact many people using k8s in production.

Here are some of the related discussions mainly focused on keeping masters out of load balancers. https://github.com/kubernetes/kubernetes/issues/33884 https://github.com/kubernetes/kubernetes/pull/44745 https://github.com/kubernetes/kops/issues/853 https://github.com/kubernetes/kops/issues/639 https://github.com/kubernetes/kubernetes/issues/44319

jkemp101 commented 7 years ago

@thockin Any thoughts on this approach?

erictune commented 7 years ago

@kubernetes/sig-network-feature-requests

thockin commented 7 years ago

It's an interesting problem, for sure.

Services should only add nodes to load balancers that are schedulable by the pods the service fronts.

All of the pods? What if the pods behind a Service have different node selectors?

Do we have grounds to believe that nodeSelector is commonly used enough to matter here?

Node busyness (e.g. running a large batch as you suggest) is temporary. Updating LB backend sets can be very slow and expensive (GCP is slow, anyway). We don't want to update LB sets dynamically..

I really really don't want to expose this to users to decide. Firstly, it's significant cognitive burden and complexity, where we are already overly complicated. But more importantly - using nodes for LB is an implementation detail. There are implementations of LBs that don't do that at all, and I hope this will be more common over time.

We've started the work of adding proper healthchecks for L4 balancing. Perhaps we can use that as a way to remove nodes that are bad choices? With the OnlyLocal feature, that already happens - we only route to nodes that actually have backends running. Maybe that is enough?

We've had some discussions about returning "cost" in health checks. Maybe that goes further?

@boweidu @nicksardo @freehan

On Thu, May 4, 2017 at 2:47 PM, Eric Tune notifications@github.com wrote:

@kubernetes/sig-network-feature-requests https://github.com/orgs/kubernetes/teams/sig-network-feature-requests

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/kubernetes/kubernetes/issues/45234#issuecomment-299318632, or mute the thread https://github.com/notifications/unsubscribe-auth/AFVgVN8vDId3ddyrg-XKnJYWF8EJPUKwks5r2keMgaJpZM4NOcsG .

jkemp101 commented 7 years ago

Some thoughts:

I agree that we should keep changes to the LB to a minimum. They aren't meant to have many on the fly node changes.
My particular implementation actually has two deployments behind a single service. I have my regular web deployment and a canary deployment for the same app. I would expect the service to route traffic to any node those two deployments may put a pod. I am alright in this situation that traffic may be routed to a node for deployment A that deployment B can not put a pod. Since they are both fronted from the same service I think its safe to assume they have similar workloads/requirements so this doesn't seem like a problem.
I think any dynamic provisioning using a "cost" is going to be complicated and hard to get right. We wouldn't want to rely on a system that adjusts itself after it impacts our customers. In my case I have customers that have cronjobs that hit us very hard at regular intervals. It would be hard to dynamically change the LB after one of these large spikes start without already impacting the system.

I just can't think of a use case where a service should ever send traffic to a node for a pod that the pod is not allowed to run on other than the situation where service's selector matches multiple deployments.

xiangpengzhao commented 7 years ago

@thockin I think you @ wrong bowei :)

thockin commented 7 years ago

@jkemp101 I understand your point, and I don't even disagree with it, but I'm not sure I see how to implement it cleanly. If you use OnlyLocal you will get this behavior today (in exchange for a different set of risks). I could maybe see using healthchecks for this, but to do that, every node would have to watch all pods, which is pretty much a non-starter (too expensive).

It's tricky to do generically because we don't really know the implementation details of any given installation - whether they use healthchceks, whether they LB to nodes or pods, etc. I'm open to clever ideas, but I don't think this should be in the user's domain..

jkemp101 commented 7 years ago

@thockin Are you leaning towards approach 1?

Approaches:

There is no reason to route traffic to nodes that would never have a pod running on it. Therefore we should automatically figure out the correct nodes and take care of all of the details behind the scenes for the user. This is probably a lot of work.
Adding an additional "selector" to services would work but adds yet more complexity the user would have to deal with. Probably not as difficult to implement as the first option.

I'm fine with the second option. I don't see it different from having to pick the nodes for a deployment with a selector. This also allows special cases like routing all web traffic through Intrusion Detection System (IDS) (i.e. snort) nodes (this goes against some of my previous statements). And things would work the same way they work today if this additional selector wasn't specified with traffic going to all nodes in the cluster.

thockin commented 7 years ago

I'm against option 2 until option 1 has been explored and decided to be prohibitive.

On Fri, May 5, 2017 at 1:14 PM, Joe Kemp notifications@github.com wrote:

@thockin https://github.com/thockin Are you leaning towards approach 1?

Approaches:

There is no reason to route traffic to nodes that would never have a pod running on it. Therefore we should automatically figure out the correct nodes and take care of all of the details behind the scenes for the user. This is probably a lot of work.

Adding an additional "selector" to services would work but adds yet more complexity the user would have to deal with. Probably not as difficult to implement as the first option.

I'm fine with the second option. I don't see it different from having to pick the nodes for a deployment with a selector. This also allows special cases like routing all web traffic through Intrusion Detection System (IDS) (i.e. snort) nodes (this goes against some of my previous statements). And things would work the same way they work today if this additional selector wasn't specified with traffic going to all nodes in the cluster.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/kubernetes/kubernetes/issues/45234#issuecomment-299564792, or mute the thread https://github.com/notifications/unsubscribe-auth/AFVgVD3RLOKjippbmD3E6F_xu4PwC3PAks5r24MdgaJpZM4NOcsG .

prachetasp commented 7 years ago

Has anything become clearer on this issue? As of right now we are managing our load balancers manually via Terraform as a work around but would be much cleaner to be able to manage it through k8s itself.

jkemp101 commented 7 years ago

This is still an issue for me. I had planned to run two identical clusters for redundancy but unfortunately had to make one of the clusters run nodes of all the same type so that I could ensure our user facing web requests always went through a node with sufficient network bandwidth.

I'm currently trying to understand some 504 errors related to AWS Classic ELBs and k8s services. Once I figure that out and determine if I need to switch to ALBs I plan to checkout the ALB ingress controller and see how that would relate to this issue.

thockin commented 7 years ago

There's effectively 0% chance this is solved in 1.8. If we want any chance to solve it in 1.9, we'll need to consider options early in the cycle. I don't like putting this in the user's hands. This selector doesn't apply to all LB implementations, which makes it something of a kludge. I'd prefer we explore other options that apply more widely.

vpm-bradleyhession commented 7 years ago

I would like this also. We already run multiple clusters for QA, Dev and Prod. Adding another IMHO to separate traffic between ELBs is not really a nice solution.

fejta-bot commented 6 years ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

Prevent issues from auto-closing with an /lifecycle frozen comment.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or @fejta. /lifecycle stale

fejta-bot commented 6 years ago

Stale issues rot after 30d of inactivity. Mark the issue as fresh with /remove-lifecycle rotten. Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle rotten /remove-lifecycle stale

stephbu commented 6 years ago

/remove-lifecycle rotten

I'd like to wake this one back-up. Having all nodes participate in all services produces security challenges that prevent more concrete partitioning of networking e.g. dividing nodes between VPCs, ACLs etc., walling off masters from the rest of the world etc.

jkemp101 commented 6 years ago

This is still a priority for me but I have recently started building clusters using the alb controller. I haven't had time to investigate but I was curious if ingresses provided enough information that perhaps I could add this logic to something like the alb controller to direct traffic more appropriately. In any case ingresses and external provisioners like the alb controller add a new dimension to this problem compared to when I started. And at this point I don't really care about k8s provisioned classic load balancers (CLBs) because everything needs to move to ALBs.

stephbu commented 6 years ago

I think I care about this at two levels.

not all nodes should present a service IP. Reasoning that I should be able to partition and shape my service traffic to specific nodes, with specific capacity, and specific security boundaries.
the LB should be bound to only the nodes that are presenting the service IP. Such that only those specific nodes and ports in 1) are bound to the LB. Makes the security guys happy.

cloudbow commented 6 years ago

Yes , Why should LB talk to a node which does not have a service. It should not. We have a case where we see varying response time between requests because sometimes the requests go through a server which is not very healthy?

jkemp101 commented 6 years ago

@thockin Looks like a different approach for this issue was taken in https://github.com/kubernetes/kubernetes/issues/54743 and also starting to be discussed in https://github.com/coreos/alb-ingress-controller/issues/325. Is the plan to move forward with marking nodes where services should not send traffic versus having the ability to configure services to route traffic to the desired nodes? I feel like workarounds are being implemented that are going to affect the long term complexity of this issue.

fejta-bot commented 6 years ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle stale

fejta-bot commented 6 years ago

Stale issues rot after 30d of inactivity. Mark the issue as fresh with /remove-lifecycle rotten. Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle rotten /remove-lifecycle stale

fejta-bot commented 6 years ago

Rotten issues close after 30d of inactivity. Reopen the issue with /reopen. Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /close

chrissnell commented 6 years ago

Is it possible to limit the number of nodes being added to a LB? I've noticed that I will see 504 Gateway Timeouts on my service ELBs when cluster autoscaling adds/removes a bunch of new nodes. I'm not sure what's going on here but it feels like my service LBs are more reliable when nodes are not constantly being added and removed.

daviddyball commented 6 years ago

/reopen

k8s-ci-robot commented 6 years ago

@daviddyball: You can't reopen an issue/PR unless you authored it or you are a collaborator.

In response to [this](https://github.com/kubernetes/kubernetes/issues/45234#issuecomment-433004376): >/reopen Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.

daviddyball commented 6 years ago

I think this is still important. In heterogeneous clusters with different node types and workloads, I don't want my ELB traffic passed back to a node that might have a higher than normal workload and isn't even hosting the service that the ELB fronts.

vadimAnte commented 6 years ago

AWS EKS. 2 nodes group trough 2 different CloudFormation stacks. One stack in private subnets, second in public. All nodes in private subnets stack automatically labelled as "subnets: private". All nodes in public subnets stack automatically labelled as "subnets: public". Private subnets nodes stack: 2-10 nodes for application. Public subnets nodes stack: 2-3 nodes for LB (Traefik) only.

Would be nice to have some kind of node selection for LB services. I this case we could limit access to 80, 443 (for example) to Public subnets nodes by AWS Security group. Imagine that you need dedicated external IP for each application. Would be more secure to attach it to Public subnets nodes with Traefik only.

jibsonline commented 5 years ago

Is it implemented yet? It's a much needed feature

chrissnell commented 5 years ago

@jkemp101 would you re-open this? I agree with @jibsonline -- this is important. I am still dealing with a lot of 503s on my large+busy clusters because of this.

jkemp101 commented 5 years ago

/reopen

k8s-ci-robot commented 5 years ago

@jkemp101: Reopened this issue.

In response to [this](https://github.com/kubernetes/kubernetes/issues/45234#issuecomment-475021095): >/reopen Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.

fejta-bot commented 5 years ago

Rotten issues close after 30d of inactivity. Reopen the issue with /reopen. Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /close

k8s-ci-robot commented 5 years ago

@fejta-bot: Closing this issue.

In response to [this](https://github.com/kubernetes/kubernetes/issues/45234#issuecomment-485011523): >Rotten issues close after 30d of inactivity. >Reopen the issue with `/reopen`. >Mark the issue as fresh with `/remove-lifecycle rotten`. > >Send feedback to sig-testing, kubernetes/test-infra and/or [fejta](https://github.com/fejta). >/close Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.

hamzb commented 5 years ago

Any updates on this?

paul-maxime commented 5 years ago

Can we reopen this? Our cluster contains both normal nodes and preemptible nodes (https://cloud.google.com/compute/docs/instances/preemptible). Load balancers are also using preemptible nodes while we only want them to use stable nodes.

countergram commented 4 years ago

This is still relevant - is there any other way to do this? Besides the numerous valid use cases above, we have a case where a network issue with a node group -- that has nothing to do with the service -- and which cannot be rotated at a moment's notice because it contains instance-store databases, is causing frequent 504s on a service.

rafael-leo-almeida commented 4 years ago

/reopen please consider this feature.

liuxinbot commented 3 years ago

/reopen please consider this feature.

jkemp101 commented 3 years ago

/reopen

k8s-ci-robot commented 3 years ago

@jkemp101: Reopened this issue.

In response to [this](https://github.com/kubernetes/kubernetes/issues/45234#issuecomment-768299657): >/reopen Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.

k8s-ci-robot commented 3 years ago

@jkemp101: This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.

thockin commented 3 years ago

Can we go back to first-concepts on this? The ecosystem has evolved a lot since 2017 ;)

People who are reopening it - what is your problem statement?

For the use-case of not using preemptible VMs to do LB traffic - can you use the node.kubernetes.io/exclude-from-external-load-balancers: "true" annotation?

What else?

jvidal-sysadmin commented 3 years ago

It seems this feature was introduced in version 1.19:

Search the next in the announce: service.beta.kubernetes.io/aws-load-balancer-target-node-labels annotation to target nodes in AWS LoadBalancer Services

Detailed doc: https://v1-19.docs.kubernetes.io/docs/concepts/services-networking/service/#other-elb-annotations

segevfiner commented 3 years ago

It seems this feature was introduced in version 1.19:

Search the next in the announce: service.beta.kubernetes.io/aws-load-balancer-target-node-labels annotation to target nodes in AWS LoadBalancer Services

Detailed doc: https://v1-19.docs.kubernetes.io/docs/concepts/services-networking/service/#other-elb-annotations

But only for AWS, using an AWS specific annotation.

jvidal-sysadmin commented 3 years ago

Correct, but since 2017...something is something!

thockin commented 3 years ago

node.kubernetes.io/exclude-from-external-load-balancers: "true" is in 1.19, that is correct. It should work for all providers.

I do apologize for this bug being open so long, but the original formulation wasn't viable (still isn't) and since so much time has passed and the thinking has changed around this topic, I'd really like to refresh the problem statement.

jkemp101 commented 3 years ago

I'm not directly involved in this part of k8s anymore. So I'm fine closing this issue and someone can create a new one that is relevant to their specific issue and can address limitations of the current versions of k8s and their capabilities. Our own architecture has evolved significantly since I created this issue and so we aren't directly impacted by this anymore.

thockin commented 3 years ago

I'll leave this open a bit, and if we don't hear new requirements, we will close. We can always reopen.

thockin commented 3 years ago

OK, I'll close it now. If someone stumbles on this and has new input, please respond!

kubernetes / kubernetes

Improve service's node selection for load balancers #45234