Closed: jkemp101 closed this issue 3 years ago.
Trying to consolidate discussions on how to improve what nodes are used by load balancers. Above is my first cut of how it could work. Would love to flesh this out and get it as a priority. I have to implement some significant workarounds to deal with these issues. I would think this would impact many people using k8s in production.
Here are some of the related discussions mainly focused on keeping masters out of load balancers. https://github.com/kubernetes/kubernetes/issues/33884 https://github.com/kubernetes/kubernetes/pull/44745 https://github.com/kubernetes/kops/issues/853 https://github.com/kubernetes/kops/issues/639 https://github.com/kubernetes/kubernetes/issues/44319
@thockin Any thoughts on this approach?
@kubernetes/sig-network-feature-requests
It's an interesting problem, for sure.
Services should only add nodes to load balancers that are schedulable by the pods the service fronts.
All of the pods? What if the pods behind a Service have different node selectors?
Do we have grounds to believe that nodeSelector is commonly used enough to matter here?
Node busyness (e.g. running a large batch as you suggest) is temporary. Updating LB backend sets can be very slow and expensive (GCP is slow, anyway). We don't want to update LB sets dynamically.
I really really don't want to expose this to users to decide. Firstly, it's significant cognitive burden and complexity, where we are already overly complicated. But more importantly - using nodes for LB is an implementation detail. There are implementations of LBs that don't do that at all, and I hope this will be more common over time.
We've started the work of adding proper healthchecks for L4 balancing. Perhaps we can use that as a way to remove nodes that are bad choices? With the OnlyLocal feature, that already happens - we only route to nodes that actually have backends running. Maybe that is enough?
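For context, a minimal sketch of the OnlyLocal behavior referred to above, written against the later API where it surfaced as externalTrafficPolicy: Local (at the time of this comment it was still the beta external-traffic annotation); names are illustrative:

```yaml
# Sketch: route external LB traffic only to nodes that actually host a
# backend pod for this Service; other nodes fail the LB health check.
apiVersion: v1
kind: Service
metadata:
  name: web
spec:
  type: LoadBalancer
  selector:
    app: web
  ports:
    - port: 80
      targetPort: 8080
  externalTrafficPolicy: Local   # "OnlyLocal" in the pre-1.7 annotation form
```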
We've had some discussions about returning "cost" in health checks. Maybe that goes further?
@boweidu @nicksardo @freehan
Some thoughts:
I just can't think of a use case where a service should ever send traffic to a node that its pods are not allowed to run on, other than the situation where the service's selector matches multiple deployments.
@thockin I think you @ wrong bowei :)
@jkemp101 I understand your point, and I don't even disagree with it, but I'm not sure I see how to implement it cleanly. If you use OnlyLocal you will get this behavior today (in exchange for a different set of risks). I could maybe see using healthchecks for this, but to do that, every node would have to watch all pods, which is pretty much a non-starter (too expensive).
It's tricky to do generically because we don't really know the implementation details of any given installation - whether they use healthchecks, whether they LB to nodes or pods, etc. I'm open to clever ideas, but I don't think this should be in the user's domain.
@thockin Are you leaning towards approach 1?
Approaches:
- There is no reason to route traffic to nodes that would never have a pod running on them. Therefore we should automatically figure out the correct nodes and take care of all of the details behind the scenes for the user. This is probably a lot of work.
- Adding an additional "selector" to services would work but adds yet more complexity the user would have to deal with. Probably not as difficult to implement as the first option.
I'm fine with the second option. I don't see it different from having to pick the nodes for a deployment with a selector. This also allows special cases like routing all web traffic through Intrusion Detection System (IDS) (i.e. snort) nodes (this goes against some of my previous statements). And things would work the same way they work today if this additional selector wasn't specified with traffic going to all nodes in the cluster.
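For illustration only, a rough sketch of what the second option (an extra selector on the Service) might have looked like; the loadBalancerNodeSelector field below is hypothetical and was never part of the Service API:

```yaml
# Hypothetical sketch of "option 2". loadBalancerNodeSelector is NOT a real
# Kubernetes field; it only illustrates the proposal being discussed.
apiVersion: v1
kind: Service
metadata:
  name: web
spec:
  type: LoadBalancer
  selector:
    app: web
  ports:
    - port: 80
      targetPort: 8080
  # Only nodes matching these labels would be registered with the cloud LB.
  loadBalancerNodeSelector:
    node-role.example.com/edge: "true"
```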
I'm against option 2 until option 1 has been explored and decided to be prohibitive.
Has anything become clearer on this issue? As of right now we are managing our load balancers manually via Terraform as a workaround, but it would be much cleaner to be able to manage it through k8s itself.
This is still an issue for me. I had planned to run two identical clusters for redundancy but unfortunately had to make one of the clusters run nodes of all the same type so that I could ensure our user facing web requests always went through a node with sufficient network bandwidth.
I'm currently trying to understand some 504 errors related to AWS Classic ELBs and k8s services. Once I figure that out and determine whether I need to switch to ALBs, I plan to check out the ALB ingress controller and see how it would relate to this issue.
There's effectively 0% chance this is solved in 1.8. If we want any chance to solve it in 1.9, we'll need to consider options early in the cycle. I don't like putting this in the user's hands. This selector doesn't apply to all LB implementations, which makes it something of a kludge. I'd prefer we explore other options that apply more widely.
I would like this also. We already run multiple clusters for QA, Dev, and Prod. Adding another one just to separate traffic between ELBs is, IMHO, not really a nice solution.
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Prevent issues from auto-closing with an /lifecycle frozen comment.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or @fejta.
/lifecycle stale
Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten
/remove-lifecycle stale
/remove-lifecycle rotten
I'd like to wake this one back up. Having all nodes participate in all services produces security challenges that prevent more concrete partitioning of the network, e.g. dividing nodes between VPCs, applying ACLs, walling off masters from the rest of the world, etc.
This is still a priority for me but I have recently started building clusters using the alb controller. I haven't had time to investigate but I was curious if ingresses provided enough information that perhaps I could add this logic to something like the alb controller to direct traffic more appropriately. In any case ingresses and external provisioners like the alb controller add a new dimension to this problem compared to when I started. And at this point I don't really care about k8s provisioned classic load balancers (CLBs) because everything needs to move to ALBs.
I think I care about this at two levels.
1. Not all nodes should present a service IP. Reasoning: I should be able to partition and shape my service traffic onto specific nodes, with specific capacity and specific security boundaries.
2. The LB should be bound only to the nodes that present the service IP, so that only the specific nodes and ports from 1) are bound to the LB. Makes the security guys happy.
Yes. Why should the LB talk to a node that does not host the service? It should not. We have a case where we see varying response times between requests because sometimes the requests go through a server that is not very healthy.
@thockin Looks like a different approach for this issue was taken in https://github.com/kubernetes/kubernetes/issues/54743 and also starting to be discussed in https://github.com/coreos/alb-ingress-controller/issues/325. Is the plan to move forward with marking nodes where services should not send traffic versus having the ability to configure services to route traffic to the desired nodes? I feel like workarounds are being implemented that are going to affect the long term complexity of this issue.
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale
Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten
/remove-lifecycle stale
Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close
Is it possible to limit the number of nodes being added to a LB? I've noticed that I will see 504 Gateway Timeouts on my service ELBs when cluster autoscaling adds/removes a bunch of new nodes. I'm not sure what's going on here but it feels like my service LBs are more reliable when nodes are not constantly being added and removed.
/reopen
@daviddyball: You can't reopen an issue/PR unless you authored it or you are a collaborator.
I think this is still important. In heterogeneous clusters with different node types and workloads, I don't want my ELB traffic passed back to a node that might have a higher than normal workload and isn't even hosting the service that the ELB fronts.
AWS EKS. Two node groups created through two different CloudFormation stacks, one in private subnets and one in public subnets. All nodes in the private-subnets stack are automatically labelled "subnets: private"; all nodes in the public-subnets stack are labelled "subnets: public". Private-subnets stack: 2-10 nodes for the application. Public-subnets stack: 2-3 nodes for the LB (Traefik) only.
It would be nice to have some kind of node selection for LB services. In this case we could limit access to ports 80 and 443 (for example) to the public-subnets nodes with an AWS security group. Imagine that you need a dedicated external IP for each application; it would be more secure to attach it to the public-subnets nodes running Traefik only.
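For concreteness, a minimal sketch of the node labelling described above; the node name is a placeholder:

```yaml
# Sketch: a node from the public-subnets stack carrying the "subnets" label
# mentioned above (private-stack nodes would carry subnets: private).
apiVersion: v1
kind: Node
metadata:
  name: ip-10-0-1-23.ec2.internal
  labels:
    subnets: public
```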
Is it implemented yet? It's a much-needed feature.
@jkemp101 would you re-open this? I agree with @jibsonline -- this is important. I am still dealing with a lot of 503s on my large+busy clusters because of this.
/reopen
@jkemp101: Reopened this issue.
Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close
@fejta-bot: Closing this issue.
Any updates on this?
Can we reopen this? Our cluster contains both normal nodes and preemptible nodes (https://cloud.google.com/compute/docs/instances/preemptible). Load balancers are also using preemptible nodes while we only want them to use stable nodes.
This is still relevant. Is there any other way to do this? Besides the numerous valid use cases above, we have a case where a network issue with a node group (one that has nothing to do with the service and cannot be rotated at a moment's notice because it contains instance-store databases) is causing frequent 504s on a service.
/reopen please consider this feature.
/reopen
@jkemp101: Reopened this issue.
@jkemp101: This issue is currently awaiting triage.
If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.
The triage/accepted label can be added by org members by writing /triage accepted in a comment.
Can we go back to first-concepts on this? The ecosystem has evolved a lot since 2017 ;)
People who are reopening it - what is your problem statement?
For the use-case of not using preemptible VMs to do LB traffic - can you use the node.kubernetes.io/exclude-from-external-load-balancers: "true" annotation?
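For reference, a minimal sketch of applying it; upstream Kubernetes (1.19+) treats node.kubernetes.io/exclude-from-external-load-balancers as a node label, and the node name below is a placeholder:

```yaml
# Sketch: exclude a (e.g. preemptible) node from external load balancers.
apiVersion: v1
kind: Node
metadata:
  name: gke-pool-preemptible-node-1
  labels:
    node.kubernetes.io/exclude-from-external-load-balancers: "true"
```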
What else?
It seems this feature was introduced in version 1.19. From the release announcement:
service.beta.kubernetes.io/aws-load-balancer-target-node-labels annotation to target nodes in AWS LoadBalancer Services
Detailed doc: https://v1-19.docs.kubernetes.io/docs/concepts/services-networking/service/#other-elb-annotations
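A minimal sketch of how that annotation might be used, assuming the value is a comma-separated list of node-label key=value pairs as described in the linked doc; the service name and label are illustrative:

```yaml
# Sketch: restrict the nodes registered with an AWS LoadBalancer Service to
# those carrying a given label (in-tree AWS provider, k8s 1.19+).
apiVersion: v1
kind: Service
metadata:
  name: web
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-target-node-labels: "subnets=public"
spec:
  type: LoadBalancer
  selector:
    app: web
  ports:
    - port: 443
      targetPort: 8443
```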
But only for AWS, using an AWS specific annotation.
Correct, but since 2017... something is better than nothing!
node.kubernetes.io/exclude-from-external-load-balancers: "true" is in 1.19, that is correct. It should work for all providers.
I do apologize for this bug being open so long, but the original formulation wasn't viable (still isn't) and since so much time has passed and the thinking has changed around this topic, I'd really like to refresh the problem statement.
I'm not directly involved in this part of k8s anymore. So I'm fine closing this issue and someone can create a new one that is relevant to their specific issue and can address limitations of the current versions of k8s and their capabilities. Our own architecture has evolved significantly since I created this issue and so we aren't directly impacted by this anymore.
I'll leave this open a bit, and if we don't hear new requirements, we will close. We can always reopen.
OK, I'll close it now. If someone stumbles on this and has new input, please respond!
There have been a number of issues/PRs to adjust the service controller's behavior with respect to the nodes it selects to add to a LoadBalancer. Many of these issues have focused on keeping the masters out of the load balancers, which was an important first step. However, there are other use cases that are still not met with the current functionality.
Current issues:
Functional requirements:
Approach: