Open danwinship opened 4 weeks ago
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: danwinship
The full list of commands accepted by this bot can be found here.
The pull request process is described here
Completely agree that we should do it, but I have concerns about whether we should overload the existing term instead of adding a new one.
It's not even overloading though; `PreferClose` already means this:
```go
// Indicates a preference for routing traffic to endpoints that are
// topologically proximate to the client. The interpretation of "topologically
// proximate" may vary across implementations and could encompass endpoints
// within the same node, rack, zone, or even region. Setting this value gives
// implementations permission to make different tradeoffs, e.g. optimizing for
// proximity rather than equal distribution of load. Users should not set this
// value if such tradeoffs are not acceptable.
ServiceTrafficDistributionPreferClose = "PreferClose"
```
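For reference, here is a minimal sketch of setting that field on a Service in Go; the Service name, selector, and port are made up for illustration.

```go
package example

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/utils/ptr"
)

// buildService shows how a user opts in today: a single spec.trafficDistribution
// field on the Service. The name, selector, and port are illustrative only.
func buildService() *corev1.Service {
	return &corev1.Service{
		ObjectMeta: metav1.ObjectMeta{Name: "example-dns"},
		Spec: corev1.ServiceSpec{
			Selector:            map[string]string{"app": "example-dns"},
			Ports:               []corev1.ServicePort{{Port: 53, Protocol: corev1.ProtocolUDP}},
			TrafficDistribution: ptr.To(corev1.ServiceTrafficDistributionPreferClose),
		},
	}
}
```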
The consensus in #4445 seemed to be in favor of vaguely-defined hints, interpreted broadly, rather than precisely-defined alternate semantics.
> The consensus in #4445 seemed to be in favor of vaguely-defined hints, interpreted broadly, rather than precisely-defined alternate semantics.
I don't disagree with that, but I expect the interpretation to always be the same for the same implementation, not different depending on factors like whether it's a DaemonSet or has fewer than 3 endpoints. That's a slippery slope in my opinion... and it will make this more complex to debug, explain, and troubleshoot.
> I expect the interpretation to always be the same for the same implementation
AIUI this is exactly how it's not supposed to work. The hints are vague specifically so that implementations can get smarter in the future.
> AIUI this is exactly how it's not supposed to work. The hints are vague specifically so that implementations can get smarter in the future.
The previous implementation was "smart" and turned out to be very confusing for users; that is why, IIRC, we moved to a simpler and more deterministic solution.
I may have lost track of these details and I recognize I don't remember most of them, but @gauravkghildiyal and @robscott should be the most knowledgeable. Still, I expect `PreferClose` to always work the same in kube-proxy, and that we'd add a new traffic distribution mode for this.
You mean Topology Keys? That was designed to let the user be "smart" and required the proxy to be correspondingly "dumb". (The proxy had to do exactly what the user said, even if it knew that would have bad effects). TrafficDistribution, like Topology-Aware Hints, was designed to allow the proxy to be smart, while also allowing the user to give hints about what they think would be smart for the proxy to do.
I'm open to the argument that we shouldn't change PreferClose to include prefer-same-node because we think the heuristic I suggested doesn't sufficiently capture the user's intent. But if you don't think it would ever be legitimate to change "PreferClose" to include prefer-same-node at all, then we should deprecate "PreferClose" and rename it to "PreferSameZone". The entire point of having the name be vague was that it wasn't supposed to refer to any specific behavior. If we think we can't get away with changing the behavior, then we should give up on the vagueness idea.
Maybe I should try to figure out the `TopologySpreadConstraints` version of this...

(`TopologySpreadConstraints` is a feature for telling the scheduler how to schedule pods to nodes in a way that ensures they are evenly distributed relative to some topological constraint (eg, host, zone, region, etc). It is fairly complicated, but the complication is all in the "figuring out when you can and can't schedule a pod" part. The actual topology definition is a single field, and that's all we'd need to look at here.)
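To illustrate, the "single field" in question is `topologyKey`; a minimal sketch of a constraint as it would appear in a pod spec (the `app: example-dns` selector is made up):

```go
package example

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// exampleConstraint spreads matching pods evenly across nodes. topologyKey is
// the one piece of the constraint that the hinting idea below would read.
var exampleConstraint = corev1.TopologySpreadConstraint{
	MaxSkew:           1,
	TopologyKey:       "kubernetes.io/hostname",
	WhenUnsatisfiable: corev1.ScheduleAnyway,
	LabelSelector: &metav1.LabelSelector{
		MatchLabels: map[string]string{"app": "example-dns"},
	},
}
```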
So this would be something like:

Add a new `ForNodeSelector *metav1.LabelSelector` field to `discoveryv1.EndpointHints`.
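Roughly something like this (the existing `ForZones` hint is real; `ForNodeSelector` is the proposed, hypothetical addition):

```go
package example

import (
	discoveryv1 "k8s.io/api/discovery/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// EndpointHints as it might look with the proposed field added. ForZones is the
// hint that exists in discovery/v1 today; ForNodeSelector is hypothetical.
type EndpointHints struct {
	// ForZones indicates the zone(s) this endpoint should be consumed by to
	// enable topology-aware routing.
	ForZones []discoveryv1.ForZone `json:"forZones,omitempty"`

	// ForNodeSelector (proposed) restricts consumption of this endpoint to
	// nodes whose labels match the selector.
	ForNodeSelector *metav1.LabelSelector `json:"forNodeSelector,omitempty"`
}
```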
If a Service has `trafficDistribution: PreferClose` (or maybe `trafficDistribution: TopologyConstrained`), then for each endpoint Pod that has `.spec.topologySpreadConstraints` (which in the normal case would be all of them), the EndpointSlice controller will set the `forNodeSelector` field of the corresponding Endpoint to an appropriate selector based on the constraints' `topologyKey` values. Eg, if there is an endpoint pod on "node1", with a single `TopologySpreadConstraint` that has `topologyKey: "kubernetes.io/hostname"`, then the corresponding Endpoint would get:
```yaml
hints:
  forNodeSelector:
    matchLabels:
      "kubernetes.io/hostname": node1
```
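A minimal sketch of how the controller side of that derivation might work, assuming the proposed field above (this is not existing EndpointSlice controller code):

```go
package example

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// forNodeSelectorForPod derives the proposed hint from a pod's
// topologySpreadConstraints and the labels of the node it landed on: one
// matchLabels entry per topologyKey, using that node's value for the key.
func forNodeSelectorForPod(pod *corev1.Pod, node *corev1.Node) *metav1.LabelSelector {
	matchLabels := map[string]string{}
	for _, tsc := range pod.Spec.TopologySpreadConstraints {
		if value, ok := node.Labels[tsc.TopologyKey]; ok {
			matchLabels[tsc.TopologyKey] = value
		}
	}
	if len(matchLabels) == 0 {
		// No constraints (or no matching node labels): don't set the hint.
		return nil
	}
	return &metav1.LabelSelector{MatchLabels: matchLabels}
}
```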
If kube-proxy sees an endpoint with the `forNodeSelector` hint set (and topology is enabled), then it would check if the node selector applies to its own host, and ignore the endpoint otherwise (unless this resulted in no endpoints being available, in which case it would ignore the `forNodeSelector` hints for this service).
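And a sketch of the corresponding proxy-side filtering, including the fall-back-to-everything behavior (the `endpoint` type here is a simplified stand-in, not a real kube-proxy type):

```go
package example

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/labels"
)

// endpoint is a simplified stand-in for an EndpointSlice endpoint carrying the
// proposed hint.
type endpoint struct {
	Addresses       []string
	ForNodeSelector *metav1.LabelSelector
}

// filterForNode keeps the endpoints whose forNodeSelector matches this node's
// labels (endpoints without the hint always match). If nothing matches, it falls
// back to the full set rather than leaving the Service with no endpoints.
func filterForNode(endpoints []endpoint, myNodeLabels labels.Set) []endpoint {
	matched := make([]endpoint, 0, len(endpoints))
	for _, ep := range endpoints {
		if ep.ForNodeSelector == nil {
			matched = append(matched, ep)
			continue
		}
		selector, err := metav1.LabelSelectorAsSelector(ep.ForNodeSelector)
		// On a malformed selector we fail open and keep the endpoint.
		if err != nil || selector.Matches(myNodeLabels) {
			matched = append(matched, ep)
		}
	}
	if len(matched) == 0 {
		return endpoints
	}
	return matched
}
```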
(Theoretically we could make the hints more compact by having a single top-level `slice.hints.forTopologyKeys: ["kubernetes.io/hostname"]` rather than exploding that out into per-endpoint nodeSelectors. This would require the service proxy to keep track of the labels of all nodes rather than only its own node, so that if it sees `slice.hints.forTopologyKeys: ["topology.kubernetes.io/region"]` it can figure out which nodes have the same region as its own node.)
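For comparison, the matching logic under that alternative might look something like this (again just a sketch; the proxy would need a node-label cache covering all nodes):

```go
package example

// sameTopologyDomain reports whether an endpoint's node shares every requested
// topology domain (hostname, zone, region, ...) with the proxy's own node.
// endpointNodeLabels would have to come from a cache of all nodes' labels,
// not just the proxy's own node.
func sameTopologyDomain(forTopologyKeys []string, myNodeLabels, endpointNodeLabels map[string]string) bool {
	for _, key := range forTopologyKeys {
		if myNodeLabels[key] == "" || myNodeLabels[key] != endpointNodeLabels[key] {
			return false
		}
	}
	return true
}
```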
From kube-proxy's perspective, this is basically the same as Topology Keys, but the difference is that you only get to use it if you've also told the scheduler about the plan, so it avoids the problem of "you said to prefer-same-zone but there is only 1 endpoint in this zone (and 10 in another)" which Topology Keys was vulnerable to.
One downside of this approach is that DaemonSets don't actually use `topologySpreadConstraints`... So you'd have to redundantly specify a dummy TSC as part of your DaemonSet pod template... Or maybe we'd say that `trafficDistribution: TopologyConstrained` applies to either pods using `TopologySpreadConstraints` or pods deployed by a DaemonSet.
Previously introduced as `internalTrafficPolicy: PreferLocal` in #3016 and then reworked as "node-level topology" in #3293 (which was closed before we came up with any specific API). The goal here is "pods should connect to the local CoreDNS endpoint, if there is one".
Given that "PreferClose" has an intentionally vague name (which, in particular, does not mention "zones"), it seemed to make more sense to add prefer-same-node semantics to "PreferClose" than to add a new traffic distribution ("PreferREALLYClose"?) for this. The proposed rule for deciding when prefer-same-node can be used is:
(Bringing this up now because I have reason to want to kill off our downstream "prefer same node just for CoreDNS" patch again.)
I'm not sure if we'd need a new feature gate for this new behavior?
In the past, it has been suggested that this is what NodeLocal DNSCache is for, but NodeLocal DNSCache is just a bad partial reimplementation of Services, to make up for the fact that the real Service API doesn't have a prefer-same-node feature... (Among the reasons that it is bad: it doesn't support fallback-to-other-nodes when the DNS pods are being upgraded.)
/cc @aojea @thockin @gauravkghildiyal @robscott