KEP-4444: add "prefer same node" semantics to PreferClose #4931

Open danwinship opened 4 weeks ago

danwinship commented 4 weeks ago

Previously introduced as internalTrafficPolicy: PreferLocal in #3016 and then reworked as "node-level topology" in #3293 (which was closed before we came up with any specific API).

The goal here is "pods should connect to the local CoreDNS endpoint, if there is one".

Given that "PreferClose" has an intentionally vague name (which, in particular, does not mention "zones"), it seemed to make more sense to add prefer-same-node semantics to "PreferClose" than to add a new traffic distribution ("PreferREALLYClose"?) for this. The proposed rule for deciding when prefer-same-node can be used (sketched in code after the list) is:

  * the Service has `PreferClose` traffic distribution, and
  * the Service has endpoints on all or all-but-one of the Ready nodes, and
  * the endpoints of the Service are Pods that have `OwnerReferences` indicating
    that they are all part of the same DaemonSet.
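
For concreteness, here is a minimal sketch of that rule. This is not the KEP's actual implementation: the `canPreferSameNode` helper and its arguments are made up, and it assumes the caller has already collected the Service, its endpoint Pods, and the count of Ready nodes.

```go
package sketch

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/apimachinery/pkg/util/sets"
)

// canPreferSameNode reports whether the proposed prefer-same-node behavior
// could be enabled for a Service, per the three conditions listed above.
func canPreferSameNode(svc *corev1.Service, endpointPods []*corev1.Pod, readyNodes int) bool {
	// 1. The Service must ask for PreferClose traffic distribution.
	if svc.Spec.TrafficDistribution == nil ||
		*svc.Spec.TrafficDistribution != corev1.ServiceTrafficDistributionPreferClose {
		return false
	}

	var daemonSetUID types.UID
	nodesWithEndpoints := sets.New[string]()
	for _, pod := range endpointPods {
		// 3. Every endpoint Pod must be controlled by the same DaemonSet.
		owner := metav1.GetControllerOf(pod)
		if owner == nil || owner.Kind != "DaemonSet" {
			return false
		}
		if daemonSetUID == "" {
			daemonSetUID = owner.UID
		} else if owner.UID != daemonSetUID {
			return false
		}
		nodesWithEndpoints.Insert(pod.Spec.NodeName)
	}

	// 2. Endpoints must exist on all, or all but one, of the Ready nodes.
	return nodesWithEndpoints.Len() >= readyNodes-1
}
```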

(Bringing this up now because I have reason to want to kill off our downstream "prefer same node just for CoreDNS" patch again.)

I'm not sure if we'd need a new feature gate for this new behavior?

In the past, it has been suggested that this is what NodeLocal DNSCache is for, but NodeLocal DNSCache is just a bad partial reimplementation of Services, to make up for the fact that the real Service API doesn't have a prefer-same-node feature... (Among the reasons that it is bad: it doesn't support fallback-to-other-nodes when the DNS pods are being upgraded.)

/cc @aojea @thockin @gauravkghildiyal @robscott

k8s-ci-robot commented 4 weeks ago

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: danwinship

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

  * ~~[keps/sig-network/OWNERS](https://github.com/kubernetes/enhancements/blob/master/keps/sig-network/OWNERS)~~ [danwinship]

Approvers can indicate their approval by writing `/approve` in a comment. Approvers can cancel approval by writing `/approve cancel` in a comment.

aojea commented 4 weeks ago

Completely agree that we should do it; my concern is whether we should overload the existing term instead of adding a new one.

danwinship commented 4 weeks ago

It's not even overloading though; PreferClose already means this:

```go
// Indicates a preference for routing traffic to endpoints that are
// topologically proximate to the client. The interpretation of "topologically
// proximate" may vary across implementations and could encompass endpoints
// within the same node, rack, zone, or even region. Setting this value gives
// implementations permission to make different tradeoffs, e.g. optimizing for
// proximity rather than equal distribution of load. Users should not set this
// value if such tradeoffs are not acceptable.
ServiceTrafficDistributionPreferClose = "PreferClose"
```
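
(For illustration only: opting a Service into this today is a single field. A minimal sketch, assuming a release where `spec.trafficDistribution` and the `ServiceTrafficDistributionPreferClose` constant are available, with `ptr.To` from k8s.io/utils/ptr; the Service shown is a made-up DNS-style example.)

```go
package sketch

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/utils/ptr"
)

// A DNS-style Service opting in to PreferClose. Today kube-proxy treats this
// as prefer-same-zone; the proposal above would additionally allow
// prefer-same-node when the endpoints all come from one DaemonSet.
var dnsService = &corev1.Service{
	ObjectMeta: metav1.ObjectMeta{Name: "kube-dns", Namespace: "kube-system"},
	Spec: corev1.ServiceSpec{
		Selector:            map[string]string{"k8s-app": "kube-dns"},
		Ports:               []corev1.ServicePort{{Name: "dns", Port: 53, Protocol: corev1.ProtocolUDP}},
		TrafficDistribution: ptr.To(corev1.ServiceTrafficDistributionPreferClose),
	},
}
```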

The consensus in #4445 seemed to be in favor of vaguely-defined hints, interpreted broadly, rather than precisely-defined alternate semantics.

aojea commented 4 weeks ago

> The consensus in #4445 seemed to be in favor of vaguely-defined hints, interpreted broadly, rather than precisely-defined alternate semantics.

I don't disagree with that, but I expect the interpretation to always be the same for a given implementation, not to vary depending on factors like whether the workload is a DaemonSet or has fewer than 3 endpoints. That is a slippery slope in my opinion, and it will make this more complex to debug, explain, and troubleshoot.

danwinship commented 4 weeks ago

> I expect the interpretation to always be the same for a given implementation

AIUI this is exactly how it's not supposed to work. The hints are vague specifically so that implementations can get smarter in the future.

aojea commented 4 weeks ago

> AIUI this is exactly how it's not supposed to work. The hints are vague specifically so that implementations can get smarter in the future.

The previous implementation was "smart" and turned out to be very confusing for users; that is why, IIRC, we moved to a simpler and more deterministic solution.

I may have lost track of these details, and I recognize I don't remember most of them (@gauravkghildiyal and @robscott should be the most knowledgeable here), but I expect PreferClose to always work the same way in kube-proxy, and I'd rather add a new traffic-distribution mode for this.

danwinship commented 3 weeks ago

You mean Topology Keys? That was designed to let the user be "smart" and required the proxy to be correspondingly "dumb". (The proxy had to do exactly what the user said, even if it knew that would have bad effects). TrafficDistribution, like Topology-Aware Hints, was designed to allow the proxy to be smart, while also allowing the user to give hints about what they think would be smart for the proxy to do.

I'm open to the argument that we shouldn't change PreferClose to include prefer-same-node because we think the heuristic I suggested doesn't sufficiently capture the user's intent. But if you don't think it would ever be legitimate to change "PreferClose" to include prefer-same-node at all, then we should deprecate "PreferClose" and rename it to "PreferSameZone". The entire point of having the name be vague was that it wasn't supposed to refer to any specific behavior. If we think we can't get away with changing the behavior, then we should give up on the vagueness idea.

danwinship commented 3 weeks ago

Maybe I should try to figure out the TopologySpreadConstraints version of this...

(TopologySpreadConstraints is a feature for telling the scheduler how to schedule pods to nodes in a way that ensures they are evenly distributed across some topology domain (e.g. host, zone, region). It is fairly complicated, but the complication is all in the "figuring out when you can and can't schedule a pod" part. The actual topology definition is a single field, and that's all we'd need to look at here.)
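
As a purely illustrative example of that single field (the label selector and values below are made up; the types are the ordinary core/v1 ones), a constraint that spreads matching pods across nodes looks like:

```go
package sketch

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// An ordinary pod topology spread constraint. Under the idea below, the only
// part a service proxy would care about is TopologyKey.
var spreadAcrossNodes = corev1.TopologySpreadConstraint{
	MaxSkew:           1,
	TopologyKey:       "kubernetes.io/hostname",
	WhenUnsatisfiable: corev1.DoNotSchedule,
	LabelSelector: &metav1.LabelSelector{
		MatchLabels: map[string]string{"k8s-app": "kube-dns"},
	},
}
```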

So this would be something like:
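
(The sketch below is purely illustrative: the `NodeSelector` hint field is hypothetical, not part of the discovery/v1 API, and shows the per-endpoint shape that the next paragraph contrasts with a more compact top-level form.)

```go
package sketch

import discoveryv1 "k8s.io/api/discovery/v1"

// hypotheticalEndpointHints extends the existing discovery/v1 hints with a
// per-endpoint node selector; the new field is illustrative only.
type hypotheticalEndpointHints struct {
	discoveryv1.EndpointHints

	// NodeSelector would be filled in by the EndpointSlice controller from the
	// owning workload's topologySpreadConstraints topologyKey, e.g.
	// {"kubernetes.io/hostname": "node-a"} for the endpoint running on node-a.
	NodeSelector map[string]string `json:"nodeSelector,omitempty"`
}
```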

(Theoretically we could make the hints more compact by having a single top-level `slice.hints.forTopologyKeys: ["kubernetes.io/hostname"]` rather than exploding that out into per-endpoint `nodeSelectors`. This would require the service proxy to keep track of the labels of all nodes rather than only its own node, so that if it sees `slice.hints.forTopologyKeys: ["topology.kubernetes.io/region"]` it can figure out which nodes have the same region as its own node.)

From kube-proxy's perspective, this is basically the same as Topology Keys, but the difference is that you only get to use it if you've also told the scheduler about the plan. That avoids the problem Topology Keys was vulnerable to: "you said to prefer-same-zone, but there is only 1 endpoint in this zone (and 10 in another)".

One downside of this approach is that DaemonSets don't actually use topologySpreadConstraints... So you'd have to redundantly specify a dummy TSC as part of your DaemonSet pod template... Or maybe we'd say that `trafficDistribution: TopologyConstrained` applies to either pods using TopologySpreadConstraints or pods deployed by a DaemonSet.