kubernetes / kubernetes

Production-Grade Container Scheduling and Management
https://kubernetes.io
Apache License 2.0
110.37k stars 39.47k forks source link

Requirements for taints, tolerations, dedicated nodes to graduate to Beta and then v1 #25320

Closed davidopp closed 8 months ago

davidopp commented 8 years ago

We'd like to graduate taints, tolerations, and dedicated nodes to Beta in 1.4. (Assuming we get them into 1.3 as alpha.)

Taints/tolerations to Beta

Taints/tolerations to v1

Dedicated nodes to Beta

cc/ @kevin-wangzefeng @bgrant0607

ref/ #24134 #17190 #18263

resouer commented 8 years ago

Working on node side to support kubelet enforced feature of taints/toleration. This requires a pod admitter in the future which is in sig-node TODO list.

resouer commented 8 years ago

@davidopp Cloud you explain why add forgiveness to Toleration? If we don't want to tolerate some taints, just remove toleration is not enough?

davidopp commented 8 years ago

The canonical example for forgiveness is: let's say we represent node problems using taints (which we actually do plan to do). So for example Then a pod can use toleration to represent how long it wants to stay bound to a nod before it should be evicted. You want to add the taint immediately when the node starts having the problem, because some pods might want to be evicted right away. But the ones that don't want to be evicted right away, can use toleration to specify how long they want to stay bound. (Of course this assumes the TaintEffect is TaintEffectNoScheduleNoAdmitNoExecute.)

BTW @kevin-wangzefeng is going to work on adding forgiveness to Toleration.

resouer commented 8 years ago

Thanks, its much clearer.

hongchaodeng commented 8 years ago

@davidopp

I would to help implement some tasks for v1.4. Can your mark tasks being taken by somebody already? Or can folks who is taking over mark themselves here?

BTW, what's "proper defaulting"?

davidopp commented 8 years ago

Yes, sorry, to clarify, we haven't done any planning for 1.4, and we intend to do it jointly with the community when it happens. But @kevin-wangzefeng implemented taints and tolerations (#24134) and had already told me he was planning to add forgiveness since it was part of the original design.

resouer commented 8 years ago

@hongchaodeng Except for forgiveness and what I am working on, e.g. NoScheduleNoAdmit NoScheduleNoAdmitNoExecute, it seems other tasks are all available. Would be great to see if you can pick some up.

lukaszo commented 8 years ago

Is anyone working on Move to GeneralPredicates ? If no, I can take it.

davidopp commented 8 years ago

@lukaszo I don't think anyone is working on it.

kevin-wangzefeng commented 8 years ago

Some updates:

eparis commented 8 years ago

@davidopp is this list still correct(ish) for getting to beta? Is there any better place to look?

davidopp commented 8 years ago

I think this list is more or less accurate. @kevin-wangzefeng is working on some of the items for 1.5. I don't think we will have time to make it Beta in 1.5 though. We could maybe split this up into taints/tolerations and dedicated nodes, and consider the latter a separate feature; that would allow us to remove the security-related items from the list (like ensuring Kubelet comes up with the right taints when it joins, and preventing unprivileged users from modifying taints/toleations). Also we definitely don't need NoScheduleNoAdmit for Beta, as it can be added in a backward-compatible way.

timothysc commented 8 years ago

/cc @ConnorDoyle @jayunit100

davidopp commented 8 years ago

Figure out what to do with NodeStatus.NodeSpec "unschedulable" field

aveshagarwal commented 8 years ago

@kevin-wangzefeng @davidopp Could you please share what specific items kevin is working on for 1.5, any links or anything for reference? Also I am happy to help on other items not being worked on by kevin, perhaps to help it move closer to beta.

@derekwaynecarr

timothysc commented 8 years ago

@aveshagarwal please coordinate via the @kubernetes/sig-scheduling to ensure folks aren't overlapping. xref: http://goo.gl/CsXJLu

aveshagarwal commented 8 years ago

I am trying to summarize the tasks here to capture the current state based on what I have found as its not clear what is being worked on, is already done or is yet to do. As @davidopp suggested to separate taints/tolerations and dedicated nodes features in one of his previous comments, I am updating the list for taints/tolerations and dedicated nodes features to graduate to beta here (please feel free to correct me, and also update as needed ):

Taints/Toleration to Beta:

To-do:

Being worked on/or done:

Taints/tolerations to v1:

Dedicated nodes to Beta

We can discuss more about it in today's sig-scheduling meeting.

aveshagarwal commented 8 years ago

/cc @derekwaynecarr

aveshagarwal commented 8 years ago

@kubernetes/huawei

aveshagarwal commented 8 years ago

@davidopp @kevin-wangzefeng As discussed in the yesterday's sig-scheduling meeting, would like to have feedback on my last comment or in general.

timothysc commented 8 years ago

@aveshagarwal things to keep in mind re(alpha-beta) : https://github.com/kubernetes/kubernetes/issues/30819#issuecomment-249877334

davidopp commented 8 years ago

Sorry for the delay. I think my list was way too long. Those are all features we want eventually but I don't think they're all needed for 1.5.

The remaining things I think we need for 1.5 are

@kevin-wangzefeng is working on these and said he should have a PR out soon.

As for graduating it to Beta, we should do that in a followup PR. I hope we can get it done for 1.5. I think we need to implement "make toleration key wildcardable" before (or simultaneous with) graduating to Beta because I'm not sure we can add it later in a backward compatible way. (imagine someone creates an alpha toleration with key X and then we decide that in beta key X means wildcard; then when they upgrade from alpha to beta their toleration suddenly means something new)

There is discussion ongoing in #30819 about how to graduate things to Beta in general.

We can ignore dedicated notes and just consider it a separate feature to be implemented in the future.

aveshagarwal commented 8 years ago

I can take a stab at "make toleration key wildcardable" implementation if no one is working on it. Here is my understanding.

When a toleration key is wild card: Case 1: Operator=Equal, Value=v, Effect=one of NoSchedule, PreferNoSchedule, NoExecute, NoScheduleNoAdmit

It means that a pod with the above toleration could be scheduled to any node that has a taint with value v (irregardless of key) according to the effects.

Case 2: Operator=Exits, Value=v, Effect=one of NoSchedule, PreferNoSchedule, NoExecute, NoScheduleNoAdmit

In this case, the value is wildcard (As Operator=Exits) too. As both key and value are wild cards, it seems equivalent to no toleration.

davidopp commented 7 years ago

I think @kevin-wangzefeng may have started working on it, so you should check with him.

What I proposed was that empty toleration key (empty string) means "match all keys" I think it would only make sense to use it with Exists operator (so it means "match all values and all keys"), so we would check that in validation.

aveshagarwal commented 7 years ago

@davidopp thanks, its ok with me if @kevin-wangzefeng is working on it, so I will leave it to him.

Is someone working on "NoScheduleNoAdmit" implementation too, as I haven't seen it any PR? In case no, I would be happy to help with it.

kevin-wangzefeng commented 7 years ago

@aveshagarwal I think it's #34400 (previously #26501), but I didn't get time to take a look yet.

aveshagarwal commented 7 years ago

@kevin-wangzefeng Yes https://github.com/kubernetes/kubernetes/pull/34400 implements NoScheduleNoAdmit.

aveshagarwal commented 7 years ago

Is this the expectation with NoScheduleNoAdmit?

a) NoSchedule part of NoScheduleNoAdmit is exactly similar to existing NoSchedule and is for pods going via scheduler.

b) NoAdmit part of NoScheduleNoAdmit is for static pods (created directly by kubelet, i..e the pods not going through the scheduler).

davidopp commented 7 years ago

Unfortunately we're not going to have time to get #34400 into 1.5.

But as for what you described @aveshagarwal, it is correct except I'm not sure the term "static pods" is correct (or maybe I just don't understand the definition you're using). But as you said, it is for pods that are submitted directly to Kubelet, e.g. by creating a pod with NodeName already filled in.

aveshagarwal commented 7 years ago

@davidopp I just picked the word static pod from here: http://kubernetes.io/docs/admin/static-pods/

davidopp commented 7 years ago

Yeah I'm not sure whether NoAdmit would affect static pods. The intended use case was a regular pod that sets NodeName directly. Static pods are a bit different as they are created from a manifest file (and in theory by HTTP but nobody does that).

aveshagarwal commented 7 years ago

I have copied the list from https://github.com/kubernetes/kubernetes/issues/35518 here for taints/tolerations too :

  1. Remove the existing implementation in annotations.
  2. Implement existing stuff in api fields (xref: https://github.com/kubernetes/kubernetes/issues/34508).
  3. Port tests.

As we discussed in yesterday's sig-sched meeting, I will start looking into it once the existing PRs are in better/working shape probably for 1.6 time frame.

Once functionality has reached parity with the annotation implement, promote from alpha->beta.

@kubernetes/sig-scheduling thoughts?

timothysc commented 7 years ago

@aveshagarwal I'd like to sync sometime on your test plan(s) around this. Right now everything is lite at best.

but +1 to the plan.

Quentin-M commented 7 years ago

We have an use-case where, we'd like a pod managed by a DaemonSet to detect when the underlying node is being put in maintenance mode. Currently, this means kubectl cordon or kubectl drain are ran and set the NodeStatus.NodeSpec.Unschedulable field. But because the pod is part of a DaemonSet, it is not destroyed and there is no way to differentiate between the two operations using the API. However, it actually has quite some importance to us because in the case where the node is drained, it shows the operator's intent to shutdown/restart the node (or realize some other heavy duty task), in which case we want to gracefully migrate our workload/tasks/resources to another working pod/node. In the other hand, if the node is simply cordoned, the risk is limited (pods restarting is fine) and it is not worth migrating.

Unfortunately, as mentioned above, we currently can't tell and therefore, we are forced to migrate everything as soon as the node is set as unschedulable, even if it's a very short maintenance/outage.

It seems that the NoScheduleNoAdmit and NoExecute taints fit the use-case. However, I didn't clearly see any mention about how kubectl cordon and kubectl drain will be reworked to adopt these. Intuitively, I'd say that kubectl cordon would set NoScheduleNoAdmit while kubectl drain would set NoScheduleNoAdmit + NoExecute, without actively deleting any pods but instead letting the schedule do it. Tolerations and Forgiveness could then be used to make my DaemonSet's pods stop and execute the necessary migration steps using the preStop hook - While at the same time, DaemonSets such as kube-proxy would tolerate these taints and keep running indefinitely.

Does it make sense? Am I missing anything? I believe these are planned for 1.6?

aveshagarwal commented 7 years ago

@Quentin-M yes the current plan for NoScheduleNoAdmit and NoExecute implementation is 1.6. Here are related PRs: https://github.com/kubernetes/kubernetes/pull/34825 , https://github.com/kubernetes/kubernetes/pull/34400 .

davidopp commented 7 years ago

Each pod in a DaemonSet is supposed to be tied to the node it is running on. There shouldn't be any need to migrate anything when the machine goes down. So it sounds like you are using DaemonSet in an unexpected way.

But anyway, what you and @aveshagarwal is indeed the plan; we will switch cordon/uncordon/drain to use taints, and then DaemonSet pods can decide for themselves whether they want to be deleted when these actions happen. We were hoping to get this into 1.5 but it didn't make it, but it should be in 1.6. #29178 covers the "represent unschedulable as a taint" part; i don't think we have an issue filed for making kubectl drain use taints for eviection but we can just leave your comment here about it as a reminder.

davidopp commented 7 years ago

Actually I don't think we can use taints for kubectl drain, because we want to respect PodDisruptionBudget when doing the evictions for a drain, and eviction due to NoExecute taint presumably won't. I could imagine some scheme where the NoExecute taint could specify whether the eviction should respect PDB (go through pod /eviction subresource) or not, but this seems complicated. So I don't think eviction for taint will respect PDB. If we find some important use cases where we do need it, then we could do something like what I described, but it's complication we should avoid unless it's truly necessary.

davidopp commented 7 years ago

We should consider whether to add taint/toleration check to GeneralPredicates (so Kubelet checks them) when moving to Beta.

sdminonne commented 7 years ago

/cc

davidopp commented 7 years ago

We need to write documentation. The documentation should emphasize that this is not really intended to be used by users directly, but by higher-level tools that would apply the appropriate taints and configure an admission controller to add the appropriate tolerations. We should also mention that you need to pair a toleration with a node affinity in some scenarios.

davidopp commented 7 years ago

Once we think the work is done, we need to verify that no instances remain in the codebase of the strings "TolerationsAnnotationKey", "api.TolerationsAnnotationKey", "scheduler.alpha.kubernetes.io/tolerations", "TaintsAnnotationKey", "api.TaintsAnnotationKey", "scheduler.alpha.kubernetes.io/taints"

aveshagarwal commented 7 years ago

@davidopp Yeah sure. Though this PR https://github.com/kubernetes/kubernetes/pull/38957 has already removed them. So I will update this PR after forgiveness related work is merged.

davidopp commented 7 years ago

ref/ #41172 #39469 #39914 #40355 #41014 #41068 #41073 #41133

davidopp commented 7 years ago

Remaining work to move taints & tolerations to Beta:

Technically the last two bullets are not necessary for moving these features to Beta but it is closely related and #41133 is a lot of churn so it will be easier to get that in if we wait with #38957 until after it goes in.

Upgrading to 1.6 will break existing users of these features and they will have to switch to fields, but this within the bounds of our policy for alpha features. Information about upgrading has been sent to kubernetes-dev (here) and kubernetes-users (here) and will also appear in the release notes.

resouer commented 7 years ago

@davidopp Should I now begin to continue the implementation TaintEffectNoScheduleNoAdmit in #34400? Since we are aiming it at 1.6 ref https://github.com/kubernetes/kubernetes/issues/25320#issuecomment-261571143

If so, I am going to update #34400 very quickly.

davidopp commented 7 years ago

@resouer I don't think we should do NoScheduleNoAdmit for 1.6. It is too much changing in one release and I have not heard any strong requests for it from customers (of Google or any of our partners). It is definitely useful for some scenarios (e.g. "protect" the kubelet when pod is not scheduled by default scheduler) but I think the risk outweighs the benefit for putting it into 1.6. I think it would be better to target 1.7.

If there has been strong request from customers you are aware of, please let me know and we could reconsider. But I am already nervous about the amount of scheduling stuff that we are enabling in 1.6.

resouer commented 7 years ago

@davidopp That's OK, just read this comment and why I'm asking:

@Quentin-M yes the current plan for NoScheduleNoAdmit and NoExecute implementation is 1.6. Here are related PRs: #34825 , #34400 .

Let's keep our plan as it is.

davidopp commented 7 years ago

Yeah, I am not sure the background for that comment. Maybe at one point we said we were going to do it for 1.6 (I can't remember), but I think that in light of all of the other changes it would be safer to delay it to 1.7. Sorry about that.

gmarek commented 7 years ago

There's a subbullet for the last checkbox:

davidopp commented 7 years ago

Oh yes -- thanks for remembering that! Will you have time to do that? Or do you think @kevin-wangzefeng could do it? (He has written an admission controller before.)