kubernetes / enhancements

Enhancements tracking repo for Kubernetes

PodTopologySpread DoNotSchedule-to-ScheduleAnyway fallback mode #3990

Open sanposhiho opened 1 year ago

sanposhiho commented 1 year ago

Enhancement Description

Please keep this description up to date. This will help the Enhancement Team to track the evolution of the enhancement efficiently.

/sig scheduling /assign

sanposhiho commented 1 year ago

/sig autoscaling

knelasevero commented 1 year ago

@sanposhiho do you have an ETA for when the KEP will be up? Any way we could help?

/cc @a7i

sanposhiho commented 1 year ago

I recently started a draft locally, but either way, as written at the top, we (sig-scheduling) don't plan to have this enhancement in v1.28; it will be in v1.29 at the earliest. You can help us improve the design once I've created the PR for the KEP.

sanposhiho commented 1 year ago

@ahg-g @alculquicondor @Huang-Wei Can we target this towards v1.29? I'll write the KEP + take the implementation part, but I'm looking for main reviewer(s) who can work on it in the next release cycle.

alculquicondor commented 1 year ago

I'm in favor, but make sure you also have a reviewer from sig-autoscaling

sanposhiho commented 1 year ago

Thanks @alculquicondor.

@gjtempleton @mwielgus Hello. Could either of you help us as a reviewer on this enhancement? The cluster autoscaler is relevant to this enhancement: CA tells the scheduler that it cannot bring any new Nodes in, and the scheduler decides to fall back. (See https://github.com/kubernetes/kubernetes/issues/105977 for more details)

MaciekPytel commented 1 year ago

Hi, I've been thinking about this for a long time and as a top-level owner of Cluster Autoscaler I'd be happy to get involved.

That being said - I'd like to start the discussion with a possibly very controversial question: is scheduling the best layer to address zonal spreading?


Problem statement

Today I would recommend against anyone using PodTopologySpreading on a zonal topology in a cloud environment:

My understanding is that the goal of this proposal is to address the problems with whenUnsatisfiable: DoNotSchedule spreading that I described above. Am I correct?
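For concreteness, the kind of setup I mean is a zonal spread with whenUnsatisfiable: DoNotSchedule, roughly like the sketch below (pod name, labels, and image are illustrative only):

```yaml
# Illustrative sketch of a zonal hard-spread constraint (DoNotSchedule):
# if the max skew across zones cannot be satisfied, the pod stays Pending.
apiVersion: v1
kind: Pod
metadata:
  name: my-app-replica      # hypothetical name
  labels:
    app: my-app             # hypothetical label
spec:
  topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app: my-app
  containers:
  - name: app
    image: registry.k8s.io/pause:3.9   # placeholder image
```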


Challenges

I'd love to see this solved, but I don't have any good ideas on how to do it. A timeout-based approach seems very fragile, as node provisioning times vary widely between clouds, and even within a single cloud they may differ a lot based on the type of hardware being used (ex. nodes with GPUs often take more time to start up and initialize). And what happens when preferred instance types are unavailable (ex. stockout) and the autoscaler needs to fall back to a different type? That would add anywhere from a few seconds to 15+ minutes of extra latency depending on the cloud and the exact reason for the node creation failure. We can say that it's up to the user to set the right timeout, but how many users will be able to set the right timeout in practice? I've been working on Cluster Autoscaler in OSS and GKE for >6 years and I have no idea what a good timeout value would be on GCP, much less in any other cloud.

An alternative would be some sort of communication mechanism between Autoscaler and Scheduler, but for that I think we should have an idea of how to support this in Autoscaler: how do we make it aware that it should update the pod at all? Today Autoscaler just imports scheduler code and runs PreFilters/Filters, without any understanding of what they actually check. How would it know that a pod is unschedulable because of TopologySpreading and not some other constraint? And I mean the question in the sense of how much of Autoscaler we would have to refactor, not just a high-level conceptual answer.


Alternatives that may be worth discussing

Finally, even if we solve all of those issues - scheduling pods in a way that respects topology spreading constraint still wouldn't guarantee equal spreading between zones. If one zone is temporarily stocked-out, the scheduler will not be able to restore the spreading after the instances become available again.

That brings me to the controversial question: wouldn't it be better to solve this at the Deployment/ReplicaSet/StatefulSet/etc. level instead? Any such controller could target pods to specific zones (ex. by setting nodeSelector for different zones) and it could continuously reconcile the number of pods in each zone. This would also address the problem of timeouts: we could fall back to a different zone after a relatively short timeout, knowing that we can always restore the spreading as soon as the capacity becomes available.

This is the approach taken by the https://github.com/kubernetes/autoscaler/blob/master/balancer/proposals/balancer.md proposal. I'm not sure the balancer is the best way to implement this either; I'm not as familiar with whatever challenges this approach may be facing. But I think it would be good to start the discussion by agreeing on what problems we're trying to solve and evaluating which component could best solve them, before jumping into any particular implementation.

sanposhiho commented 1 year ago

Thanks @MaciekPytel for getting involved!

I believe we should continue to discuss the detailed design in the KEP PR instead of here, but let me roughly answer your questions.

In the issue, we're considering adding new whenUnsatisfiable values, DoNotSchedule_BestEffort and PreemptOrSchedule (the names can be officially decided in the KEP). DoNotSchedule_BestEffort is the one CA is related to. With DoNotSchedule_BestEffort, the scheduler keeps trying to schedule the Pod as if the TopologySpread were DoNotSchedule. But when the scheduler somehow knows that CA cannot create a Node for this Pod, it falls back to ScheduleAnyway so that the Pod gets scheduled without being blocked by PodTopologySpread.
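To make the shape of the API concrete, the constraint we're discussing would look roughly like this (the value name is not final and does not exist today; this is just a sketch):

```yaml
# Sketch only: DoNotSchedule_BestEffort is the proposed fallback mode, not an
# existing API value. Labels are illustrative.
topologySpreadConstraints:
- maxSkew: 1
  topologyKey: topology.kubernetes.io/zone
  # Behaves like DoNotSchedule until the scheduler learns that no new Node can
  # be brought up for the Pod, then falls back to ScheduleAnyway.
  whenUnsatisfiable: DoNotSchedule_BestEffort
  labelSelector:
    matchLabels:
      app: my-app
```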

Next, it comes down to how to know that CA cannot create a Node for the Pod. We discussed two options:

In the issue, we more or less concluded that we prefer option 2, a new Pod condition. So, CA's responsibility is to set the new Pod condition TriggeredScaleUp to false. The reason was, as you said, that it's difficult to set an appropriate timeout. (ref)
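Roughly, an unschedulable Pod that CA has given up on would then carry conditions like the sketch below (the new condition's exact name, reason, and fields are to be settled in the KEP):

```yaml
# Sketch only: Pod status after the cluster autoscaler has decided it cannot
# provision a Node for this Pending Pod. TriggeredScaleUp is the proposed new
# condition; PodScheduled/Unschedulable is the existing one set by kube-scheduler.
status:
  phase: Pending
  conditions:
  - type: PodScheduled
    status: "False"
    reason: Unschedulable
  - type: TriggeredScaleUp      # proposed condition, set by the cluster autoscaler
    status: "False"
    reason: NotTriggeredScaleUp # hypothetical reason string
```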

That's the current status of our discussion, and I'm going to create the KEP based on it. So, answering your questions:

My understanding is that the goal of this proposal is to address the problems with whenUnsatisfiable: DoNotSchedule spreading that I described above. Am I correct?

Yes. Exactly correct.

An alternative would be some sort of communication mechanism between Autoscaler and Scheduler

And yes, that's exactly what we concluded (at least in the issue).

How would it know that a pod is unschedulable because of TopologySpreading and not some other constraint?

I believe that, to keep it simple, CA doesn't need to do anything special for TopologySpread. It only needs to set the condition when it cannot create a node for an unschedulable Pod, regardless of the reason. And when the scheduler sees that condition, it falls back to ScheduleAnyway in any case. If the unschedulability isn't caused by TopologySpread, the Pod should still be unschedulable even after the fallback.

If one zone is temporarily stocked-out, the scheduler will not be able to restore the spreading after the instances become available again.

Such "rescheduling", "rebalancing" is the responsibility of descheduler, not the scheduler. So, we don't need to concern much about the rebalancing in the scheduler.

MaciekPytel commented 1 year ago

I believe we should continue to discuss the detailed design in the KEP PR instead of here

Sounds good to me. Please tag me on the PR, and also please feel free to ping me on Slack if you want to discuss any CA-related parts.

sanposhiho commented 11 months ago

@alculquicondor @MaciekPytel Can we target this enhancement to v1.29? (based on your bandwidth?)

sanposhiho commented 11 months ago

We skipped the v1.29 release. Hopefully we can aim for the next one. We'll discuss it in an upcoming SIG-Autoscaling meeting (Oct 23rd).

tzneal commented 11 months ago

This handles one part of the scheduling problem (an autoscaler is unable to launch new capacity) but doesn't handle the case where an autoscaler launches new capacity, but it's degraded in some way (e.g. node is ready, but all pods that schedule to the new node fail due to some other sort of issue affecting the topology domain). Has there been any thought on allowing scheduling restrictions to be overridden during gray failures?

When thinking on it, I was considering a CRD that a user could create to indicate to the scheduler/autoscaler/anyone else that a particular topology domain is now invalid and shouldn't count for topology spread purposes, and no pods should be scheduled to that domain. Autoscalers could read the same CRD and avoid attempting to scale up nodes in that domain as well.
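To make the idea concrete, such a CRD instance could look roughly like this (entirely hypothetical; neither the group/kind nor the fields exist anywhere today):

```yaml
# Hypothetical CRD instance, only to illustrate the idea above.
apiVersion: topology.example.io/v1alpha1
kind: TopologyDomainStatus
metadata:
  name: zone-us-east-1a-outage
spec:
  topologyKey: topology.kubernetes.io/zone
  value: us-east-1a
  # Consumers (scheduler, autoscalers, ...) would treat this domain as invalid:
  # don't count it for topology spread, don't schedule to it, don't scale it up.
  state: Unavailable
  reason: GrayFailure
```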

There are a few other advantages to being able to imperatively indicate to multiple consumers that a topology domain is bad:

ellistarn commented 11 months ago

Replicaset Spread

That brings me to the controversial question: wouldn't it be better to solve this at the Deployment/ReplicaSet/StatefulSet/etc. level instead? Any such controller could target pods to specific zones (ex. by setting nodeSelector for different zones) and it could continuously reconcile the number of pods in each zone. This would also address the problem of timeouts: we could fall back to a different zone after a relatively short timeout, knowing that we can always restore the spreading as soon as the capacity becomes available.

I quite like this line of thinking. To add to it (though you may be implying this already), it could continue to be part of the podtemplatespec, but would result in the replicaset controller applying additional corresponding nodeSelectors to the physical pods it creates. You could even re-use the topology spread constraints API surface, and just shift the spread responsibilities to the replicaset controller instead of the scheduler.

To be explicit:

  1. user creates a deployment, w/ topologyspreadconstraints in the podtemplatespec
  2. deployment creates a replicaset w/ the full podtemplatespec
  3. replicaset controller applies additional nodeSelectors to each pod, in an effort to enforce spread
  4. replicaset controller could continually reconcile pods to maintain a balance
  5. kube-scheduler (and autoscalers) simply don't need to think about topologyspreadconstraints
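Concretely, step 3 could mean the replicaset controller stamping out pods like the sketch below, where the controller (not the scheduler) picks and reconciles the zone for each replica (all names and values are illustrative):

```yaml
# Sketch of a pod as the replicaset controller might create it in step 3:
# the zone is pinned via nodeSelector, chosen and rebalanced by the controller.
apiVersion: v1
kind: Pod
metadata:
  generateName: my-app-7d9f8-   # hypothetical
  labels:
    app: my-app
spec:
  nodeSelector:
    topology.kubernetes.io/zone: us-east-1b   # chosen by the controller, not the user
  containers:
  - name: app
    image: registry.k8s.io/pause:3.9          # placeholder image
```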

Scheduler / Autoscaler coordination

An alternative would be some sort of communication mechanism between Autoscaler and Scheduler,

I've often wished that the scheduler and autoscaler were the same component, as it unlocks the ability to make and enforce decisions in the same component, avoiding race conditions like the ones this KEP attempts to address. Of course, there are ways to communicate these decisions between systems, but communication protocols are hard (Kube API Server objects, or otherwise). This is most likely a dead end, given where Kubernetes is today, but given that @MaciekPytel is opening up controversial questions, I figure I might throw this one into the ring ;)

Topology API Object

When thinking on it, I was considering a CRD that a user could create to indicate to the scheduler that a particular topology domain is now invalid and shouldn't count for topology spread purposes, and no pods should be scheduled to that domain. Autoscalers could read the same CRD and avoid attempting to scale up nodes in that domain as well.

This would be very useful to achieve usecases like "disable this AZ while we ride out this outage" https://github.com/aws/karpenter/issues/4727

sanposhiho commented 11 months ago

node is ready, but all pods that schedule to the new node fail due to some other sort of issue affecting the topology domain

The cluster autoscaler should take all scheduling constraints into consideration when it does its simulation. If it does so properly, that kind of scenario (a Node is created, but the unschedulable Pod actually cannot get scheduled to that Node) shouldn't happen. Or, in some edge cases (like a higher-priority Pod being created very recently, after CA's reconciliation), the Pod may still be unschedulable after CA's first reconciliation. But that should be resolved in CA's next reconciliation.

the scheduler/autoscaler/anyone else that a particular topology domain is now invalid

Tainting the Nodes in such a domain looks sufficient to me. Or do you have any argument that taints cannot play that role? Topology Spread takes taints into consideration (ref).

An alternative would be some sort of communication mechanism between Autoscaler and Scheduler,

I've often wished...

That "communication mechanism" is the current design of this KEP. We give a new condition to the Pod, the cluster autoscaler gives TriggeredScaleUp: false to the Pod condition when they cannot get a Node for that unsched Pod (due to stockout etc), and the scheduler does the fallback if the Pod has that condition. Please check out the current draft ↓, and let's bring technical discussion there if any. https://github.com/kubernetes/enhancements/pull/4150


So, it's simplest, and works well, for "unschedulable Pods" to always be the medium of communication between kube-scheduler and the cluster autoscaler. Introducing another CRD or something similar would make things complicated. That's why I'm proposing the current design: introduce a new condition on the Pod with which the cluster autoscaler can tell the scheduler that it couldn't make any Node for the unschedulable Pod. In other words, putting the PodScheduled: false condition on the Pod is the request from the scheduler to the CA, and the new condition on the Pod can be considered the response from the CA.

tzneal commented 11 months ago

node is ready, but all pods that schedule to the new node fail due to some other sort of issue affecting the topology domain

That's not the situation I'm describing. The node can become ready, pods can schedule correctly, but fail to start due to some underlying failure particular to the topology domain, one that is sufficient to break workloads but still allows nodes to launch and go Ready. In that case, this proposal doesn't help, as the autoscaler can happily continue to create nodes which appear functional but are not.

Tainting the Nodes in such a domain looks sufficient to me. Or do you have any argument that taints cannot play that role? Topology Spread takes taints into consideration (https://github.com/kubernetes/enhancements/issues/3094).

For the gray failure situation, I need to:

There's no common method I can use to inform every interested party that "for right now topology domain X is bad, change your decision making accordingly".

sanposhiho commented 11 months ago

I'm not sure this is the right place to talk about your story, then. This enhancement is about how we do the fallback when scheduling keeps failing due to a required topology spread. You are talking about a domain failure, which is invisible to the scheduler (unless we add taints manually). I interpret your challenge as how to make such a failure visible to components (scheduler, CA, etc.).

Can you create another issue in k/k to discuss your user story there? Then you can elaborate on your story more in that issue.

I'm wondering whether, in the stories you have in mind, it's possible to automatically detect such a situation. If possible,

would be the only necessary improvements.

tzneal commented 11 months ago

when scheduling keeps failing due to a required topology spread

In my thinking, that is a subset of the larger problem "a topology domain is no longer viable in some way". Solving that one would solve the "inability to schedule", while also handling the "can schedule, but it won't work if it does".

I'm wondering whether, in the stories you have in mind, it's possible to automatically detect such a situation.

It could be automated by some other decoupled component, e.g. your cloud provider sends a notification that a controller receives and then creates a "zone-A is invalid" object which every interested party consumes.

sanposhiho commented 11 months ago

Please note that the scheduler's responsibility is to schedule pods based on visible status.

The node can become ready, pods can schedule correctly, but fail to start due to some underlying failure particular to the topology domain

So, what you are saying is that there's a lack of visibility for the scheduler. If the scheduler could understand that something is wrong in the domain that would block pod startup, the pods would not go to that domain. That visibility improvement might be achieved with taints alone like I said, or with a new CRD like you said, etc.; we don't know yet.

But this KEP tries to do the fallback to solve the problems described in the draft KEP. That's it. It doesn't try to strengthen the ability to notice domain failures; that is completely out of scope.

That's why I want to distinguish this KEP from the problem you have.

So, sorry again, could you please create another issue in k/k with a specific case in which ↓ could happen? I still don't get the exact scenario you have in mind. We can discuss what we need to improve based on that in the new issue, not here.

The node can become ready, pods can schedule correctly, but fail to start due to some underlying failure particular to the topology domain

tzneal commented 11 months ago

Please note that the scheduler's responsibility is to schedule pods based on visible status.

Yes, I think a proposal to have a node autoscaler make its inability to launch a node visible to the scheduler, so the scheduler can apply scheduling rules differently, is similar to but not as expressive as directly marking a topology domain as invalid for the scheduler and other consumers.

The node can become ready, pods can schedule correctly, but fail to start due to some underlying failure particular to the topology domain

So, what you are saying is that there's a lack of visibility for the scheduler. If the scheduler could understand that something is wrong in the domain that would block pod startup, the pods would not go to that domain. That visibility improvement might be achieved with taints alone like I said, or with a new CRD like you said, etc.; we don't know yet.

It's a superset of the "autoscaler can't launch a node" problem, and I think it's a more common issue. You could also solve "autoscaler can't launch a node" by tainting all of the nodes in the problem domain and using nodeTaintsPolicy=Honor on the topology spread constraints.
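For example, that workaround could look roughly like this (the taint key is made up; nodeTaintsPolicy is an existing topology spread field):

```yaml
# Illustrative workaround sketch.
# 1) Taint every node in the problem domain, e.g.:
#      kubectl taint nodes -l topology.kubernetes.io/zone=us-east-1a \
#        example.com/zone-degraded=true:NoSchedule
# 2) Tell topology spread to honor taints, so the tainted domain no longer
#    counts toward skew and no pods land there:
topologySpreadConstraints:
- maxSkew: 1
  topologyKey: topology.kubernetes.io/zone
  whenUnsatisfiable: DoNotSchedule
  nodeTaintsPolicy: Honor   # exclude nodes whose taints the incoming pod doesn't tolerate
  labelSelector:
    matchLabels:
      app: my-app
```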

But this KEP tries to do the fallback to solve the problems described in the draft KEP. That's it. It doesn't try to strengthen the ability to notice domain failures; that is completely out of scope.

That's why I want to distinguish this KEP from the problem you have.

To be clear, I'm not arguing for noticing domain failures. I just want a mechanism for users to be able to handle them without updating all of their workloads with new node affinities to avoid the problem domain. My argument is to push the KEP towards solving the larger problem.

So, sorry again, could you please create another issue in k/k with a specific case in which ↓ could happen? I still don't get the exact scenario you have in mind. We can discuss what we need to improve based on that in the new issue, not here.

The node can become ready, pods can schedule correctly, but fail to start due to some underlying failure particular to the topology domain

sanposhiho commented 11 months ago

It's too vague to discuss here. Could you create a new draft KEP PR as an alternative solution, then? You can associate it with this KEP number and, for now, you don't need to fill in all parts, only the core parts describing the design. Then we can compare the two draft PRs based on them.

k8s-triage-robot commented 7 months ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

alculquicondor commented 7 months ago

@sanposhiho are you still pushing for this in 1.30?

sanposhiho commented 7 months ago

I'm working on the investigation on the CA side, but v1.30 is nearly impossible.

k8s-triage-robot commented 6 months ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

sanposhiho commented 6 months ago

/remove-lifecycle rotten

k8s-triage-robot commented 3 months ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

sanposhiho commented 3 months ago

/remove-lifecycle stale

k8s-triage-robot commented 2 weeks ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

sanposhiho commented 2 weeks ago

/remove-lifecycle stale