kubernetes / enhancements

Enhancements tracking repo for Kubernetes

PodTopologySpread DoNotSchedule-to-ScheduleAnyway fallback mode #3990

Open sanposhiho opened 1 year ago

sanposhiho commented 1 year ago

Enhancement Description

Please keep this description up to date. This will help the Enhancement Team to track the evolution of the enhancement efficiently.

/sig scheduling /assign

sanposhiho commented 1 year ago

/sig autoscaling

knelasevero commented 1 year ago

@sanposhiho do you have an ETA for when the KEP will be up? Any way we could help?

/cc @a7i

sanposhiho commented 1 year ago

I recently started a draft locally, but either way, as written at the top, we (sig-scheduling) don't plan to have this enhancement in v1.28; it will be in v1.29 at the earliest. You can help us improve the design once I've created the PR for the KEP.

sanposhiho commented 1 year ago

@ahg-g @alculquicondor @Huang-Wei Can we target this towards v1.29? I'll write the KEP + take the implementation part, but I'm looking for main reviewer(s) who can work on it in the next release cycle.

alculquicondor commented 1 year ago

I'm in favor, but make sure you also have a reviewer from sig-autoscaling

sanposhiho commented 1 year ago

Thanks @alculquicondor.

@gjtempleton @mwielgus Hello. Could either of you help us as a reviewer on this enhancement? The cluster autoscaler is relevant to this enhancement: CA tells the scheduler that it cannot bring any new Nodes in, and the scheduler decides to fall back. (See https://github.com/kubernetes/kubernetes/issues/105977 for more details)

MaciekPytel commented 1 year ago

Hi, I've been thinking about this for a long time and as a top-level owner of Cluster Autoscaler I'd be happy to get involved.

That being said - I'd like to start the discussion with a possibly very controversial question: is scheduling the best layer to address zonal spreading?


Problem statement

Today I would recommend against anyone using PodTopologySpreading on a zonal topology in a cloud environment:

My understanding is that the goal of this proposal is to address the problems with whenUnsatisfiable: DoNotSchedule spreading that I described above. Am I correct?
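For concreteness, the kind of setup I mean is a zonal spread with whenUnsatisfiable: DoNotSchedule, roughly like the sketch below (pod name, labels, and image are illustrative only):

```yaml
# Illustrative sketch of a zonal hard-spread constraint (DoNotSchedule):
# if the max skew across zones cannot be satisfied, the pod stays Pending.
apiVersion: v1
kind: Pod
metadata:
  name: my-app-replica      # hypothetical name
  labels:
    app: my-app             # hypothetical label
spec:
  topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app: my-app
  containers:
  - name: app
    image: registry.k8s.io/pause:3.9   # placeholder image
```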


Challenges

I'd love to see this solved, but I don't have any good ideas on how to do it. A timeout-based approach seems very fragile, as node provisioning times vary widely between clouds, and even within a single cloud they may differ a lot based on the type of hardware being used (ex. nodes with GPUs often take more time to start up and initialize). And what happens when preferred instance types are unavailable (ex. stockout) and the autoscaler needs to fall back to a different type? That would add anywhere from a few seconds to 15+ minutes of extra latency depending on the cloud and the exact reason for the node creation failure. We can say that it's up to the user to set the right timeout, but how many users will be able to set the right timeout in practice? I've been working on Cluster Autoscaler in OSS and GKE for >6 years and I have no idea what a good timeout value would be on GCP, much less in any other cloud.

An alternative would be some sort of communication mechanism between Autoscaler and Scheduler, but for that I think we should have an idea of how to support this in Autoscaler: how do we make it aware that it should update the pod at all? Today Autoscaler just imports scheduler code and runs PreFilters/Filters, without any understanding of what they actually check. How would it know that a pod is unschedulable because of TopologySpreading and not some other constraint? And I mean the question in the sense of how much of Autoscaler we would have to refactor, not just a high-level conceptual answer.


Alternatives that may be worth discussing

Finally, even if we solve all of those issues - scheduling pods in a way that respects topology spreading constraint still wouldn't guarantee equal spreading between zones. If one zone is temporarily stocked-out, the scheduler will not be able to restore the spreading after the instances become available again.

That brings me to the controversial question: wouldn't it be better to solve this at the Deployment/ReplicaSet/StatefulSet/etc. level instead? Any such controller could target pods to specific zones (ex. by setting nodeSelector for different zones) and it could continuously reconcile the number of pods in each zone. This would also address the problem of timeouts: we could fall back to a different zone after a relatively short timeout, knowing that we can always restore the spreading as soon as the capacity becomes available.

This is the approach taken by the https://github.com/kubernetes/autoscaler/blob/master/balancer/proposals/balancer.md proposal. I'm not sure the balancer is the best way to implement this either; I'm not as familiar with whatever challenges this approach may be facing. But I think it would be good to start the discussion by agreeing on what problems we're trying to solve and evaluating which component could best solve them, before jumping into any particular implementation.

sanposhiho commented 1 year ago

Thanks @MaciekPytel for getting involved!

I believe we should continue to discuss the detailed design in the KEP PR instead of here, but let me roughly answer your questions.

In the issue, we're considering adding new whenUnsatisfiable values, DoNotSchedule_BestEffort and PreemptOrSchedule (the names can be officially decided in the KEP). DoNotSchedule_BestEffort is the one CA is related to. With DoNotSchedule_BestEffort, the scheduler keeps trying to schedule the Pod as if the TopologySpread were DoNotSchedule. But when the scheduler somehow knows that CA cannot create a Node for this Pod, it falls back to ScheduleAnyway so that the Pod gets scheduled without being blocked by PodTopologySpread.
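To make the shape of the API concrete, the constraint we're discussing would look roughly like this (the value name is not final and does not exist today; this is just a sketch):

```yaml
# Sketch only: DoNotSchedule_BestEffort is the proposed fallback mode, not an
# existing API value. Labels are illustrative.
topologySpreadConstraints:
- maxSkew: 1
  topologyKey: topology.kubernetes.io/zone
  # Behaves like DoNotSchedule until the scheduler learns that no new Node can
  # be brought up for the Pod, then falls back to ScheduleAnyway.
  whenUnsatisfiable: DoNotSchedule_BestEffort
  labelSelector:
    matchLabels:
      app: my-app
```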

Next, it comes down to how to know that CA cannot create a Node for the Pod. We discussed two options:

In the issue, we more or less concluded that we prefer option 2, a new Pod condition. So, CA's responsibility is to set the new Pod condition TriggeredScaleUp to false. The reason was, as you said, that it's difficult to set an appropriate timeout. (ref)
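Roughly, an unschedulable Pod that CA has given up on would then carry conditions like the sketch below (the new condition's exact name, reason, and fields are to be settled in the KEP):

```yaml
# Sketch only: Pod status after the cluster autoscaler has decided it cannot
# provision a Node for this Pending Pod. TriggeredScaleUp is the proposed new
# condition; PodScheduled/Unschedulable is the existing one set by kube-scheduler.
status:
  phase: Pending
  conditions:
  - type: PodScheduled
    status: "False"
    reason: Unschedulable
  - type: TriggeredScaleUp      # proposed condition, set by the cluster autoscaler
    status: "False"
    reason: NotTriggeredScaleUp # hypothetical reason string
```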

That's the current status of our discussion, and I'm going to create the KEP based on it. So, answering your questions:

My understanding is that the goal of this proposal is to address the problems with whenUnsatisfiable: DoNotSchedule spreading that I described above. Am I correct?

Yes. Exactly correct.

An alternative would be some sort of communication mechanism between Autoscaler and Scheduler

And yes, that's exactly what we concluded (at least in the issue).

How would it know that a pod is unschedulable because of TopologySpreading and not some other constraint?

I believe that, to keep it simple, CA doesn't need to do anything special for TopologySpread. It only needs to set the condition when it cannot create a node for an unschedulable Pod, regardless of the reason. And when the scheduler sees that condition, it falls back to ScheduleAnyway in any case. If the unschedulability isn't caused by TopologySpread, the Pod should still be unschedulable even after the fallback.

If one zone is temporarily stocked-out, the scheduler will not be able to restore the spreading after the instances become available again.

Such "rescheduling", "rebalancing" is the responsibility of descheduler, not the scheduler. So, we don't need to concern much about the rebalancing in the scheduler.

MaciekPytel commented 1 year ago

I believe we should continue to discuss the detailed design in the KEP PR instead of here

Sounds good to me. Please tag me on the PR, and also please feel free to ping me on Slack if you want to discuss any CA-related parts.

sanposhiho commented 11 months ago

@alculquicondor @MaciekPytel Can we target this enhancement to v1.29? (based on your bandwidth?)

sanposhiho commented 11 months ago

We skipped the v1.29 release. Hopefully we can aim for the next one. We'll discuss it in an upcoming SIG-Autoscaling meeting (Oct 23rd).

tzneal commented 11 months ago

This handles one part of the scheduling problem (an autoscaler is unable to launch new capacity) but doesn't handle the case where an autoscaler launches new capacity, but it's degraded in some way (e.g. node is ready, but all pods that schedule to the new node fail due to some other sort of issue affecting the topology domain). Has there been any thought on allowing scheduling restrictions to be overridden during gray failures?

When thinking on it, I was considering a CRD that a user could create to indicate to the scheduler/autoscaler/anyone else that a particular topology domain is now invalid and shouldn't count for topology spread purposes, and no pods should be scheduled to that domain. Autoscalers could read the same CRD and avoid attempting to scale up nodes in that domain as well.
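To make the idea concrete, such a CRD instance could look roughly like this (entirely hypothetical; neither the group/kind nor the fields exist anywhere today):

```yaml
# Hypothetical CRD instance, only to illustrate the idea above.
apiVersion: topology.example.io/v1alpha1
kind: TopologyDomainStatus
metadata:
  name: zone-us-east-1a-outage
spec:
  topologyKey: topology.kubernetes.io/zone
  value: us-east-1a
  # Consumers (scheduler, autoscalers, ...) would treat this domain as invalid:
  # don't count it for topology spread, don't schedule to it, don't scale it up.
  state: Unavailable
  reason: GrayFailure
```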

There are a few other advantages to being able to imperatively indicate to multiple consumers that a topology domain is bad:

ellistarn commented 11 months ago

Replicaset Spread

That brings me to the controversial question: wouldn't it be better to solve this at the Deployment/ReplicaSet/StatefulSet/etc. level instead? Any such controller could target pods to specific zones (ex. by setting nodeSelector for different zones) and it could continuously reconcile the number of pods in each zone. This would also address the problem of timeouts: we could fall back to a different zone after a relatively short timeout, knowing that we can always restore the spreading as soon as the capacity becomes available.

I quite like this line of thinking. To add to it (though you may be implying this already), it could continue to be part of the podtemplatespec, but would result in the replicaset controller applying additional corresponding nodeSelectors to the physical pods it creates. You could even re-use the topology spread constraints API surface, and just shift the spread responsibilities to the replicaset controller instead of the scheduler.

To be explicit:

  1. user creates a deployment, w/ topologyspreadconstraints in the podtemplatespec
  2. deployment creates a replicaset w/ the full podtemplatespec
  3. replicaset controller applies additional nodeSelectors to each pod, in an effort to enforce spread
  4. replicaset controller could continually reconcile pods to maintain a balance
  5. kube-scheduler (and autoscalers) simply don't need to think about topologyspreadconstraints
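Concretely, step 3 could mean the replicaset controller stamping out pods like the sketch below, where the controller (not the scheduler) picks and reconciles the zone for each replica (all names and values are illustrative):

```yaml
# Sketch of a pod as the replicaset controller might create it in step 3:
# the zone is pinned via nodeSelector, chosen and rebalanced by the controller.
apiVersion: v1
kind: Pod
metadata:
  generateName: my-app-7d9f8-   # hypothetical
  labels:
    app: my-app
spec:
  nodeSelector:
    topology.kubernetes.io/zone: us-east-1b   # chosen by the controller, not the user
  containers:
  - name: app
    image: registry.k8s.io/pause:3.9          # placeholder image
```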

Scheduler / Autoscaler coordination

An alternative would be some sort of communication mechanism between Autoscaler and Scheduler,

I've often wished that the scheduler and autoscaler were the same component, as it unlocks the ability to make and enforce decisions in the same component, avoiding race conditions like the ones this KEP attempts to address. Of course, there are ways to communicate these decisions between systems, but communication protocols are hard (Kube API Server objects, or otherwise). This is most likely a dead end, given where Kubernetes is today, but given that @MaciekPytel is opening up controversial questions, I figure I might throw this one into the ring ;)

Topology API Object

When thinking on it, I was considering a CRD that a user could create to indicate to the scheduler that a particular topology domain is now invalid and shouldn't count for topology spread purposes, and no pods should be scheduled to that domain. Autoscalers could read the same CRD and avoid attempting to scale up nodes in that domain as well.

This would be very useful to achieve usecases like "disable this AZ while we ride out this outage" https://github.com/aws/karpenter/issues/4727

sanposhiho commented 11 months ago

node is ready, but all pods that schedule to the new node fail due to some other sort of issue affecting the topology domain

The cluster autoscaler should take all scheduling constraints into consideration when it does its simulation. If it does so properly, that kind of scenario (a Node is created, but the unschedulable Pod actually cannot get scheduled to that Node) shouldn't happen. Or, in some edge cases (like a higher-priority Pod being created very recently, after CA's reconciliation), the Pod may still be unschedulable after CA's first reconciliation. But that should be resolved in CA's next reconciliation.

the scheduler/autoscaler/anyone else that a particular topology domain is now invalid

Tainting the Nodes in such a domain looks sufficient to me. Or do you have any argument that taints cannot play that role? Topology Spread takes taints into consideration (ref).

An alternative would be some sort of communication mechanism between Autoscaler and Scheduler,

I've often wished...

That "communication mechanism" is the current design of this KEP. We give a new condition to the Pod, the cluster autoscaler gives TriggeredScaleUp: false to the Pod condition when they cannot get a Node for that unsched Pod (due to stockout etc), and the scheduler does the fallback if the Pod has that condition. Please check out the current draft ↓, and let's bring technical discussion there if any. https://github.com/kubernetes/enhancements/pull/4150


So, it's simplest, and works well, for "unschedulable Pods" to always be the medium of communication between kube-scheduler and the cluster autoscaler. Introducing another CRD or something similar would make things complicated. That's why I'm proposing the current design: introduce a new condition on the Pod with which the cluster autoscaler can tell the scheduler that it couldn't make any Node for the unschedulable Pod. In other words, putting the PodScheduled: false condition on the Pod is the request from the scheduler to the CA, and the new condition on the Pod can be considered the response from the CA.

tzneal commented 11 months ago

node is ready, but all pods that schedule to the new node fail due to some other sort of issue affecting the topology domain

That's not the situation I'm describing. The node can become ready, pods can schedule correctly, but fail to start due to some underlying failure particular to the topology domain, one that is sufficient to break workloads but still allows nodes to launch and go Ready. In that case, this proposal doesn't help, as the autoscaler can happily continue to create nodes which appear functional but are not.

Tainting the Nodes in such a domain looks sufficient to me. Or do you have any argument that taints cannot play that role? Topology Spread takes taints into consideration (https://github.com/kubernetes/enhancements/issues/3094).

For the gray failure situation, I need to:

There's no common method I can use to inform every interested party that "for right now topology domain X is bad, change your decision making accordingly".

sanposhiho commented 11 months ago

I'm not sure this is the right place to talk about your story, then. This enhancement is about how we do the fallback when scheduling keeps failing due to a required topology spread. You are talking about a domain failure, which is invisible to the scheduler (unless we add taints manually). I interpret your challenge as how to make such a failure visible to components (scheduler, CA, etc.).

Can you create another issue in k/k to discuss your user story there? Then you can elaborate on your story more in that issue.

I'm wondering whether, in the stories you have in mind, it's possible to automatically detect such a situation. If possible,

would be the only necessary improvements.

tzneal commented 11 months ago

when scheduling keeps failing due to a required topology spread

In my thinking, that is a subset of the larger problem "a topology domain is no longer viable in some way". Solving that one would solve the "inability to schedule", while also handling the "can schedule, but it won't work if it does".

I'm wondering whether, in the stories you have in mind, it's possible to automatically detect such a situation.

It could be automated by some other decoupled component, e.g. your cloud provider sends a notification that a controller receives and then creates a "zone-A is invalid" object which every interested party consumes.

sanposhiho commented 11 months ago

Please note that the scheduler's responsibility is to schedule pods based on visible status.

The node can become ready, pods can schedule correctly, but fail to start due to some underlying failure particular to the topology domain

So, what you are saying is that there's a lack of visibility for the scheduler. If the scheduler could understand that something is wrong in the domain that would block pod startup, the pods would not go to that domain. That visibility improvement might be achieved with taints alone like I said, or with a new CRD like you said, etc.; we don't know yet.

But this KEP tries to do the fallback to solve the problems described in the draft KEP. That's it. It doesn't try to strengthen the ability to notice domain failures; that is completely out of scope.

That's why I want to distinguish this KEP from the problem you have.

So, sorry again, could you please create another issue in k/k with a specific case in which ↓ could happen? I still don't get the exact scenario you have in mind. We can discuss what we need to improve based on that in the new issue, not here.

The node can become ready, pods can schedule correctly, but fail to start due to some underlying failure particular to the topology domain

tzneal commented 11 months ago

Please note that the scheduler's responsibility is to schedule pods based on visible status.

Yes, I think a proposal to have a node autoscaler make its inability to launch a node visible to the scheduler, so the scheduler can apply scheduling rules differently, is similar to but not as expressive as directly marking a topology domain as invalid for the scheduler and other consumers.

The node can become ready, pods can schedule correctly, but fail to start due to some underlying failure particular to the topology domain

So, what you are saying is that there's a lack of visibility for the scheduler. If the scheduler could understand that something is wrong in the domain that would block pod startup, the pods would not go to that domain. That visibility improvement might be achieved with taints alone like I said, or with a new CRD like you said, etc.; we don't know yet.

It's a superset of the "autoscaler can't launch a node" problem, and I think it's a more common issue. You could also solve "autoscaler can't launch a node" by tainting all of the nodes in the problem domain and using nodeTaintsPolicy=Honor on the topology spread constraints.
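For example, that workaround could look roughly like this (the taint key is made up; nodeTaintsPolicy is an existing topology spread field):

```yaml
# Illustrative workaround sketch.
# 1) Taint every node in the problem domain, e.g.:
#      kubectl taint nodes -l topology.kubernetes.io/zone=us-east-1a \
#        example.com/zone-degraded=true:NoSchedule
# 2) Tell topology spread to honor taints, so the tainted domain no longer
#    counts toward skew and no pods land there:
topologySpreadConstraints:
- maxSkew: 1
  topologyKey: topology.kubernetes.io/zone
  whenUnsatisfiable: DoNotSchedule
  nodeTaintsPolicy: Honor   # exclude nodes whose taints the incoming pod doesn't tolerate
  labelSelector:
    matchLabels:
      app: my-app
```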

But this KEP tries to do the fallback to solve the problems described in the draft KEP. That's it. It doesn't try to strengthen the ability to notice domain failures; that is completely out of scope.

That's why I want to distinguish this KEP from the problem you have.

To be clear, I'm not arguing for noticing domain failures. I just want a mechanism for users to be able to handle them without updating all of their workloads with new node affinities to avoid the problem domain. My argument is to push the KEP towards solving the larger problem.

So, sorry again, could you please create another issue in k/k with a specific case in which ↓ could happen? I still don't get the exact scenario you have in mind. We can discuss what we need to improve based on that in the new issue, not here.

The node can become ready, pods can schedule correctly, but fail to start due to some underlying failure particular to the topology domain

sanposhiho commented 11 months ago

It's too vague to discuss here. Could you create a new draft KEP PR as an alternative solution, then? You can associate it with this KEP number and, for now, you don't need to fill in all parts, only the core parts describing the design. Then we can compare the two draft PRs based on them.

k8s-triage-robot commented 7 months ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

alculquicondor commented 7 months ago

@sanposhiho are you still pushing for this in 1.30?

sanposhiho commented 7 months ago

I'm working on the investigation on the CA side, but v1.30 is nearly impossible.

k8s-triage-robot commented 6 months ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

sanposhiho commented 6 months ago

/remove-lifecycle rotten

k8s-triage-robot commented 3 months ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

sanposhiho commented 3 months ago

/remove-lifecycle stale

k8s-triage-robot commented 2 weeks ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

sanposhiho commented 2 weeks ago

/remove-lifecycle stale