kubernetes-sigs / descheduler

Descheduler for Kubernetes
https://sigs.k8s.io/descheduler
Apache License 2.0

NodeFit logic is incompatible with "includeSoftConstraints" option. #1067

Open eugenea opened 1 year ago

eugenea commented 1 year ago

What version of descheduler are you using?

descheduler version: v0.25.1

Does this issue reproduce with the latest release? Most likely

Which descheduler CLI options are you using? --policy-config-file /policy-dir/policy.yaml --logging-format json --v 3

Please provide a copy of your descheduler policy config file

    apiVersion: "descheduler/v1alpha1"
    kind: "DeschedulerPolicy"
    evictLocalStoragePods: true
    strategies:
      "RemovePodsViolatingTopologySpreadConstraint":
        enabled: true
        params:
          nodeFit: true
          includeSoftConstraints: true

What k8s version are you using (kubectl version)?

EKS 1.22

What did you do?

  1. Have at least 3 nodes in your cluster, label one node "lbl=a", and label the remaining nodes "lbl=b"
  2. Take some deployment such as echoserver, deploy at least 6 replicas, and set the following topology spread constraints (a full manifest sketch follows these steps):
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: lbl
        labelSelector:
          matchLabels:
            app: echoserver
        whenUnsatisfiable: ScheduleAnyway
      - maxSkew: 1
        topologyKey: kubernetes.io/hostname
        labelSelector:
          matchLabels:
            app: echoserver
        whenUnsatisfiable: DoNotSchedule

    I.e. you have one hard constraint on hostname with a max skew of 1, and a soft constraint on the lbl label. With 6 pods the soft constraint will be violated, but the pods will be scheduled anyway. However, a pod will then be evicted by the descheduler because of "includeSoftConstraints".

  3. Now launch the deployment and watch the descheduler continuously evict one pod:
    
    I0225 01:58:13.104557       1 evictions.go:160] "Evicted pod" pod="echoserver/echoserver-xxx" reason="" strategy="RemovePodsViolatingTopologySpreadConstraint" node="xxx"
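For completeness, here is a minimal sketch of the setup from steps 1-2 (node names, namespace, and the echoserver image are illustrative placeholders, not the exact manifests we run):

    # Step 1 (illustrative node names):
    #   kubectl label node node-1 lbl=a
    #   kubectl label node node-2 lbl=b
    #   kubectl label node node-3 lbl=b
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: echoserver
    spec:
      replicas: 6
      selector:
        matchLabels:
          app: echoserver
      template:
        metadata:
          labels:
            app: echoserver
        spec:
          topologySpreadConstraints:
            - maxSkew: 1
              topologyKey: lbl
              labelSelector:
                matchLabels:
                  app: echoserver
              whenUnsatisfiable: ScheduleAnyway
            - maxSkew: 1
              topologyKey: kubernetes.io/hostname
              labelSelector:
                matchLabels:
                  app: echoserver
              whenUnsatisfiable: DoNotSchedule
          containers:
            - name: echoserver
              image: registry.k8s.io/e2e-test-images/echoserver:2.5   # any simple test image works
              ports:
                - containerPort: 8080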


**What did you expect to see?**
We expect the "nodefit" logic to detect that the "includeSoftConstraints" option is enabled and to verify that the pod has somewhere to go that does not violate the soft constraint before evicting it for a soft constraint violation.

**What did you see instead?**
The "nodefit" logic does not take soft constraints into account and allows the pod to be evicted; the pod is then scheduled anyway and is evicted again during the next descheduler run.

Note that removing the "includeSoftConstraints" option is not an option, because the descheduler was deployed for the sole purpose of distributing pods over spot instances when they are available (hence the soft constraint). The presence of just a single spot instance triggers this bug, and pods start being evicted by the descheduler during every descheduler run.
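For context, the production soft constraint is shaped roughly like the sketch below, keyed on a capacity-type node label (eks.amazonaws.com/capacityType is the label EKS managed node groups apply; whether that exact key is used here is an assumption, and the selector is a placeholder):

    topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: eks.amazonaws.com/capacityType   # ON_DEMAND vs SPOT nodes
        labelSelector:
          matchLabels:
            app: my-app                               # placeholder selector
        whenUnsatisfiable: ScheduleAnyway             # soft: spread onto spot capacity when it exists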
damemi commented 1 year ago

NodeFit doesn't currently check against any topology constraints (hard or soft). This is intentional, because NodeFit is meant as a safeguard check against criteria that the strategy might not be aware of such as resource utilization. Checking topology constraints in NodeFit would be redundant because the strategy already checks for hard constraints.

The important difference between hard and soft topology constraints is that hard constraints can never be violated during scheduling, while soft constraints can be violated. This is because the soft constraint is just a "scoring" plugin for the scheduler. It doesn't eliminate any nodes during pod placement if those nodes would violate the constraint, it just ranks nodes based on which node would violate the constraint the least.

That ranking is not deterministic, because it gets factored in with the scores of every other scheduler plugin (such as soft affinity, resource utilization, image locality, etc). This is the key challenge preventing us from considering soft constraints like you are suggesting, because we can't definitively say how the scheduler will actually rank the nodes for re-placement. To do so, we would need to:

  1. know the enabled scoring plugins for the scheduler
  2. know the weights and configuration of those plugins
  3. run the entire scoring cycle ourselves

The reason you're seeing an eviction loop is actually that the scheduler knows there isn't any node the pod can fit on without violating the skew, but it keeps choosing the same node over and over because that's simply the "best" node. So, even if the descheduler ran that entire scoring cycle again, it would come to the same conclusion (and in our case could do something like if newNode == currentNode { doNotEvict() }).

You might be able to fix your situation by tweaking your scheduler configuration, such as setting a higher weight for the PodTopologySpread score plugin or disabling other score plugins that you don't care about. Generally, when descheduler is showing an eviction loop, we recommend looking at your scheduling directives to see why the scheduler and descheduler are out of sync.
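A minimal sketch of that tweak as a KubeSchedulerConfiguration, assuming you can pass a custom config to kube-scheduler (on a managed control plane like EKS you typically cannot change the default scheduler, so this may mean running a second scheduler); the API version and weight value are illustrative:

    apiVersion: kubescheduler.config.k8s.io/v1beta3   # pick the version matching your cluster
    kind: KubeSchedulerConfiguration
    profiles:
      - schedulerName: default-scheduler
        plugins:
          score:
            disabled:
              - name: PodTopologySpread     # drop the default weight...
            enabled:
              - name: PodTopologySpread     # ...and re-enable with a higher one
                weight: 5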

/remove-kind bug /kind feature

eugenea commented 1 year ago

I agree with everything you said; however, that means enabling "includeSoftConstraints" without a topology constraint check in nodefit is a certain recipe for an eviction loop. And disabling that option removes a useful feature: spreading cluster load using soft criteria, for example moving pods onto spot instances when they become available. If this is not on the descheduler roadmap, we'd have to look for other software that does it.

damemi commented 1 year ago

It's indeed not a perfect solution in many cases. Originally, the topology constraint descheduler strategy didn't even have the option to evict based on soft constraints for this exact reason. But some users found a use for it, so we added it as an option.

Ultimately this would be impossible for us to implement without simulating a full scheduling cycle before each eviction, which is not practical to do. Or, if we add event-based triggers (proposed in https://github.com/kubernetes-sigs/descheduler/issues/696), the descheduler could only run when a new node is added, for example. It might be possible to hack together something that creates a descheduler job on these events.
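As a rough illustration of that last idea, whatever watches for node-added events could create a one-off descheduler Job like the sketch below (the service account, ConfigMap name, and image tag are placeholders to adapt):

    apiVersion: batch/v1
    kind: Job
    metadata:
      generateName: descheduler-run-
      namespace: kube-system
    spec:
      template:
        spec:
          serviceAccountName: descheduler-sa            # placeholder; needs RBAC to evict pods
          restartPolicy: Never
          containers:
            - name: descheduler
              image: registry.k8s.io/descheduler/descheduler:v0.25.1   # match your deployed version
              command:
                - /bin/descheduler
                - --policy-config-file=/policy-dir/policy.yaml
                - --v=3
              volumeMounts:
                - name: policy-volume
                  mountPath: /policy-dir
          volumes:
            - name: policy-volume
              configMap:
                name: descheduler-policy-configmap      # placeholder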

k8s-triage-robot commented 1 year ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

rittneje commented 1 year ago

@damemi We encountered the same concern while evaluating the descheduler. The descheduler needs to be fixed so that it only evicts a pod with soft constraint violations if it knows the pod can be rescheduled somewhere better. Somehow the cluster-autoscaler is able to run such a simulation, so I would think it is possible to do so here too.

Until this change is made, includeSoftConstraints should be deprecated with a warning that it is dangerous to use.

rittneje commented 1 year ago

/remove-lifecycle stale

damemi commented 1 year ago

Hi @rittneje, while I haven't worked directly on the cluster autoscaler, I know that it relies heavily on the scheduler framework (and actually parts of the scheduler framework were written specifically with CA in mind). With that, it effectively is a scheduler and is in a much better position to make scoring decisions like this. Refactoring the descheduler to use the scheduler framework is of course possible, but would be a large task. I'm happy to review proposals for this but it unfortunately isn't on our short term roadmap.

Though, without knowing the autoscaler internals myself, I would assume that the cluster autoscaler is still a best-effort estimate without access to the scheduler's config to know what other scheduler plugins could influence pod placement.

I agree that we could clarify how includeSoftConstraints works, because without a clear understanding it can be a footgun. But in the right circumstances other users have found it useful, which is why it was added.

k8s-triage-robot commented 7 months ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

rittneje commented 7 months ago

/remove-lifecycle stale

k8s-triage-robot commented 4 months ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

rittneje commented 4 months ago

/remove-lifecycle stale

Hexcles commented 3 months ago

Shouldn't the new default topologyBalanceNodeFit: true arg in RemovePodsViolatingTopologySpreadConstraint prevent this from happening?
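For reference, in the plugin-based v1alpha2 policy API that arg is set roughly as in the sketch below (field names and defaults may differ between descheduler releases, so treat this as an approximation rather than an authoritative config):

    apiVersion: "descheduler/v1alpha2"
    kind: "DeschedulerPolicy"
    profiles:
      - name: default
        pluginConfig:
          - name: "RemovePodsViolatingTopologySpreadConstraint"
            args:
              constraints:                  # listing ScheduleAnyway is roughly the v1alpha2 analogue of includeSoftConstraints
                - DoNotSchedule
                - ScheduleAnyway
              topologyBalanceNodeFit: true
        plugins:
          balance:
            enabled:
              - "RemovePodsViolatingTopologySpreadConstraint"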

k8s-triage-robot commented 3 weeks ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

rittneje commented 3 weeks ago

/remove-lifecycle stale