Partitioned NodePool Multi-node Consolidation

jashandeep-sohi commented 8 months ago

Description

Observed Behavior:

I have a few NodePools that I'm using in a "partitioned" manner. Each NodePool is made independent using user-defined requirements and taints, and Pods in different namespaces use different NodePools.

apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: one
spec:
  template:
    spec:
      # nodeClassRef and other fields omitted for brevity
      taints:
        - key: example.com/partition-key
          value: one
          effect: NoSchedule
      requirements:
        - key: example.com/partition-key
          operator: In
          values: ["one"]
---
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: two
spec:
  template:
    spec:
      # nodeClassRef and other fields omitted for brevity
      taints:
        - key: example.com/partition-key
          value: two
          effect: NoSchedule
      requirements:
        - key: example.com/partition-key
          operator: In
          values: ["two"]

This works fine for the most part, but I'm observing issues with multi-node consolidation.

As far as I can tell, multi-node consolidation looks at all deprovisionable Nodes together: https://github.com/kubernetes-sigs/karpenter/blob/cc54b340f630b46a26d19a3cbd49d90c8b3a6d45/pkg/controllers/disruption/multinodeconsolidation.go#L44C42-L44C42

I think this means multi-node consolidation either never happens or is sub-optimal at best. Shouldn't it be done on groups of compatible NodePools independently?

Another place where I think this is a problem is when simulating the scheduling you look at all Pending Pods: https://github.com/kubernetes-sigs/karpenter/blob/cc54b340f630b46a26d19a3cbd49d90c8b3a6d45/pkg/controllers/disruption/helpers.go#L97C35-L97C35

But if one of those Pending Pods is not compatible with the firstN candidates chosen from all Nodes, the simulation will always complain about unschedulable Pods. This becomes highly likely as the number of Nodes and partitions increases.
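To make the concern above concrete, here is a minimal Python sketch of a toy first-fit scheduler. It is not Karpenter's simulation code; the `Pod`, `Node`, and `unschedulable` names and the slot-based capacity model are assumptions made up for illustration. It only shows how one Pending Pod from an unrelated partition can make a global check fail while a partition-scoped check passes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pod:
    name: str
    partition: str  # hypothetical stand-in for the example.com/partition-key value


@dataclass(frozen=True)
class Node:
    name: str
    partition: str
    free_slots: int  # toy capacity: how many more pods fit on this node


def unschedulable(pods, nodes):
    """Toy first-fit scheduler: return the names of pods that fit on no node."""
    free = {n.name: n.free_slots for n in nodes}
    failed = []
    for pod in pods:
        for node in nodes:
            if node.partition == pod.partition and free[node.name] > 0:
                free[node.name] -= 1
                break
        else:
            failed.append(pod.name)
    return failed


# Partition "one" has spare room; partition "two" has a pod that fits nowhere.
remaining = [Node("one-b", "one", 2), Node("two-a", "two", 0)]
evicted = [Pod("web-1", "one")]    # pods from the candidate node being removed
pending = [Pod("batch-1", "two")]  # unrelated Pending Pod in another partition

# Global check: the unrelated pod makes the whole simulation look like a failure.
global_result = unschedulable(evicted + pending, remaining)

# Partition-scoped check: only pods and nodes from partition "one" are considered.
scoped_result = unschedulable(
    [p for p in evicted + pending if p.partition == "one"],
    [n for n in remaining if n.partition == "one"],
)
```

In this toy model `global_result` contains the unrelated Pod while `scoped_result` is empty, which mirrors the failure mode described above.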

Expected Behavior:

NodePools should be consolidated in groups computed based on their requirements or based on some configurable partition key.

Reproduction Steps (Please include YAML): See above

Versions:

Chart Version: v0.33.0
Kubernetes Version (kubectl version): 1.27
Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
If you are interested in working on this issue or have submitted a pull request, please leave a comment

jonathan-innis commented 8 months ago

> This works fine for the most part, but I'm observing issues with multi-node consolidation

We've talked about this a little bit for deprovisioning improvements for the exact reason that you called out: If we order without considering a NodePool boundary, we have some potential to get stuck if those boundaries are independent of each other. There's potential to perhaps try another form of multi-node consolidation where we perform n different multi-node consolidations where n is the number of NodePools on the cluster.
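The ordering problem described above can be sketched in a few lines of Python. This is an illustration only, not Karpenter code: the tuple layout `(name, nodepool, cost)`, the `first_n_per_nodepool` helper, and the numeric cost are assumptions invented for the example.

```python
from itertools import groupby


def first_n_per_nodepool(candidates, n):
    """candidates: list of (name, nodepool, cost) tuples.

    Return a dict mapping each nodepool to its (up to) n cheapest candidate names.
    """
    # groupby needs its input sorted by the grouping key; cost breaks ties within a pool.
    ordered = sorted(candidates, key=lambda c: (c[1], c[2]))
    return {
        pool: [c[0] for c in list(group)[:n]]
        for pool, group in groupby(ordered, key=lambda c: c[1])
    }


candidates = [
    ("a", "one", 3),
    ("b", "two", 1),
    ("c", "one", 1),
    ("d", "two", 2),
    ("e", "one", 2),
]

# Ordering across the whole cluster: the first two candidates span both NodePools.
global_first = [c[0] for c in sorted(candidates, key=lambda c: c[2])[:2]]

# Ordering per NodePool: each batch stays inside one NodePool boundary.
per_pool = first_n_per_nodepool(candidates, 2)
```

Here the global prefix mixes candidates from `one` and `two`, whereas the per-NodePool prefixes could each be handed to a separate consolidation pass.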

Also, I'm curious: Are you still seeing single-node consolidation? I would expect that if we were failing to multi-node consolidate, we would still attempt to single-node consolidate if there are nodes available.

jashandeep-sohi commented 8 months ago

> Also, I'm curious: Are you still seeing single-node consolidation? I would expect that if we were failing to multi-node consolidate, we would still attempt to single-node consolidate if there are nodes available.

I remember this being an issue on an older version, but I'm not seeing it getting blocked on single-node consolidation anymore. I think before it would just short-circuit if there were any Pods in pending state.

> There's potential to perhaps try another form of multi-node consolidation where we perform n different multi-node consolidations where n is the number of NodePools on the cluster.

But what if there are 2+ NodePools in each "partition"? For example, with a different NodePool.spec.weight to achieve some kind of priority within a partition. Would doing consolidation on them independently still work? I would think it has to be done on groups of compatible NodePools.
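One way to think about "groups of compatible NodePools" is as connected components over a pairwise compatibility check. The Python sketch below is an assumption-heavy illustration, not how Karpenter works: it only models `In` requirements as sets of values, and the `compatible` and `partition_groups` helpers and the `one-spot` NodePool are hypothetical.

```python
KEY = "example.com/partition-key"


def compatible(a, b):
    """a, b: dict mapping requirement key -> set of allowed values (In operator only).

    Two NodePools are treated as compatible when every key they both constrain
    has at least one value in common. Pools with no shared keys are compatible.
    """
    return all(a[k] & b[k] for k in a.keys() & b.keys())


def partition_groups(pools):
    """pools: dict mapping NodePool name -> requirements dict.

    Return the connected components of the compatibility graph as sorted name lists.
    """
    parent = {name: name for name in pools}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    names = sorted(pools)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if compatible(pools[a], pools[b]):
                parent[find(a)] = find(b)

    groups = {}
    for name in names:
        groups.setdefault(find(name), []).append(name)
    return sorted(groups.values())


pools = {
    "one": {KEY: {"one"}},
    "one-spot": {KEY: {"one"}},  # hypothetical second NodePool in partition "one"
    "two": {KEY: {"two"}},
}
groups = partition_groups(pools)
```

With these inputs `one` and `one-spot` land in the same group and `two` stands alone. The pairwise loop is quadratic in the number of NodePools, and real requirements also use other operators, so this understates the work involved.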

Has introducing the concept of NodePool groups as an explicit API ever been discussed? Something like a NodePool.spec.group key might make it easier to figure out those boundaries. Figuring them out based on NodePool.spec.template.spec.requirements is probably computationally expensive. It would also be nice if, when group is non-nil, the controller automatically injected taints, labels, and requirements such as karpenter.sh/nodepool-group={group}. I guess what I'm asking for is first-class support for partitioning.
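As a rough illustration of the defaulting behavior proposed above, the following Python sketch injects a taint, a label, and a requirement into a NodePool represented as a plain dict. Everything here is hypothetical: `spec.group` does not exist in the NodePool API, `karpenter.sh/nodepool-group` is only the key suggested in this comment, and `apply_group_defaults` is an invented helper.

```python
import copy

GROUP_KEY = "karpenter.sh/nodepool-group"  # hypothetical key from the proposal above


def apply_group_defaults(nodepool):
    """Return a copy of nodepool with group taint, label, and requirement injected.

    The input is left untouched. If spec.group is unset, the copy is returned as-is.
    """
    pool = copy.deepcopy(nodepool)
    group = pool["spec"].get("group")
    if not group:
        return pool

    template = pool["spec"].setdefault("template", {})
    template.setdefault("metadata", {}).setdefault("labels", {})[GROUP_KEY] = group

    spec = template.setdefault("spec", {})
    taint = {"key": GROUP_KEY, "value": group, "effect": "NoSchedule"}
    taints = spec.setdefault("taints", [])
    if taint not in taints:  # keep the operation idempotent
        taints.append(taint)

    requirement = {"key": GROUP_KEY, "operator": "In", "values": [group]}
    requirements = spec.setdefault("requirements", [])
    if requirement not in requirements:
        requirements.append(requirement)
    return pool


original = {"metadata": {"name": "one"}, "spec": {"group": "one"}}
defaulted = apply_group_defaults(original)
```

Applying the function twice yields the same result, which is the property a defaulting step would need if it ran on every reconcile.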

GnatorX commented 5 months ago

https://github.com/kubernetes-sigs/karpenter/issues/488 I think this is a similar idea, and it would also be a performance improvement.

k8s-triage-robot commented 2 months ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue as fresh with /remove-lifecycle stale
Close this issue with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot commented 1 month ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue as fresh with /remove-lifecycle rotten
Close this issue with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

cnmcavoy commented 1 month ago

I've been looking at multi-node consolidation and independently reached the same findings. Consolidation occurs across all NodePools, which is not ideal if you have very different NodePool configurations in your cluster. If the NodePools are similar, I could see this being a positive, but that is not how our clusters are designed.

There is one other factor that we found to be equally significant, if not more so: multi-node consolidation also does not take node architecture into account. An amd64 node may therefore be consolidated with an arm node. In theory this could succeed if the workloads were all multi-arch compatible, but that is not the case in our workload clusters, so this consolidation also always fails.

So the combination of nodepool mixing + architecture mixing means multi-node consolidation effectively never finds a successful simulation in our clusters.
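A minimal Python sketch of splitting candidates along both boundaries mentioned above follows. It assumes nodes carry the `karpenter.sh/nodepool` and `kubernetes.io/arch` labels and that grouping on those two labels is the desired boundary; the `group_candidates` helper and the node dict shape are invented for illustration and are not Karpenter code.

```python
from collections import defaultdict

# Using these two node labels as the grouping key is the assumption in this sketch.
NODEPOOL_LABEL = "karpenter.sh/nodepool"
ARCH_LABEL = "kubernetes.io/arch"


def group_candidates(nodes, keys=(NODEPOOL_LABEL, ARCH_LABEL)):
    """Bucket node names by the tuple of their values for the given label keys.

    Nodes missing a label get None in that position of the bucket key.
    """
    groups = defaultdict(list)
    for node in nodes:
        bucket = tuple(node["labels"].get(k) for k in keys)
        groups[bucket].append(node["name"])
    return dict(groups)


nodes = [
    {"name": "n1", "labels": {NODEPOOL_LABEL: "one", ARCH_LABEL: "amd64"}},
    {"name": "n2", "labels": {NODEPOOL_LABEL: "one", ARCH_LABEL: "arm64"}},
    {"name": "n3", "labels": {NODEPOOL_LABEL: "one", ARCH_LABEL: "amd64"}},
    {"name": "n4", "labels": {NODEPOOL_LABEL: "two", ARCH_LABEL: "amd64"}},
]
groups = group_candidates(nodes)
```

Each resulting bucket contains nodes from a single NodePool and a single architecture, so a simulation run per bucket would not mix amd64 and arm nodes.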

cnmcavoy commented 1 month ago

/remove-lifecycle rotten

k8s-triage-robot commented 2 weeks ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Reopen this issue with /reopen
Mark this issue as fresh with /remove-lifecycle rotten
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-ci-robot commented 2 weeks ago

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to [this](https://github.com/kubernetes-sigs/karpenter/issues/853#issuecomment-2282853802):

>The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
>
>This bot triages issues according to the following rules:
>- After 90d of inactivity, `lifecycle/stale` is applied
>- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
>- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed
>
>You can:
>- Reopen this issue with `/reopen`
>- Mark this issue as fresh with `/remove-lifecycle rotten`
>- Offer to help out with [Issue Triage][1]
>
>Please send feedback to sig-contributor-experience at [kubernetes/community](https://github.com/kubernetes/community).
>
>/close not-planned
>
>[1]: https://www.kubernetes.dev/docs/guide/issue-triage/

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository.
