kubernetes-sigs / karpenter

Karpenter is a Kubernetes Node Autoscaler built for flexibility, performance, and simplicity.
Apache License 2.0

Partitioned NodePool Multi-node Consolidation #853

Closed jashandeep-sohi closed 2 weeks ago

jashandeep-sohi commented 8 months ago

Description

Observed Behavior:

I have a few NodePools that I'm using in a "partitioned" manner. Basically, each NodePool is made independent using user-defined requirements & taints, and Pods in different namespaces use different NodePools.

apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: one
spec:
  template:
    spec:
      taints:
        - key: example.com/partition-key
          value: one
          effect: NoSchedule
      requirements:
        - key: example.com/partition-key
          operator: In
          values: ["one"]
---
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: two
spec:
  template:
    spec:
      taints:
        - key: example.com/partition-key
          value: two
          effect: NoSchedule
      requirements:
        - key: example.com/partition-key
          operator: In
          values: ["two"]

This works fine for the most part, but I'm observing issues with multi-node consolidation.

As far as I can tell, multi-node consolidation looks at all deprovisionable Nodes together: https://github.com/kubernetes-sigs/karpenter/blob/cc54b340f630b46a26d19a3cbd49d90c8b3a6d45/pkg/controllers/disruption/multinodeconsolidation.go#L44C42-L44C42

Since Nodes from independent partitions can't host each other's Pods, I think this means multi-node consolidation either never happens or is sub-optimal at best. Shouldn't this be done on groups of compatible NodePools independently?

Another place where I think this is a problem is the scheduling simulation, which looks at all Pending Pods: https://github.com/kubernetes-sigs/karpenter/blob/cc54b340f630b46a26d19a3cbd49d90c8b3a6d45/pkg/controllers/disruption/helpers.go#L97C35-L97C35

But if one of those Pending Pods is not compatible with the firstN candidates chosen from all Nodes, then the simulation will always complain about unschedulable Pods (which becomes highly likely as the number of Nodes and partitions increases).
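To make the concern concrete, here is a minimal sketch of one possible mitigation: before simulating, keep only the Pending Pods that tolerate the taints of the candidate partition. The `Taint`, `Toleration`, and `Pod` types and the `relevantPending` helper are hypothetical simplifications for illustration, not Karpenter's types or its actual simulation logic, and only exact key/value/effect matches are modelled.

```go
package main

import "fmt"

// Taint and Toleration are hypothetical, simplified stand-ins for the
// Kubernetes types; only exact key/value/effect matching is modelled.
type Taint struct{ Key, Value, Effect string }
type Toleration struct{ Key, Value, Effect string }

// Pod is a hypothetical, simplified pending Pod.
type Pod struct {
	Name        string
	Tolerations []Toleration
}

// tolerates reports whether the Pod has a toleration that exactly matches t.
func tolerates(p Pod, t Taint) bool {
	for _, tol := range p.Tolerations {
		if tol.Key == t.Key && tol.Value == t.Value && tol.Effect == t.Effect {
			return true
		}
	}
	return false
}

// relevantPending returns the pending Pods that tolerate every given taint.
// Pods that could never land on the candidate Nodes are left out, so they
// cannot make the simulation report unschedulable Pods.
func relevantPending(pending []Pod, taints []Taint) []Pod {
	var out []Pod
	for _, p := range pending {
		ok := true
		for _, t := range taints {
			if !tolerates(p, t) {
				ok = false
				break
			}
		}
		if ok {
			out = append(out, p)
		}
	}
	return out
}

func main() {
	partitionOne := []Taint{{Key: "example.com/partition-key", Value: "one", Effect: "NoSchedule"}}
	pending := []Pod{
		{Name: "pod-one", Tolerations: []Toleration{{Key: "example.com/partition-key", Value: "one", Effect: "NoSchedule"}}},
		{Name: "pod-two", Tolerations: []Toleration{{Key: "example.com/partition-key", Value: "two", Effect: "NoSchedule"}}},
		{Name: "pod-untolerating"},
	}
	for _, p := range relevantPending(pending, partitionOne) {
		fmt.Println(p.Name)
	}
}
```

Whether ignoring incompatible Pending Pods is safe in every case is an open question; the sketch only shows why an unrelated Pod from another partition does not need to fail a simulation for this one.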

Expected Behavior:

NodePools should be consolidated in groups computed based on their requirements or based on some configurable partition key.
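As a rough illustration of the "configurable partition key" variant, the sketch below buckets consolidation candidates by the value of a label and then handles each bucket separately. The `Candidate` type and `groupByPartition` helper are hypothetical, and the sketch assumes every Node carries the partition label; it is not how Karpenter's disruption controller is implemented.

```go
package main

import (
	"fmt"
	"sort"
)

// Candidate is a hypothetical, simplified stand-in for a consolidation
// candidate; it is not Karpenter's actual type.
type Candidate struct {
	NodeName string
	Labels   map[string]string
}

// groupByPartition buckets candidates by the value of partitionKey.
// Candidates without the label land in the "" bucket.
func groupByPartition(cands []Candidate, partitionKey string) map[string][]Candidate {
	groups := make(map[string][]Candidate)
	for _, c := range cands {
		v := c.Labels[partitionKey]
		groups[v] = append(groups[v], c)
	}
	return groups
}

func main() {
	const key = "example.com/partition-key"
	cands := []Candidate{
		{NodeName: "node-a", Labels: map[string]string{key: "one"}},
		{NodeName: "node-b", Labels: map[string]string{key: "two"}},
		{NodeName: "node-c", Labels: map[string]string{key: "one"}},
	}
	groups := groupByPartition(cands, key)

	// Sort partition names so the iteration order is deterministic.
	names := make([]string, 0, len(groups))
	for name := range groups {
		names = append(names, name)
	}
	sort.Strings(names)
	for _, name := range names {
		// A real controller would run one multi-node consolidation pass here.
		fmt.Printf("partition %q: %d candidate(s)\n", name, len(groups[name]))
	}
}
```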

Reproduction Steps (Please include YAML): See above

Versions:

jonathan-innis commented 8 months ago

> This works fine for the most part, but I'm observing issues with multi-node consolidation

We've talked about this a little bit for deprovisioning improvements for the exact reason that you called out: If we order without considering a NodePool boundary, we have some potential to get stuck if those boundaries are independent of each other. There's potential to perhaps try another form of multi-node consolidation where we perform n different multi-node consolidations where n is the number of NodePools on the cluster.

Also, I'm curious: Are you still seeing single node consolidation? I would expect that if we were failing to multi-node consolidate, we would still attempt to single node consolidate if there are nodes available.

jashandeep-sohi commented 8 months ago

> Also, I'm curious: Are you still seeing single node consolidation? I would expect that if we were failing to multi-node consolidate, we would still attempt to single node consolidate if there are nodes available.

I remember this being an issue on an older version, but I'm not seeing it getting blocked on single-node consolidation anymore. I think before it would just short-circuit if there were any Pods in pending state.

> There's potential to perhaps try another form of multi-node consolidation where we perform n different multi-node consolidations where n is the number of NodePools on the cluster.

But what if there are 2+ NodePools in each "partition"? For example, with a different NodePool.spec.weight to achieve some kind of priority within a partition. Would doing consolidation on them independently still work? I would think it has to be done on groups of compatible NodePools.
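A minimal sketch of what "groups of compatible NodePools" could mean, under a deliberately simplified model: each NodePool has only `In` requirements, two NodePools count as compatible when every key they both define shares at least one value, and compatibility is merged transitively with a union-find. The `NodePool` struct and all helper names here are hypothetical, and the real requirement semantics are richer than this.

```go
package main

import (
	"fmt"
	"sort"
)

// NodePool is a hypothetical, simplified model with only "In" requirements.
// Weight orders NodePools within a group and does not affect grouping.
type NodePool struct {
	Name         string
	Weight       int
	Requirements map[string][]string
}

// intersects reports whether a and b share at least one value.
func intersects(a, b []string) bool {
	seen := make(map[string]bool, len(a))
	for _, v := range a {
		seen[v] = true
	}
	for _, v := range b {
		if seen[v] {
			return true
		}
	}
	return false
}

// compatible is an assumed rule: keys defined by both NodePools must overlap;
// a key defined by only one of them is treated as unconstrained.
func compatible(a, b NodePool) bool {
	for key, av := range a.Requirements {
		if bv, ok := b.Requirements[key]; ok && !intersects(av, bv) {
			return false
		}
	}
	return true
}

// groupCompatible merges pairwise-compatible NodePools with a union-find and
// returns the sorted names of each group. The pairwise loop is O(n^2).
func groupCompatible(pools []NodePool) [][]string {
	parent := make([]int, len(pools))
	for i := range parent {
		parent[i] = i
	}
	find := func(i int) int {
		for parent[i] != i {
			parent[i] = parent[parent[i]] // path halving
			i = parent[i]
		}
		return i
	}
	for i := range pools {
		for j := i + 1; j < len(pools); j++ {
			if compatible(pools[i], pools[j]) {
				parent[find(i)] = find(j)
			}
		}
	}
	byRoot := make(map[int][]string)
	for i, p := range pools {
		r := find(i)
		byRoot[r] = append(byRoot[r], p.Name)
	}
	groups := make([][]string, 0, len(byRoot))
	for _, names := range byRoot {
		sort.Strings(names)
		groups = append(groups, names)
	}
	sort.Slice(groups, func(i, j int) bool { return groups[i][0] < groups[j][0] })
	return groups
}

func main() {
	const key = "example.com/partition-key"
	pools := []NodePool{
		{Name: "one-high", Weight: 10, Requirements: map[string][]string{key: {"one"}}},
		{Name: "two", Weight: 1, Requirements: map[string][]string{key: {"two"}}},
		{Name: "one-low", Weight: 1, Requirements: map[string][]string{key: {"one"}}},
	}
	fmt.Println(groupCompatible(pools))
}
```

The pairwise comparison is where the cost grows with the number of NodePools and requirement keys.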

Has introducing the concept of NodePool groups as an explicit API ever been discussed? Something like a NodePool.spec.group key might make it easier to figure out those boundaries. Figuring it out based on NodePool.spec.template.spec.requirements is probably computationally expensive. Also, it would be nice if, when group is non-nil, the controller automatically injected taints, labels & requirements like karpenter.sh/nodepool-group={group}. I guess what I'm asking for is first-class support for partitioning.

GnatorX commented 5 months ago

https://github.com/kubernetes-sigs/karpenter/issues/488 I think this is a similar idea, and it would also be a perf improvement.

k8s-triage-robot commented 2 months ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot commented 1 month ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

cnmcavoy commented 1 month ago

I've been looking at multi-node consolidation and independently arrived at the same findings. The consolidation occurs across all nodepools, which is not ideal if you have very different nodepool configurations in your cluster. If the nodepools are similar, I could see this being a positive, but that is not how our clusters are designed.

There is one other factor that we found to be equally significant, if not more so: multi-node consolidation also does not take node architectures into account. So an amd64 node may be consolidated with an arm node. This could in theory succeed if the workloads were all multi-arch compatible, but that is not the case in our workload clusters, so this consolidation also always fails.

So the combination of nodepool mixing + architecture mixing means multi-node consolidation effectively never finds a successful simulation in our clusters.
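One way to picture a fix for both factors at once is to bucket candidate nodes by a composite key of NodePool and architecture, so that a single consolidation pass never mixes them. This is only a sketch: the `Node` type, `bucketKey`, and `bucketNodes` are hypothetical names, and it assumes nodes carry the `karpenter.sh/nodepool` and `kubernetes.io/arch` labels.

```go
package main

import "fmt"

// bucketKey is a hypothetical composite key: one bucket per NodePool and
// CPU architecture pair.
type bucketKey struct{ NodePool, Arch string }

// Node is a hypothetical, simplified node with only a name and labels.
type Node struct {
	Name   string
	Labels map[string]string
}

// bucketNodes groups node names by (NodePool, architecture), read from the
// karpenter.sh/nodepool and kubernetes.io/arch labels.
func bucketNodes(nodes []Node) map[bucketKey][]string {
	buckets := make(map[bucketKey][]string)
	for _, n := range nodes {
		k := bucketKey{
			NodePool: n.Labels["karpenter.sh/nodepool"],
			Arch:     n.Labels["kubernetes.io/arch"],
		}
		buckets[k] = append(buckets[k], n.Name)
	}
	return buckets
}

func main() {
	nodes := []Node{
		{Name: "n1", Labels: map[string]string{"karpenter.sh/nodepool": "general", "kubernetes.io/arch": "amd64"}},
		{Name: "n2", Labels: map[string]string{"karpenter.sh/nodepool": "general", "kubernetes.io/arch": "arm64"}},
		{Name: "n3", Labels: map[string]string{"karpenter.sh/nodepool": "general", "kubernetes.io/arch": "amd64"}},
	}
	// Map iteration order is not deterministic, so bucket order may vary.
	for k, names := range bucketNodes(nodes) {
		fmt.Printf("%s/%s: %v\n", k.NodePool, k.Arch, names)
	}
}
```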

cnmcavoy commented 1 month ago

/remove-lifecycle rotten

k8s-triage-robot commented 2 weeks ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-ci-robot commented 2 weeks ago

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to [this](https://github.com/kubernetes-sigs/karpenter/issues/853#issuecomment-2282853802):

> The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
>
> This bot triages issues according to the following rules:
> - After 90d of inactivity, `lifecycle/stale` is applied
> - After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
> - After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed
>
> You can:
> - Reopen this issue with `/reopen`
> - Mark this issue as fresh with `/remove-lifecycle rotten`
> - Offer to help out with [Issue Triage][1]
>
> Please send feedback to sig-contributor-experience at [kubernetes/community](https://github.com/kubernetes/community).
>
> /close not-planned
>
> [1]: https://www.kubernetes.dev/docs/guide/issue-triage/

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository.