Cluster Autoscaler: align core concept naming with Karpenter

towca commented 3 months ago

Which component are you using?: Cluster Autoscaler

Is your feature request designed to solve a problem? If so describe the problem this feature should solve.:

Recently, Karpenter officially joined sig-autoscaling, and we now have two Node autoscalers officially supported by Kubernetes. The two name their core concepts differently:

CA: (Node) "scale-up", Karpenter: (Node) "provisioning"
CA: (Node) "scale-down", Karpenter: (Node) "consolidation"

This could be confusing to our users, especially the ones interacting with both autoscalers, e.g. in a multi-cloud scenario. Consistent naming would also make it easier to document Node autoscaling in the k8s docs.

Describe the solution you'd like.:

Start using the names "provisioning" and "consolidation" instead of "scale-up" and "scale-down" in Cluster Autoscaler. We'd start by changing them in all CA documentation (while keeping the former names nearby for some time), and would use the new names in new code. In time, we could clean up the existing code.

Describe any alternative solutions you've considered.:

Start using "scale-up" and "scale-down" in Karpenter. These terms seem worse to me: I think that outside of Cluster Autoscaler they're used for vertical scaling, not horizontal scaling (which would be scale-out).
Do not align on naming the core concepts between CA and Karpenter. In my opinion, this would lead to more confusion in the k8s community long-term than the renaming in CA.

Additional context.:

Doc describing the alignment between CA and Karpenter: CA/Karpenter alignment AEP
I want to bring this up for discussion during the sig-autoscaling meeting on ~~2024-03-25~~ TBD.

towca commented 3 months ago

@MaciekPytel @gjtempleton @jonathan-innis I want to discuss this during the next sig meeting if possible, could you take a look?

sftim commented 3 months ago

/sig docs

sftim commented 3 months ago

Cluster autoscaler often performs scale-up actions (e.g. raising the desired size of an AWS Auto Scaling group), with node provisioning as a side effect of that scale-up action. This is different from Karpenter, which directly launches cloud infrastructure.
Karpenter sometimes performs provisioning as part of consolidation, whereas the cluster autoscaler doesn't (AIUI) do this.
I believe that the cluster autoscaler expects a separate component to drain underutilized nodes, whereas Karpenter actively takes part in the drain process. The cluster autoscaler does reduce .spec.replicas for a MachineDeployment if you use Cluster API, and that does look like a scale-down.

Also, Karpenter removes nodes for reasons other than consolidation (e.g. upcoming spot interruption risk). The ~drivers~ motivations for Karpenter to remove a NodeClaim can include reduced demand on node resources, cost optimization even where the resource demand is unchanged, or an action to address drift.

MaciekPytel commented 3 months ago

@sftim Correct (except the last one - CA does drain itself, similarly to Karpenter) - and there are other differences between the projects too. Very broadly there are many differences from the perspective of setting up and maintaining a cluster managed by Karpenter and CA.

However, from the perspective of running a workload once the cluster is set up, there is remarkably little difference: new nodes are provisioned based on pending pods and their scheduling requirements; underutilized nodes are consolidated based on a binpacking simulation of how the pods running on those nodes would be rescheduled; PDBs and various do-not-evict annotations are respected; the list goes on.
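To illustrate what "binpacking simulation" means here, below is a minimal sketch. It is not the algorithm either project actually runs: it assumes a single resource (CPU in millicores), ignores scheduling constraints and PDBs, and uses first-fit decreasing as a stand-in heuristic. The function name `can_consolidate` and its parameters are hypothetical.

```python
from typing import Dict, List, Optional


def can_consolidate(
    candidate_pods: List[int],
    other_nodes_free: Dict[str, int],
) -> Optional[Dict[int, str]]:
    """Return a pod-index -> node-name placement if every pod on the
    candidate node fits on the remaining nodes, otherwise None.

    All values are CPU millicores. This is a simplified stand-in, not the
    scheduler simulation used by Cluster Autoscaler or Karpenter.
    """
    free = dict(other_nodes_free)  # copy so the caller's data is untouched
    placement: Dict[int, str] = {}
    # First-fit decreasing: place the largest pods first, since they are
    # the hardest to fit.
    order = sorted(
        range(len(candidate_pods)),
        key=lambda i: candidate_pods[i],
        reverse=True,
    )
    for i in order:
        request = candidate_pods[i]
        target = next(
            (name for name, cpu in free.items() if cpu >= request), None
        )
        if target is None:
            return None  # a pod has nowhere to go, so keep the node
        free[target] -= request
        placement[i] = target
    return placement
```

A node with no pods yields an empty placement, so callers should compare the result against `None` rather than rely on truthiness.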

The fact that we're calling substantially the same functionality by different names, that we have project-specific annotations doing essentially the same thing (e.g. safe-to-evict / do-not-evict), etc. is just creating unnecessary complexity for users who want to migrate between autoscalers and/or use both at the same time. We have already discussed this with the Karpenter team, and we want to work together to remove those differences as much as it makes sense. This is a first step in that direction.
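As a concrete example of the duplication, tooling that has to work with both autoscalers ends up checking two annotations for one intent. The sketch below assumes the pod annotation keys `cluster-autoscaler.kubernetes.io/safe-to-evict` (value `"false"`) and `karpenter.sh/do-not-evict` (value `"true"`); key names and accepted values may differ between versions, so check each project's documentation. The helper `blocks_eviction` is hypothetical and not part of either project.

```python
from typing import Mapping

# Assumed annotation keys; verify against the version you run.
CA_SAFE_TO_EVICT = "cluster-autoscaler.kubernetes.io/safe-to-evict"
KARPENTER_DO_NOT_EVICT = "karpenter.sh/do-not-evict"


def blocks_eviction(annotations: Mapping[str, str]) -> bool:
    """Return True if either project-specific annotation asks the
    autoscaler not to evict the pod.

    Uses exact string comparison; the real projects may parse the
    values differently.
    """
    if annotations.get(CA_SAFE_TO_EVICT) == "false":
        return True
    if annotations.get(KARPENTER_DO_NOT_EVICT) == "true":
        return True
    return False
```

With aligned naming, a check like this would collapse to a single key.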

That doesn't mean that we're planning to merge projects or anything similar. As you mentioned there are some pretty fundamental differences between the projects (particularly their intended scope - CA is just node autoscaling, Karpenter takes on much broader responsibility).

njtran commented 3 months ago

One note from the Karpenter side here and echoing @sftim:

Karpenter takes a bit more responsibility in managing the scale-down behaviors of the node, and we encapsulate all of this within our disruption controller. Consolidation is one of these behaviors, along with spec-drift, and a time-based node recycling mechanism. What are your thoughts on disruption vs Consolidation?

In addition, all forms of disruption are also sometimes tied to provisioning like @sftim said:

Karpenter sometimes performs provisioning as part of consolidation, whereas the cluster autoscaler doesn't (AIUI) do this

One question: is there anything within CAS that references the same wording that's used in documentation? For instance, an environment variable that says scale-down-cooldown? If we align on the documentation to be "provisioning" or "disruption", would we also need to find and replace instances of scale-down and scale-up?

elmiko commented 3 months ago

i like this discussion and while i think there is very good reason that both projects use different terminology based on the specific functionality, i agree with @MaciekPytel that this is probably just adding confusion to something that appears the same to the user regardless of which technology they are using.

k8s-triage-robot commented 1 week ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue as fresh with /remove-lifecycle stale
Close this issue with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

towca commented 1 week ago

This is blocked on #6646 which still needs some time.

/remove-lifecycle stale

kubernetes / autoscaler

Cluster Autoscaler: align core concept naming with Karpenter #6647