kubernetes / autoscaler

Autoscaling components for Kubernetes
Apache License 2.0

Cluster Autoscaler: align core concept naming with Karpenter #6647

Open towca opened 3 months ago

towca commented 3 months ago

Which component are you using?: Cluster Autoscaler

Is your feature request designed to solve a problem? If so describe the problem this feature should solve.:

Recently, Karpenter officially joined sig-autoscaling, and we now have 2 Node autoscalers officially supported by Kubernetes. The naming of core concepts between the two is different:

This could be confusing to our users, especially the ones interacting with both autoscalers e.g. in a multi-cloud scenario. Consistent naming would also make it easier to document Node autoscaling in k8s docs.

Describe the solution you'd like.:

Start using "provisioning" and "consolidation" names instead of "scale-up" and "scale-down" in Cluster Autoscaler. We'd start with changing it in all CA documentation (while leaving the former name close-by for some time), and would use it in new code. In time, we could clean up the existing code.

Describe any alternative solutions you've considered.:

Additional context.:

towca commented 3 months ago

@MaciekPytel @gjtempleton @jonathan-innis I want to discuss this during the next sig meeting if possible, could you take a look?

sftim commented 3 months ago

/sig docs

sftim commented 3 months ago

Also, Karpenter removes nodes for reasons other than consolidation (e.g. upcoming spot interruption risk). The ~drivers~ motivations for Karpenter to remove a NodeClaim can include reduced demand on node resources, cost optimization even where the resource demand is unchanged, or an action to address drift.

MaciekPytel commented 3 months ago

@sftim Correct (except the last one - CA does drain itself, similarly to Karpenter) - and there are other differences between the projects too. Very broadly there are many differences from the perspective of setting up and maintaining a cluster managed by Karpenter and CA.

However, from the perspective of running a workload once the cluster is set up, there is remarkably little difference - new nodes are provisioned based on pending pods and their scheduling requirements, underutilized nodes are consolidated based on a binpacking simulation of how pods running on those nodes would be rescheduled, PDBs and various do-not-evict annotations are respected, the list goes on.
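To make the "binpacking simulation" idea concrete, here is a minimal sketch of the check both autoscalers perform in some form: a node is a removal candidate only if all of its pods would fit on the remaining nodes. Everything here is an illustrative assumption, not code from either project: the `Node` type, the `can_consolidate` helper, the CPU-only model, and the first-fit-decreasing placement. Real implementations also account for memory, scheduling constraints, PDBs, and more.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical node model: CPU only, in millicores."""
    name: str
    cpu_capacity: int
    pods: dict[str, int] = field(default_factory=dict)  # pod name -> CPU request

    def free(self) -> int:
        return self.cpu_capacity - sum(self.pods.values())


def can_consolidate(candidate: Node, others: list[Node]) -> bool:
    """Return True if every pod on `candidate` fits on `others`.

    Uses first-fit decreasing on a copy of the free capacity, so the
    simulation never mutates the real node objects.
    """
    free = {n.name: n.free() for n in others}
    for request in sorted(candidate.pods.values(), reverse=True):
        target = next((name for name, cap in free.items() if cap >= request), None)
        if target is None:
            return False  # at least one pod has nowhere to go
        free[target] -= request
    return True


nodes = [
    Node("a", 4000, {"web-1": 1500, "web-2": 1500}),
    Node("b", 4000, {"db-1": 3500}),
    Node("c", 4000, {"job-1": 500, "job-2": 400}),
]

# Each node is evaluated independently against all the others.
removable = [
    n.name for n in nodes
    if can_consolidate(n, [o for o in nodes if o is not n])
]
```

In this toy cluster, nodes `a` and `c` are each removable on their own, while `b` is not, because no other node has 3500m of free CPU for `db-1`. Whether this step is called "scale-down" or "consolidation", the underlying question being simulated is the same.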

The fact that we're calling substantially the same functionality differently, that we have project-specific annotations doing essentially the same thing (e.g. safe-to-evict / do-not-evict), etc. is just creating unnecessary complexity for the users who want to migrate between autoscalers and/or use both at the same time. We already discussed this with the Karpenter team and we want to work together to help remove those as much as it makes sense - and this is a first step in this direction.
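As an illustration of the duplicated annotations mentioned above, the sketch below treats the two project-specific keys as one signal. The full annotation keys are my understanding of the names the projects used around the time of this discussion (Karpenter has since moved to `karpenter.sh/do-not-disrupt`), and the `blocks_node_removal` helper is hypothetical, not part of either codebase.

```python
# Annotation keys as (to my understanding) used by each project at the time;
# treat them as assumptions and check the current docs before relying on them.
CA_SAFE_TO_EVICT = "cluster-autoscaler.kubernetes.io/safe-to-evict"
KARPENTER_DO_NOT_EVICT = "karpenter.sh/do-not-evict"


def blocks_node_removal(annotations: dict[str, str]) -> bool:
    """Hypothetical helper: True if a pod opts out of eviction via either key.

    Note the inverted polarity: CA's key blocks eviction when set to "false",
    while Karpenter's key blocks it when set to "true".
    """
    if annotations.get(CA_SAFE_TO_EVICT) == "false":
        return True
    if annotations.get(KARPENTER_DO_NOT_EVICT) == "true":
        return True
    return False


ca_pod = {CA_SAFE_TO_EVICT: "false"}
karpenter_pod = {KARPENTER_DO_NOT_EVICT: "true"}
plain_pod = {"app": "web"}
```

A user running both autoscalers would have to set both keys, with opposite values, to express a single intent. That is the kind of incidental difference the naming alignment aims to reduce.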

That doesn't mean that we're planning to merge projects or anything similar. As you mentioned there are some pretty fundamental differences between the projects (particularly their intended scope - CA is just node autoscaling, Karpenter takes on much broader responsibility).

njtran commented 3 months ago

One note from the Karpenter side here and echoing @sftim:

Karpenter takes a bit more responsibility in managing the scale-down behaviors of the node, and we encapsulate all of this within our disruption controller. Consolidation is one of these behaviors, along with spec-drift and a time-based node recycling mechanism. What are your thoughts on "disruption" vs. "consolidation"?

In addition, all forms of disruption are also sometimes tied to provisioning like @sftim said:

> Karpenter sometimes performs provisioning as part of consolidation, whereas the cluster autoscaler doesn't (AIUI) do this

One question: is there anything within CAS that references the same wording that's used in the documentation? For instance, an environment variable that says scale-down-cooldown? If we align the documentation on "provisioning" or "disruption", would we also need to find and replace all instances of scale-down and scale-up?

elmiko commented 3 months ago

i like this discussion and while i think there is very good reason that both projects use different terminology based on the specific functionality, i agree with @MaciekPytel that this is probably just adding confusion to something that appears the same to the user regardless of which technology they are using.

k8s-triage-robot commented 1 week ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

towca commented 1 week ago

This is blocked on #6646 which still needs some time.

/remove-lifecycle stale