kubernetes-sigs / cluster-api

Home for Cluster API, a subproject of sig-cluster-lifecycle
https://cluster-api.sigs.k8s.io
Apache License 2.0

make the autoscaler annotations and the replicas field in cluster.spec.topology mutually exclusive #10343

Closed neolit123 closed 6 months ago

neolit123 commented 6 months ago

What would you like to be added (User Story)?

as a user, i would like to have a single active method of scaling the workloads of a cluster. as a developer, i would like to sandbox the user so that they don't try to scale with the autoscaler and by manually editing the replicas field at the same time.

Detailed Description

manually scaling workers by editing the replicas field in a Class-based cluster with cluster-autoscaler enabled can break cluster-autoscaler. as seen downstream with CAPV, in such a scenario cluster-autoscaler can start creating new worker machines, but then these machines will get deleted immediately.

W0214 18:59:47.696095       1 clusterstate.go:271] Disabling scale-up for node group MachineDeployment/flt/usdc01-tkg-flt-dev-md-1-hwvj6 until 2024-02-14 19:29:47.496309333 +0000 UTC m=+1820198.755318576; errorClass=Other; errorCode=cloudProviderError
I0214 18:59:47.696125       1 eventing_scale_up_processor.go:47] Skipping event processing for unschedulable pods since there is a ScaleUp attempt this loop
E0214 18:59:47.696129       1 static_autoscaler.go:497] Failed to scale up: failed to increase node group size: size decrease too large - desired:5 min:10
I0214 18:59:47.696411       1 event_sink_logging_wrapper.go:48] Event(v1.ObjectReference{Kind:"ConfigMap", Namespace:"kube-system", Name:"cluster-autoscaler-status", UID:"867f70bf-b5b9-40dc-b7ad-75cedadbc089", APIVersion:"v1", ResourceVersion:"164068415", FieldPath:""}): type: 'Warning' reason: 'FailedToScaleUpGroup' Scale-up failed for group MachineDeployment/flt/usdc01-tkg-flt-dev-md-1-hwvj6: size decrease too large - desired:5 min:10

eventually this can cause datastore and VM cleanup problems.

some more quotes from the report:

Manually scaling a CC that got created with autoscaler enabled will add the Replicas parameter to the cluster k8s object for the CC ("spec.topology.machineDeployments.class.replicas"). This will cause Cluster-Autoscaler to break, since it will try to scale up the worker nodes once autoscaling is triggered while cluster-api will try to enforce the Replicas count set in Cluster.spec.topology.machineDeployments.class.replicas, and this will result in continuous creation and deletion of the cluster nodes.

similar https://github.com/kubernetes-sigs/cluster-api/issues/8217#issuecomment-1452203256

Ex:

if the worker replicas count is 2, the autoscaler is enabled on the cluster, and it gets triggered to scale the workers up to a number higher than the current replicas count, worker nodes will get created and deleted within seconds and new pods will fail to reach Running.

proposal

make the autoscaler annotations:

cluster.x-k8s.io/cluster-api-autoscaler-node-group-min-size
cluster.x-k8s.io/cluster-api-autoscaler-node-group-max-size

present on the Cluster object in cluster.Spec.Topology.Workers.MachineDeployments[x].Metadata.Annotations, mutually exclusive with cluster.Spec.Topology.Workers.MachineDeployments[x].Replicas.
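
for illustration, a conflicting topology entry could look like the following Go sketch (field names follow the v1beta1 API; the class, name and values are made up):

import (
	"k8s.io/utils/ptr"
	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
)

// A conflicting topology entry: the same MachineDeployment topology carries
// both the autoscaler size annotations and an explicit replica count
// (class, name and values below are made up for illustration).
var conflictingMD = clusterv1.MachineDeploymentTopology{
	Class: "default-worker",
	Name:  "md-0",
	Metadata: clusterv1.ObjectMeta{
		Annotations: map[string]string{
			"cluster.x-k8s.io/cluster-api-autoscaler-node-group-min-size": "2",
			"cluster.x-k8s.io/cluster-api-autoscaler-node-group-max-size": "5",
		},
	},
	// The topology controller will keep enforcing 3 replicas, while the
	// autoscaler also tries to own the same field.
	Replicas: ptr.To[int32](3),
}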

Q: how about ClusterClass, what has to be done there?

Anything else you would like to add?

No response

Label(s) to be applied

/kind feature /area clusterclass

neolit123 commented 6 months ago

happy to send a PR, if this is agreed on.

neolit123 commented 6 months ago

/area clusterclass

sbueringer commented 6 months ago

/triage accepted

Sounds good to me

Q: how about CluserClass, what has to be done there?

I think nothing. replicas cannot be set in a ClusterClass and it would be highly unusual to set the autoscaler annotations in a ClusterClass (as it would then apply to all Clusters that use a specific ClusterClass).

neolit123 commented 6 months ago

Sounds good to me

any tips on where / how to apply the discussed mutual exclusion of annotations vs the replicas field?

I think nothing. replicas cannot be set in a ClusterClass and it would be highly unusual to set the autoscaler annotations in a ClusterClass (as it would then apply to all Clusters that use a specific ClusterClass).

ack, but it sounds like a plausible use case.

sbueringer commented 6 months ago

any tips on where / how to apply the discussed mutual exclusion of annotations vs the replicas field?

You can add code here to iterate through newCluster.Spec.Topology.Workers.MachineDeployments and check for the annotations/replica fields.
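
a minimal sketch of what such a check could look like, assuming a hypothetical helper called from the Cluster validation webhook (this is not the code from the linked PR):

import (
	"k8s.io/apimachinery/pkg/util/validation/field"
	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
)

const (
	minSizeAnnotation = "cluster.x-k8s.io/cluster-api-autoscaler-node-group-min-size"
	maxSizeAnnotation = "cluster.x-k8s.io/cluster-api-autoscaler-node-group-max-size"
)

// validateAutoscalerAnnotations is a hypothetical helper: for every
// MachineDeployment topology, it rejects specs that set replicas while
// also carrying the autoscaler min/max size annotations.
func validateAutoscalerAnnotations(newCluster *clusterv1.Cluster) field.ErrorList {
	var allErrs field.ErrorList
	if newCluster.Spec.Topology == nil || newCluster.Spec.Topology.Workers == nil {
		return allErrs
	}
	for i, md := range newCluster.Spec.Topology.Workers.MachineDeployments {
		if md.Replicas == nil {
			continue
		}
		_, hasMin := md.Metadata.Annotations[minSizeAnnotation]
		_, hasMax := md.Metadata.Annotations[maxSizeAnnotation]
		if hasMin || hasMax {
			fldPath := field.NewPath("spec", "topology", "workers", "machineDeployments").Index(i).Child("replicas")
			allErrs = append(allErrs, field.Forbidden(fldPath,
				"replicas cannot be set when the cluster-autoscaler min/max size annotations are present"))
		}
	}
	return allErrs
}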

ack, but it sounds like a plausible use case.

If we want to consider this case we additionally need:

  a. in the Cluster webhook: also consider the annotations an MD will get from the ClusterClass
  b. in the ClusterClass webhook: when the annotations are added there, check all Clusters using the ClusterClass (and the specific MD class) if they have replicas set

(a) is comparatively easy, (b) a bit messy but still doable. But what makes (a) a bit more complicated is that while we already have validation today that uses the ClusterClass in the Cluster webhook, it is basically optional validation, because we cannot assume that a ClusterClass is deployed before the Cluster and we don't want to fail in this case.
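
for case (a), a rough sketch of a hypothetical helper that merges the annotations a MachineDeployment topology would inherit from its ClusterClass class definition with those set on the topology itself (the merge rules here are simplified and this is not the actual PR code):

import (
	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
)

// effectiveAnnotations is a hypothetical, simplified helper: it combines the
// annotations defined on the matching MachineDeployment class in the
// ClusterClass with those set directly on the topology (topology wins).
func effectiveAnnotations(clusterClass *clusterv1.ClusterClass, mdTopology clusterv1.MachineDeploymentTopology) map[string]string {
	merged := map[string]string{}
	if clusterClass != nil {
		for _, mdClass := range clusterClass.Spec.Workers.MachineDeployments {
			if mdClass.Class == mdTopology.Class {
				for k, v := range mdClass.Template.Metadata.Annotations {
					merged[k] = v
				}
			}
		}
	}
	for k, v := range mdTopology.Metadata.Annotations {
		merged[k] = v
	}
	return merged
}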

neolit123 commented 6 months ago

thanks for the tips. i will send a PR for the base fix and see if i can also fix A and B.

EDIT: PR is here: https://github.com/kubernetes-sigs/cluster-api/pull/10370

neolit123 commented 6 months ago

there was some feedback at the CAPI zoom call on April 3rd 2024: https://docs.google.com/document/d/1GgFbaYs-H6J5HSQ6a7n4aKpk0nDLE2hgG2NSOM9YIRw/edit#bookmark=id.yrm7uk6etmkk

there was some agreement that bad things can happen if we leave this without the mutual exclusion, at least in a CAPV scenario.

@elmiko promised to TAL and had some good questions - e.g. how to handle the case where a user-defined replicas value is set higher than the autoscaler annotation max. in that case the validation webhook will block them. i said maybe the user should pause the validation webhook, but @fabriziopandini explained this is a bad practice. perhaps they should just temporarily remove the annotations, update the replica count and then add the annotations back.

elmiko commented 6 months ago

the cluster autoscaler can act differently depending on the conditions of how a user modifies the replicas field:

  1. if the user changes the replicas field to be outside the range for min or max size (for example, an MD has a minSize of 2 and maxSize of 5; if the user sets it to 1 or 6, it will be outside), then the autoscaler will not attempt to adjust that group further. the user must bring it back into alignment with the minSize and maxSize.
  2. if the user increases the replicas but the value is less than or equal to the maxSize, eventually (after 10 minutes by default) the autoscaler will evaluate the extra nodes as under-utilized (assuming no workload has landed there) and remove the nodes.
  3. if the user decreases the replicas but the value is greater than or equal to the minSize, if pods are evicted due to the deleted machines the autoscaler will see the pods as pending and create new nodes to satisfy those pods.

as long as the user is not modifying the replica count so that it is outside the min/max range, then the changes are disruptive but do not put the autoscaler into an unrecoverable state.

if the user modifies the replicas to be outside the min/max range, then they will need to remediate it manually.

if we make the annotations mutually exclusive with replicas field then we will need to be very clear about the process a user needs to take when triaging a misbehaving cluster to include the possibility of needing to remove the annotations if the replicas need adjusting.

another important note, as mentioned by @vrabbi in our community meeting today. the replicas field in ClusterClass objects can have very bad interactions with the cluster autoscaler if the user tries to control both the min/max annotations and the replicas.

neolit123 commented 6 months ago

thanks for the details @elmiko, appreciated!

neolit123 commented 6 months ago

another important note, as mentioned by @vrabbi in our community meeting today. the replicas field in ClusterClass objects can have very bad interactions with the cluster autoscaler if the user tries to control both the min/max annotations and the replicas.

i don't think there are replicas fields on CC objects, unless i'm mistaken? https://github.com/kubernetes-sigs/cluster-api/blob/main/api/v1beta1/clusterclass_types.go

i was under the impression we are discussing the MachineDeploymentTopology replicas: https://github.com/kubernetes-sigs/cluster-api/blob/main/api/v1beta1/cluster_types.go#L187

vrabbi commented 6 months ago

Indeed not in CC but rather in CC based clusters

sbueringer commented 6 months ago

To make it slightly more complicated: there is also an --enforce-node-group-min-size flag on the cluster-autoscaler which will scale up to the min even if the current MD.spec.replicas is < min (sorry, just remembering this now) (https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/FAQ.md#my-cluster-is-below-minimum--above-maximum-number-of-nodes-but-ca-did-not-fix-that-why).

I think we have to focus the discussion a bit. I personally would focus in this issue on the ClusterClass case and the question "which configuration is clearly wrong" (so that we should block it in the cluster webhook). My take (assuming min & max annotations are set; a rough sketch of these branches follows at the end of this comment):

  1. Cluster.spec.topology...replicas < min:
    a. if the autoscaler is configured with --enforce-node-group-min-size:
      • it is wrong, as the autoscaler will try to scale at least to min and the cluster topology controller to the value of the replicas field
    b. if not:
      • could be okay if a user intentionally wants to set replicas below min (and thus disable the autoscaler for the moment)
  2. min <= Cluster.spec.topology...replicas <= max:
    • this is always wrong, as the user delegates control of the replicas field to the autoscaler, but the cluster topology controller will always enforce the value of the replicas field
  3. Cluster.spec.topology...replicas > max:
    • could be okay if a user intentionally wants to set replicas above max (and thus disable the autoscaler for the moment)

this means:

Hope I didn't mix up anything.
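
for illustration, a hypothetical helper that just spells out these three branches (not part of cluster-api or the proposed webhook):

// classifyTopologyReplicas is a hypothetical helper that only spells out the
// branches above; it is not part of cluster-api or the proposed webhook.
func classifyTopologyReplicas(replicas, minSize, maxSize int32, enforceMinSize bool) string {
	switch {
	case replicas < minSize:
		if enforceMinSize {
			// 1a: the autoscaler scales up to min while the topology controller
			// keeps scaling back down to the replicas field - a conflict.
			return "conflict"
		}
		// 1b: the user may intentionally park the MachineDeployment below min.
		return "possibly intentional"
	case replicas > maxSize:
		// 3: the user may intentionally park the MachineDeployment above max.
		return "possibly intentional"
	default:
		// 2: min <= replicas <= max - the autoscaler and the topology controller
		// fight over the same field, so this is always wrong.
		return "conflict"
	}
}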

sbueringer commented 6 months ago

Looking at this, I think I would keep it simple: block 1, 2, 3.

I think if users look at their cluster spec, it shouldn't take them a long time and in-depth knowledge of CAPI & cluster-autoscaler to figure out what is actually going on. If replicas & min/max annotations are set, it's hard to figure out what's happening without all the knowledge we're just gathering here.

neolit123 commented 6 months ago

I think if users look at their cluster spec, it shouldn't take them a long time and in-depth knowledge of CAPI & cluster-autoscaler to figure out what is actually going on. If replicas & min/max annotations are set, it's hard to figure out what's happening without all the knowledge we're just gathering here.

agreed.

Looking at this, I think I would keep it simple: block 1, 2, 3.

agreed, but i would like to exclude checking --enforce-node-group-min-size and just consider the annotations and replica values only (i.e. just block 1 and 3 without checking the flag). i would guess the only way to fetch the flag value is by parsing the autoscaler deployment and it's not managed in an API object anywhere.

other than that, 1, 2, 3 don't complicate the PR too much. the code that the PR adds would be simple after that, minus fetching the autoscaler deployment.

Q: to what numerical value should a min/max annotation be set in the PR logic if the user entered a NaN by mistake, e.g. foo for max?

sbueringer commented 6 months ago

agreed, but i would like to exclude checking --enforce-node-group-min-size and just consider the annotations and replica values only

Oh yup 100%. I was just writing up the options of what could happen to think through the problem :)

Q: to what numerical value should a min/max annotation be set in the PR logic if the user entered a NaN by mistake, e.g. foo for max?

We should probably just validate the value and return an error if it's NaN

neolit123 commented 6 months ago

We should probably just validate the value and return an error if it's NaN

ok, understood. just return a standard validation error from the webhook.
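
for illustration, a minimal sketch of what such a value check could look like, using a hypothetical helper and the apimachinery field error utilities (later in the thread the preference shifts to leaving value validation to the autoscaler entirely):

import (
	"strconv"

	"k8s.io/apimachinery/pkg/util/validation/field"
)

// validateSizeAnnotation is a hypothetical helper: it rejects annotation
// values that are not integers (e.g. "foo") with a standard field error
// that the webhook would surface to the user.
func validateSizeAnnotation(fldPath *field.Path, key, value string) *field.Error {
	if _, err := strconv.ParseInt(value, 10, 32); err != nil {
		return field.Invalid(fldPath.Child("metadata", "annotations").Key(key), value,
			"autoscaler size annotation must be an integer")
	}
	return nil
}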

neolit123 commented 6 months ago

@sbueringer just realized, if we are blocking on 1,2,3 doesn't that mean block on any number of replicas?

(x < min) || (min <= x && x <= max) || (x > max)

that's basically what the PR does right now, it just checks that there are min/max annotations, doesn't check their values and checks if replicas is non-nil.

sbueringer commented 6 months ago

Yup, that's what it means :)

Sorry for the confusion. Just wanted to write up all the branches and lay out what could happen in each. Just to eventually end up with that simple result

neolit123 commented 6 months ago

that's basically what the PR does right now, it just checks that there are min/max annotations, doesn't check their values and checks if replicas is non-nil.

in that case i would prefer to leave the annotation value validation to the autoscaler, i.e. not include number parsing of the annotations in these Cluster and CC validating webhooks.

elmiko commented 6 months ago

+1 to making this as simple to understand as possible. from the user experience point of view, there should be no surprises and a clear path for triaging.

validating the min/max value is probably not worth the effort since we will be duplicating logic from the autoscaler, and adds an extra layer for possible confusion.

i absolutely agree with blocking number 2, the others i'm less clear on because there are users who rely on using the replica field to make a specific node group fall outside the autoscaler's scanning. i'm not sure how widespread this pattern is, but we will break it if we disallow changing the replicas while the annotations are present.

neolit123 commented 6 months ago

i absolutely agree with blocking number 2, the others i'm less clear on because there are users who rely on using the replica field to make a specific node group fall outside the autoscaler's scanning. i'm not sure how widespread this pattern is, but we will break it if we disallow changing the replicas while the annotations are present.

yes, that would break them. if we have concerns about this type of usage, we can try to be safe and only apply 2. but IMO there should be a clearer contract, and for 1 and 3 the user pattern should be to just remove the annotations.

it would be even clearer if there is an explicit enabled annotation, but i'm not making a FR about it. :)

sbueringer commented 6 months ago

@fabriziopandini @chrischdi @vincepri @enxebre @killianmuldoon Any opinions on:

i absolutely agree with blocking number 2, the others i'm less clear on because there are users who rely on using the replica field to make a specific node group fall outside the autoscaler's scanning. i'm not sure how widespread this pattern is, but we will break it if we disallow changing the replicas while the annotations are present.

elmiko commented 6 months ago

should be a clearer contract, and for 1 and 3 the user pattern should be to just remove the annotations.

i tend to agree with this notion, but in conversations with end users i have heard them mention that the complexity around the annotations and replicas behavior causes toil in some scenarios. i'm not sure that we can avoid it (although an operator to help with the capi/cas deployment might be a worthwhile investigation) as there is a necessary amount of configuration that has to be done.

vrabbi commented 6 months ago

It still doesn't fully block them either, because they can always change the replicas without removing the annotations by doing it directly on the MachineDeployment resource instead of via the Cluster resource if needed.

sbueringer commented 6 months ago

And they could also change/set replicas and remove the annotations in Cluster.spec.topology. If the goal is to set replicas to a value where the autoscaler is not active they could do that. They just couldn't do it by only setting replicas to a value outside of (min,max) while keeping the annotations.

So I guess it's slightly more work for the user, but the result is more explicit (annotations are removed => so it's 100% clear that autoscaler is not active for everyone)

elmiko commented 6 months ago

So I guess it's slightly more work for the user, but the result is more explicit (annotations are removed => so it's 100% clear that autoscaler is not active for everyone)

agreed, this is my preference too. i just wanted to share some of the feedback i've heard.

chrischdi commented 6 months ago

I would also prefer to be as explicit as possible. We could take care that the returned errors which are surfaced to the user are as easy to understand as possible.

Fewer permutations, fewer questions, and easier to understand.

fabriziopandini commented 6 months ago

+1 to block when the user sets replicas and the autoscaler annotations exist; the user has to explicitly remove the annotations to set replicas, no matter if replicas are within the min/max range or not.

neolit123 commented 6 months ago

there seems to be consensus about the plan. PTAL at the implementation PR: https://github.com/kubernetes-sigs/cluster-api/pull/10370

fabriziopandini commented 6 months ago

/priority important-soon