kubernetes-sigs / cluster-api

Home for Cluster API, a subproject of sig-cluster-lifecycle
https://cluster-api.sigs.k8s.io

Unexpected behaviour using kubectl scale with a topology managed cluster #7395

Open killianmuldoon opened 2 years ago

killianmuldoon commented 2 years ago

When using kubectl scale machinedeployment or kubectl scale kcp with a topology managed cluster the command succeeds but does not have the expected result. Depending on how quickly the topology controller reacts to the change the resource either doesn't appear to scale at all or else it scales up and then scales down quickly (in line with replicas specified in the cluster topology).
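To make the behaviour concrete, here is a minimal sketch of a topology managed cluster (all names and replica counts are illustrative, not taken from this issue). Because replicas is set in `spec.topology`, the topology controller keeps reconciling the generated MachineDeployment back to that value, so a `kubectl scale machinedeployment` either appears to do nothing or briefly scales and is then reverted:

```yaml
apiVersion: cluster.x-k8s.io/v1beta1
kind: Cluster
metadata:
  name: my-cluster              # illustrative name
  namespace: default
spec:
  topology:
    class: my-cluster-class     # illustrative ClusterClass name
    version: v1.25.0
    controlPlane:
      replicas: 3               # enforced on the KCP by the topology controller
    workers:
      machineDeployments:
        - class: default-worker
          name: md-0
          replicas: 3           # enforced on the generated MD; kubectl scale on the MD is reverted to this
```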

There are two options to deal with this IMO:

1. Intercept the scale request and send it to the Cluster Topology instead.
2. Block a scale request when the scalable resource is topology managed and replicas is set.

Blocking the request is the most feasible IMO, as intercepting the request would be control plane specific, i.e. we could implement it on KCP but it would have to be done in all other control plane providers. It would also mean adding a client to all scalable resources, and it subverts some expectations about how the API should work. When blocking, we can return a useful error message explaining how to make a resource independently scalable, i.e. by unsetting replicas.

I haven't done a POC, but I think it would be feasible to both block changing the replicas field and block access to the scale subresource. This would have an impact on autoscaling, with the autoscaler returning a useful error when trying to improperly change a managed cluster's scalable resource rather than having the scale operation be apparently successful, but resulting in no spec change.
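For reference, `kubectl scale` acts on the resource's `/scale` subresource, which carries an `autoscaling/v1` Scale object rather than the full resource, so blocking would need to cover this payload as well as direct updates to `.spec.replicas`. Names below are illustrative:

```yaml
# Roughly what an update to machinedeployments/<name>/scale carries
apiVersion: autoscaling/v1
kind: Scale
metadata:
  name: md-0            # illustrative MachineDeployment name
  namespace: default
spec:
  replicas: 5           # desired replica count sent by the scale request
```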

This is also, presumably, an issue when a MachineSet is managed by a MachineDeployment, but scaling an MS individually is probably a less established pattern.

/area topology
/cc @jackfrancis @sbueringer

fabriziopandini commented 2 years ago

/triage accepted

Ideally, the cluster should become the scalable resource in this case, but I'm not sure we can achieve this given that there could be many scalable things in one cluster.

+1 to explore ideas about how to make this possible/less surprising for the users

killianmuldoon commented 2 years ago

Ideally, the cluster should become the scalable resource in this case, but I'm not sure we can achieve this given that there could be many scalable things in one cluster

Agreed - especially with multiple MachineDeployments this could become really unergonomic - i.e. kubectl scale cluster md $MD_NAME --replicas=3

sbueringer commented 2 years ago

Looking at how scale is implemented on the Kubernetes side, this means we need spec/status fields for replicas in the Cluster, and I think we can't have multiple of them: https://github.com/kubernetes-sigs/cluster-api/blob/3b3d969c1587e8c6cf762b1d1b3ce98a977e4c29/config/crd/bases/cluster.x-k8s.io_machinedeployments.yaml#L508-L512
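For context, the linked lines are the scale subresource stanza of the MachineDeployment CRD, which can only point at a single spec/status replicas path per resource; it looks roughly like this:

```yaml
subresources:
  scale:
    labelSelectorPath: .status.selector
    specReplicasPath: .spec.replicas
    statusReplicasPath: .status.replicas
  status: {}
```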

So I think making this happen somehow requires a very creative solution.

Intercept the scale request and send it to the Cluster Topology instead.

I can't imagine a good solution for this

This would have an impact on autoscaling, with the autoscaler returning a useful error when trying to improperly change a managed cluster's scalable resource rather than having the scale operation be apparently successful, but resulting in no spec change.

We would have to make sure we block this request only if the topology controller actually cares about the replicas field. Otherwise we break the autoscaler.

Additionally, I have no idea if the autoscaler is using update/patch or the scale subresource.

sbueringer commented 2 years ago

Agreed - especially with multiple MachineDeployments this could become really unergonomic - i.e. kubectl scale cluster md $MD_NAME --replicas=3

This CLI syntax 100% requires a KEP for kubectl as it's not supported today.

killianmuldoon commented 2 years ago

Additionally, I have no idea if the autoscaler is using update/patch or the scale subresource.

I think if we implement either of the solutions it would have to work for both the scale subresource and for updating the replicas field.

This CLI syntax 100% requires a KEP for kubectl as it's not supported today.

I'm definitely not up for proposing it :laughing: - just expressing the level of granularity this would need because of the multiple types of scalable resources and the multiple actual resources in the Cluster Topology.

sbueringer commented 2 years ago

Just wanted to make clear that we're talking about changing a feature / trying to work around a feature which is almost entirely implemented upstream (kubectl / kube-apiserver) :)

killianmuldoon commented 2 years ago

Just had a look - it looks like the Cluster Autoscaler uses https://pkg.go.dev/k8s.io/client-go/scale which works on the scale subresource. But we can't be sure this is the case for other autoscaler implementations of course.

jackfrancis commented 2 years ago

Thanks for facilitating this discussion @killianmuldoon

@elmiko a brief tour through the capi provider of cluster-autoscaler suggests that it doesn't work with ClusterClass at all

Maybe this option will solve for a lot of the issues we're observing:

Intercept the scale request and send it to the Cluster Topology instead.

But it would probably be better to have autoscaler be ClusterClass-aware, and initiate scale events against the appropriate Cluster sub-resource instead of the MachineDeployments (and when MachinePools support lands for autoscaler, same idea).

Does that sound right?

fabriziopandini commented 2 years ago

I was thinking about this and looking at options to make it work for 80% of the use cases

Is it acceptable to say that kubectl scale should scale the first MachineDeployment?

Why this approach? I assume scaling the control plane is not a frequent operation, so I don't see value in providing a shortcut for this operation (it could also be seen as a sort of protection for the control plane)

So, with the CP off the table, kubectl scale should target MachineDeployments, and assuming that having only 1 MD is the most common use case, it should target the first

Does this make sense?

fabriziopandini commented 2 years ago

Wrt the autoscaler, AFAIK we are not setting replicas in the cluster topology, so everything should work as usual

killianmuldoon commented 2 years ago

The first MachineDeployment is probably the 80 percent case, but IMO it's not a great solution to this problem. It's not extensible to cover any other use case, which makes it confusing and frustrating for users. It introduces inconsistency into the management of Clusters and into the API, which could cause further complications down the line.

I think blocking the request is probably a better solution, even if it means losing kubectl scale.

sbueringer commented 2 years ago

I would definitely prefer not adding the necessary replica fields to Cluster to be able to make kubectl scale work

jackfrancis commented 2 years ago

+1 on not doing the "just choose the first MachineDeployment" approach, as that seems too non-deterministic

elmiko commented 2 years ago

from @jackfrancis

@elmiko a brief tour through the capi provider of cluster-autoscaler suggests that it doesn't work with ClusterClass at all

correct, autoscaler does not know about ClusterClass

Maybe this option will solve for a lot of the issues we're observing:

Intercept the scale request and send it to the Cluster Topology instead.

But it would probably be better to have autoscaler be ClusterClass-aware, and initiate scale events against the appropriate Cluster sub-resource instead of the MachineDeployments (and when MachinePools support lands for autoscaler, same idea).

Does that sound right?

i have to admit that i don't know all the details of the relationship between ClusterClass and MachineDeployment, but certainly we could have the autoscaler become aware of ClusterClasses and then add that into the workflow that it follows. it sounds like there might be a case where the autoscaler would detect that a ClusterClass is owning (not sure if that is the correct term) /some/ scalable resource (Machine[Deployment|Set|Pool]), and when the autoscaler wants to scale that resource we would make a decision to choose which scalable resource gets updated. do i have that correct?

from @fabriziopandini

Wrt the autoscaler, AFAIK we are not setting replicas in the cluster topology, so everything should work as usual

given that, we might not need to do anything unless we want to build this out in the autoscaler.

fwiw, i think we should keep the autoscaler as simple as possible with respect to the scalable CAPI resources.

jackfrancis commented 2 years ago

I just want to be clear for any other folks reading this thread now and in the future: the cluster-api cluster-autoscaler not only doesn't know about ClusterClass, but it will 100% not work. So for folks running ClusterClass-enabled Cluster API clusters, you can not use cluster-autoscaler at this time.

Someone correct me if I'm wrong!

sbueringer commented 2 years ago

I just want to be clear for any other folks reading this thread now and in the future: the cluster-api cluster-autoscaler not only doesn't know about ClusterClass, but it will 100% not work. So for folks running ClusterClass-enabled Cluster API clusters, you can not use cluster-autoscaler at this time.

I think this is not correct. If you want to use the autoscaler, you just don't set a replica field in Cluster.spec.topology.workers..... The topology controller will create the MD without specifying replicas. Defaulting will set replicas to 1. From then on the cluster-autoscaler can modify the replicas on the MD and the topology controller doesn't care and doesn't enforce any value on the MD spec field.
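A minimal sketch of that setup (names are illustrative), with replicas deliberately left unset in the worker topology so the topology controller doesn't fight over MachineDeployment.spec.replicas:

```yaml
spec:
  topology:
    class: my-cluster-class     # illustrative ClusterClass name
    version: v1.25.0
    workers:
      machineDeployments:
        - class: default-worker
          name: md-0
          # replicas intentionally not set: the topology controller leaves
          # MachineDeployment.spec.replicas alone, so the autoscaler can own it
```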

We have a limitation regarding ClusterClass and the autoscaler, which is that today we can't set top-level annotations on MDs via Cluster.spec.topology to "fine-tune" the autoscaler. (@elmiko correct me if I'm wrong and the min/max annotations (I don't remember which) are mandatory; if they are, then the autoscaler really doesn't work right now with ClusterClass until we implement the annotation propagation)
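For reference, these are the node-group size annotations the cluster-autoscaler's Cluster API provider expects on the MachineDeployment (or MachineSet) itself; a sketch, assuming the usual annotation keys and an illustrative MD name:

```yaml
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineDeployment
metadata:
  name: md-0                    # illustrative; normally generated by the topology controller
  namespace: default
  annotations:
    # min/max node group size read by the cluster-autoscaler Cluster API provider
    cluster.x-k8s.io/cluster-api-autoscaler-node-group-min-size: "1"
    cluster.x-k8s.io/cluster-api-autoscaler-node-group-max-size: "10"
```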

enxebre commented 2 years ago

Hey folks, great discussion. Can you help me understand and scope down what the concrete issue / use case is here?

Unexpected behaviour using kubectl scale with a topology managed cluster

When using kubectl scale machinedeployment or kubectl scale kcp with a topology managed cluster the command succeeds but does not have the expected result. Depending on how quickly the topology controller reacts to the change the resource either doesn't appear to scale at all or else it scales up and then scales down quickly (in line with replicas specified in the cluster topology).

To me, the above describes expected and reasonable behaviour. If you express intent for a resource that is managed by another entity (ClusterClass in this case), they will fight. That's kube by design; whether the entities are ClusterClass, kubectl, the autoscaler or anything else is circumstantial here. This is basically a tradeoff of using ClusterClass that should be documented: as a user you have deliberately chosen to use an authority for managing the underlying resources. Now, if we consider this insufficient/poor UX, I think we should focus on making our semantics/primitives more meaningful and flexible. I don't think we should be implementing non-standard kubectl logic or non-standard API behaviour (I'd find that to be truly unexpected behaviour).

But then, based on @sbueringer's comment, it seems we already support this more flexible use case, and the undesired UX @killianmuldoon describes that originated this issue already has a very reasonable alternative, i.e.

you just don't set a replica field in Cluster.spec.topology.workers..... The topology controller will create the MD without specifying replicas. Defaulting will set replicas to 1

So all seems fairly reasonable to me as it is (pending fixing the limitation on annotations Stefan mentioned). Now, there's some discussion around wanting a UX to use the Cluster as a scalable resource that signals to the underlying scalable resources: can we describe use cases for this? In my mind, you'll usually want to scale a particular scalable resource which would allocate a particular group of workloads (this is true for the autoscaler as well, it knows about homogeneous nodeGroups). This is possible today already by using the scale subresource on existing scalable resources MD/MS/MP (you just don't set a replica field in Cluster.spec.topology.workers) or by expressing intent in ClusterClass as the source of truth. Thinking of Cluster as a first-class scalable resource feels artificial to me.

elmiko commented 2 years ago

We have a limitation regarding ClusterClass and auto scaler, but that is that today we can't set top-level annotations on MDs via Cluster.spec.topology to "fine-tune" the autoscaler. (@elmiko correct me if I'm wrong and the min/max annotations (I don't remember which) are mandatory, if they are, then autoscaler really doesn't work right now with ClusterClass until we implemented the annotation propagation)

correct, they are required and must be on only the MD or MS.

k8s-triage-robot commented 1 year ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

- Mark this issue or PR as fresh with /remove-lifecycle stale
- Mark this issue or PR as rotten with /lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

fabriziopandini commented 6 months ago

/kind feature
/priority backlog

sbueringer commented 6 months ago

My tl;dr (correct me if I'm wrong)

A proper way to address this would be the following

We would have to make sure we block this request only if the topology controller actually cares about the replicas field.

So this means a webhook on the scale subresource of KCP & MD which blocks the scale request when:

a) the KCP / MD belongs to a ClusterClass-based cluster
b) the corresponding replica field is set in Cluster.spec.topology
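A rough sketch of how such a webhook could be registered against the scale subresources (webhook name, service, and path are placeholders; the handler itself would still have to look up the owning Cluster and check whether the corresponding replicas field is set in Cluster.spec.topology before rejecting the request):

```yaml
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: topology-scale-guard                  # placeholder
webhooks:
  - name: scale.topology.cluster.x-k8s.io     # placeholder
    admissionReviewVersions: ["v1"]
    sideEffects: None
    failurePolicy: Fail
    clientConfig:
      service:
        name: capi-webhook-service            # placeholder service/namespace/path
        namespace: capi-system
        path: /validate-scale
    rules:
      # PATCH requests surface as UPDATE operations in admission
      - apiGroups: ["cluster.x-k8s.io"]
        apiVersions: ["v1beta1"]
        operations: ["UPDATE"]
        resources: ["machinedeployments/scale"]
      - apiGroups: ["controlplane.cluster.x-k8s.io"]
        apiVersions: ["v1beta1"]
        operations: ["UPDATE"]
        resources: ["kubeadmcontrolplanes/scale"]
```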