
CAPI cluster autoscaler deletes nodes from clusters it doesn't have access to #5510

Open jonathanbeber opened 1 year ago

jonathanbeber commented 1 year ago

Which component are you using?:

cluster-autoscaler for Cluster API (CAPI).

What version of the component are you using?:

Component version: v1.26.1

What k8s version are you using (kubectl version)?:

v1.26.1

What environment is this in?:

CAPI clusters using Cluster API Provider AWS (CAPA).

There are at least 2 clusters involved in this setup. One is the management cluster, which is also managed by CAPI (self-managed). The second is the workload cluster, managed by the management cluster.

+------------+             +----------+
|    mgmt    |<--          | workload |
| ---------- |   |         |          |
|    CAPI    +------------>|          |
+------------+             +----------+

In this setup cluster-autoscaler was not configured with access to the workload cluster; it only has access to the management cluster's Kubernetes API via its service account. It can see the CAPI objects of all clusters and picks them up via its auto-discovery feature.
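
A minimal sketch of that deployment shape, assuming only in-cluster service-account credentials (no workload kubeconfig is mounted; everything beyond the flag shown is illustrative):

```sh
# Runs in the management cluster; discovers CAPI objects for every cluster
# it can see, but can only reach the management cluster's Kubernetes API.
cluster-autoscaler --cloud-provider=clusterapi
```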

What did you expect to happen?:

I would expect cluster-autoscaler not to consider clusters whose Kubernetes API it doesn't have access to.

What happened instead?:

Cluster autoscaler, via its autodiscovery feature, identifies MachineDeployments belonging to the workload cluster and, since it can't access that cluster's Kubernetes API, assumes their nodes are unregistered and deletes them after 15 minutes (the default, configurable via --max-node-provision-time).

I0214 20:53:39.989780       1 static_autoscaler.go:388] 4 unregistered nodes present
I0214 20:53:40.558558       1 node_instances_cache.go:150] Invalidate entry in cloud provider node instances cache MachineDeployment/default/as002-md-0
I0214 20:53:40.558597       1 static_autoscaler.go:693] Removing unregistered node aws:///us-west-2a/i-036e48326d72b8596
I0214 20:53:41.161843       1 node_instances_cache.go:150] Invalidate entry in cloud provider node instances cache MachineDeployment/default/as002-md-0
I0214 20:53:41.161875       1 static_autoscaler.go:693] Removing unregistered node aws:///us-west-2a/i-000951411145ad6ff
I0214 20:53:41.162028       1 clusterapi_controller.go:577] node "aws:///us-west-2a/i-000951411145ad6ff" is in nodegroup "MachineDeployment/default/as002-md-0"
I0214 20:53:41.298995       1 leaderelection.go:278] successfully renewed lease kube-system/cluster-autoscaler
I0214 20:53:41.355255       1 clusterapi_controller.go:577] node "aws:///us-west-2a/i-000951411145ad6ff" is in nodegroup "MachineDeployment/default/as002-md-0"
I0214 20:53:41.355389       1 clusterapi_controller.go:577] node "aws:///us-west-2a/i-000951411145ad6ff" is in nodegroup "MachineDeployment/default/as002-md-0"
I0214 20:53:41.758912       1 node_instances_cache.go:150] Invalidate entry in cloud provider node instances cache MachineDeployment/default/as002-md-0
I0214 20:53:41.758944       1 static_autoscaler.go:693] Removing unregistered node aws:///us-west-2a/i-03be732f299d68aa4

How to reproduce it (as minimally and precisely as possible):

  1. Create a cluster via CAPI and CAPA and make it self-managed.
  2. Create a second cluster managed by this first cluster.
  3. Deploy cluster-autoscaler to the management cluster with the --cloud-provider=clusterapi flag. No further config required.
  4. Annotate the MachineDeployments with cluster.x-k8s.io/cluster-api-autoscaler-node-group-[min|max]-size (a sketch follows this list).
  5. Check that within 15 minutes the MachineDeployment of the second cluster is scaled down to its minimum, even if its nodes are in use.
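
For step 4, a minimal sketch of the annotations, reusing the MachineDeployment name from the logs above (the min/max values are illustrative):

```sh
# Mark the node group as managed by the autoscaler and set its bounds.
kubectl annotate machinedeployment as002-md-0 --namespace default \
  cluster.x-k8s.io/cluster-api-autoscaler-node-group-min-size="1" \
  cluster.x-k8s.io/cluster-api-autoscaler-node-group-max-size="5"
```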

Anything else we need to know?:

It seems like cluster-autoscaler for CAPI currently does not plan to manage more than one cluster, since all the documented options allow providing just one kubeconfig. We should document that and consider disabling the autodiscovery feature by default.

This can lead to outages, since the deleted nodes might be running workloads.

I would like to work on this issue.

elmiko commented 1 year ago

~is the autoscaler managing a single cluster or all the clusters?~ oops, i saw topology later

in general, we do not recommend running a single autoscaler to cover multiple clusters. but, with that said, i would expect it to be ok for the autoscaler to manage a single cluster even if its host cluster manages multiple clusters (if the capi resources are isolated).

It seems like cluster-autoscaler for CAPI currently does not plan to manage more than one cluster, since all the documented options allow providing just one kubeconfig.

this is by design, but also we do allow specifying multiple kubeconfigs, though only for a single "cluster" deployment, e.g. one config for management and one config for workload.

We should document that and consider making the autodiscovery feature disabled by default.

agreed about documenting this. i'm not sure we want to disable autodiscovery as users have come to expect that the autoscaler works that way, but i do wonder if we couldn't be doing more to identify the cluster owner. i'm not sure that is desirable, but something to think about.

jonathanbeber commented 1 year ago

thank you for taking a look, @elmiko.

agreed about documenting this.

Ok, perfect, I could already start working on at least this one. I guess the docs should warn that if the management cluster manages multiple clusters we recommend multiple cluster-autoscaler (CA) deployments, and that each instance of CA must filter its own cluster with --node-group-auto-discovery=clusterapi:clusterName=${CLUSTER_NAME}. WDYT?
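
A minimal sketch of that recommendation, one deployment per workload cluster (the clusterName key is the documented autodiscovery syntax; the deployment layout is illustrative):

```sh
# Each autoscaler instance only discovers node groups belonging to the
# cluster named in its own CLUSTER_NAME.
cluster-autoscaler \
  --cloud-provider=clusterapi \
  --node-group-auto-discovery=clusterapi:clusterName=${CLUSTER_NAME}
```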

i'm not sure we want to disable autodiscovery as users have come to expect that the autoscaler works that way, but i do wonder if we couldn't be doing more to identify the cluster owner. i'm not sure that is desirable, but something to think about.

I think I'm missing some context here, but I don't see any benefit of autodiscovery across multiple clusters, since CA won't be able to manage more than one cluster. I agree with you that autodiscovery of nodepools in the same cluster is something I would expect.

I'm thinking of two possible solutions here, since IMHO documentation alone is not a valid solution, because this behavior can lead to serious incidents by deleting nodes that are running workloads.

  1. If nodepools of two different clusters are identified, the cluster autoscaler refuses to downscale unregistered nodes (I don't like this one very much);
  2. The cluster autoscaler reads the annotation `cluster.x-k8s.io/cluster-name` from a registered node before accepting to manage a nodepool. This way we know that CA has access to the right Kubernetes API. (A sketch of this check follows.)
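
A rough sketch of the lookup behind option 2, assuming CAPI has set the cluster-name annotation on registered nodes (the eventual fix would do the equivalent inside the provider rather than via kubectl):

```sh
# Recover the cluster name from a registered node; only node groups whose
# MachineDeployments belong to this cluster should be managed.
kubectl get nodes \
  -o jsonpath='{.items[0].metadata.annotations.cluster\.x-k8s\.io/cluster-name}'
```
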
elmiko commented 1 year ago

Ok, perfect, I could already start working on at least this one. I guess the docs should warn that if the management cluster manages multiple clusters we recommend multiple cluster-autoscaler (CA) deployments, and that each instance of CA must filter its own cluster with --node-group-auto-discovery=clusterapi:clusterName=${CLUSTER_NAME}. WDYT?

+1, starting with a docs patch would be great imo. instructing users to filter on their clusterName as part of the autodiscovery also seems like a win.

I think I'm missing some context here, but I don't see any benefit of autodiscovery across multiple clusters, since CA won't be able to manage more than one cluster. I agree with you that autodiscovery of nodepools in the same cluster is something I would expect.

yeah, i wasn't suggesting that we should make it more tolerant to multi-cluster setups. i was saying what you suggested about adding the clusterName into the autodiscovery in a less eloquent manner ;)

I'm thinking of two possible solutions here, since IMHO documentation alone is not a valid solution, because this behavior can lead to serious incidents by deleting nodes that are running workloads.

1. If nodepools of two different clusters are identified, the cluster autoscaler refuses to downscale unregistered nodes (I don't like this one very much);

2. The cluster autoscaler reads the annotation `cluster.x-k8s.io/cluster-name` from a registered node before accepting to manage a nodepool. This way we know that CA has access to the right Kubernetes API.

i like option two, but then i wonder how that would work in a scale from zero scenario?

jonathanbeber commented 1 year ago

but then i wonder how that would work in a scale from zero scenario?

I don't think scaling from zero would be a problem, because at least one control plane node has to be registered, right? I can't think of a Kubernetes cluster that can scale all its nodes to 0; I can imagine scaling just some nodepools to 0. Do you think I might be missing something?

elmiko commented 1 year ago

i was thinking that perhaps we would need a way to signal the cluster-name from the scalable resource so that when a node group is scaling from zero replicas the autoscaler could ensure that it is focused on the proper node group. but, i think you are correct about grabbing the cluster name from a control plane node or similar, that should be relatively safe.
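
for what it's worth, the scalable resource itself already carries the cluster name even at zero replicas, so that signal is available today; a minimal sketch (MachineDeployment name reused from the logs above):

```sh
# spec.clusterName is readable from the management cluster regardless of
# replica count, so a scale-from-zero node group can still be matched to
# its cluster.
kubectl get machinedeployment as002-md-0 --namespace default \
  -o jsonpath='{.spec.clusterName}'
```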

one of my main concerns here is also to ensure that we don't make this too impenetrable for others to understand; ideally we will solve this in a manner that makes it clear to users how the cluster name factors into the decisions.

jonathanbeber commented 1 year ago

one of my main concerns here is also to ensure that we don't make this too impenetrable for others to understand; ideally we will solve this in a manner that makes it clear to users how the cluster name factors into the decisions.

That's a very good concern!

I would very much like to have it clear for users too, but I can't think of any way to make it clearer other than documentation and good logs, saying something like "nodepool foo-bar is not considered for scaling up/down because it refers to a cluster different from the one identified in the workload cluster". I believe it is still not 100% clear, but the term workload cluster is defined in the docs, the same docs where we will provide further details on how we identify the cluster name (via the label on any registered node).

k8s-triage-robot commented 1 year ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot commented 1 year ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

jonathanbeber commented 1 year ago

/remove-lifecycle rotten

k8s-triage-robot commented 7 months ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

jimmidyson commented 7 months ago

/remove-lifecycle stale

elmiko commented 7 months ago

i think this is a good issue to keep open, we just need to find someone who might want to propose the docs and logging change.

k8s-triage-robot commented 2 months ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

elmiko commented 2 months ago

still would like to see this fixed

/remove-lifecycle stale
/help

k8s-ci-robot commented 2 months ago

@elmiko: This request has been marked as needing help from a contributor.

Guidelines

For more details on the requirements of such an issue, please see here and ensure that they are met.

If this request no longer meets these requirements, the label can be removed by commenting with the /remove-help command.

In response to [this](https://github.com/kubernetes/autoscaler/issues/5510):

> still would like to see this fixed
>
> /remove-lifecycle stale
> /help

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository.