kubecost / cluster-turndown

Automated turndown of Kubernetes clusters on specific schedules.
Apache License 2.0

GKE Autoscaling, creation of new node pool not working #32

Open jwgreene opened 3 years ago

jwgreene commented 3 years ago

The documentation states, under "Managed Cluster Strategy (e.g. GKE + EKS)":

> When the turndown schedule occurs, a new node pool with a single g1-small node is created. Taints are added to this node to only allow specific pods to be scheduled there. We update our cluster-turndown deployment such that the turndown pod is allowed to schedule on the singleton node. Once the pod is moved to the new node, it will start back up and resume scaledown. This is done by cordoning all nodes in the cluster (other than our new g1-small node), and then reducing the node pool sizes to 0.

However, when I add a new schedule (on cluster-turndown 1.2.1, but this also happens on the 1.3 snapshot), I see a label being added to one of my existing nodes rather than a new micro instance being created like it used to be. Is this a bug, or an expected change?
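For reference, a turndown schedule is created as a `TurndownSchedule` custom resource. A minimal sketch based on the project README (the exact apiVersion and field values may differ between versions):

```yaml
apiVersion: kubecost.k8s.io/v1alpha1
kind: TurndownSchedule
metadata:
  name: example-schedule
  finalizers:
    - "finalizer.kubecost.k8s.io"
spec:
  start: 2021-01-23T00:00:00Z   # when turndown begins (UTC)
  end: 2021-01-25T12:00:00Z     # when turnup restores the cluster (UTC)
  repeat: weekly                # none | daily | weekly
```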

Thanks

jwgreene commented 3 years ago

Logs are showing:

```
I0121 15:16:01.516995 1 namedlogger.go:24] [TurndownScheduler] Already running on correct turndown host node. No need to setup environment.
I0121 15:16:01.517021 1 namedlogger.go:24] [Turndown] Scaling Down Cluster Now
I0121 15:16:01.523464 1 namedlogger.go:24] [GKEClusterProvider] Loading node pools for: [ProjectID: monitoring-development, Zone: us-east1, ClusterID: monitoring-development]
I0121 15:16:01.585311 1 namedlogger.go:24] [Turndown] Found Cluster-AutoScaler. Flattening Cluster...
I0121 15:16:01.585333 1 namedlogger.go:32] [Flattener] Starting to Flatten All Deployments...
I0121 15:16:05.798439 1 namedlogger.go:32] [Flattener] Starting to Flatten All DaemonSets...
I0121 15:16:07.806832 1 namedlogger.go:32] [Flattener] Starting to Suspend All Jobs...
I0121 15:16:07.826911 1 namedlogger.go:24] [Turndown] Resizing all non-autoscaling node groups to 0...
```

And on a node in my node pool I see the label: cluster-turndown-node=true

As far as I know, that label should not be added to a node in my active pool, since turndown should create a new pool with a micro instance.
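The node carrying that label can be confirmed with a standard kubectl label selector, e.g.:

```sh
kubectl get nodes -l cluster-turndown-node=true
```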

dwbrown2 commented 3 years ago

I believe this is working as intended (WAI), but I'll let @mbolt35 add more detail!

jwgreene commented 3 years ago

If so, it feels like the documentation is out of date. It's been a few months since I last used this, but it did at one time create a new node pool. Running on a node in the current pool is a bit more expensive than a micro instance :)

mbolt35 commented 3 years ago

@jwgreene A new node pool will not be created if you have the cluster-autoscaler enabled. Instead, we "flatten" deployments/daemonsets/jobs/statefulsets, etc... to allow the autoscaler to do the work. We noted early on that we didn't want to get into a tug of war spinning up and down nodes with the autoscaler, so instead we try our best to reduce usage on the cluster to allow the autoscaler to kick in. Hope this helps!
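Conceptually, "flattening" amounts to scaling workloads to zero while recording their previous size so they can be restored at turnup, which leaves the autoscaler free to drain the now-idle nodes. A rough sketch of the idea for a single Deployment (not the actual implementation; the annotation key and deployment name are made up):

```sh
# Turndown: remember the current replica count, then scale to zero.
replicas=$(kubectl get deployment my-app -o jsonpath='{.spec.replicas}')
kubectl annotate deployment my-app previous-replicas="$replicas" --overwrite
kubectl scale deployment my-app --replicas=0

# Turnup: restore the recorded replica count.
replicas=$(kubectl get deployment my-app -o jsonpath="{.metadata.annotations['previous-replicas']}")
kubectl scale deployment my-app --replicas="$replicas"
```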

jwgreene commented 3 years ago

So what taint do I need to add to the node to get it to work? I tried adding a second node pool (with a micro instance) with what I thought was the proper taint, and the pod still spun up on the initial node pool.
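For context, pinning a pod to a dedicated tainted node generally takes both a matching toleration and a node selector on the pod spec. An illustrative example with a hypothetical taint key (the key and label turndown actually uses may differ):

```yaml
# Taint applied to the dedicated node, e.g.:
#   kubectl taint nodes <node-name> cluster-turndown=true:NoSchedule
# Corresponding fields on the turndown pod spec:
spec:
  nodeSelector:
    cluster-turndown-node: "true"
  tolerations:
    - key: "cluster-turndown"
      operator: "Equal"
      value: "true"
      effect: "NoSchedule"
```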

mbolt35 commented 3 years ago

@jwgreene The process of "flattening" should occur during the scheduled time. The autoscaler can be a bit passive sometimes and not scale down, which is something we've wrestled with while testing. A lot of the behavior depends on the provider (I don't think managed GKE can scale down to 0 nodes via the autoscaler). I will look into more details and possible improvements we could make here. In short, in your current scenario the turndown pod ensures that it's scheduled on a pool with the autoscaler enabled (on GKE) so that it is never removed (i.e. it runs on the only remaining node). We could possibly add some customization around this process to allow you to specify a "home node" for the pod.

jwgreene commented 3 years ago

OK, thanks. Our current process is to shut down the entire cluster on the weekends, but I would prefer to let the autoscaler do its work and get the cluster down to a small size rather than leaving it as large as it is.

michaelmdresser commented 2 years ago

Apologies for resurrecting an old issue. I just spoke with another user of turndown and they had similar questions about the behavior with autoscaling. I've tried to improve the documentation a little bit further in https://github.com/kubecost/cluster-turndown/pull/35.

One further possible improvement is to have turndown (when configured to do so) edit the minimum node count of autoscaled node pools, though this might require a complicated provider integration. @mbolt35 if you have insight here I'd love to hear it.
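For GKE, that improvement would effectively automate something like the following gcloud invocation, which sets the autoscaling bounds on a node pool (cluster, pool, and zone names are placeholders):

```sh
gcloud container clusters update my-cluster \
  --node-pool default-pool \
  --enable-autoscaling \
  --min-nodes 0 \
  --max-nodes 10 \
  --zone us-east1-b
```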