Select workloads to keep alive during turndown

kubecost / cluster-turndown

Automated turndown of Kubernetes clusters on specific schedules.

Apache License 2.0

Select workloads to keep alive during turndown #36

Open michaelmdresser opened 2 years ago

michaelmdresser commented 2 years ago

I'm relaying a user request.

They would like to be able to select specific workloads to keep alive (e.g. Kubecost, Prometheus, Grafana) during a turndown.

This behavior is a little complicated to implement, especially in a non-autoscaling environment. We could initially only support this feature in autoscaling environments but I'd need to do some research and testing.

Roadmap positioning of this feature isn't known yet, but I wanted to record it somewhere!

dwbrown2 commented 2 years ago

Have definitely heard this theme before... I think it can be broadened to say keep this set of labels, annotations, namespaces, etc. alive

michaelmdresser commented 2 years ago

Yes, absolutely agree with that broader definition. I'd bias towards supporting label-specified workloads/namespaces to start!
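A minimal sketch of what label- and namespace-based selection could look like, assuming a hypothetical `KeepAlivePolicy` type and `should_keep_alive` helper (neither exists in cluster-turndown; the names and the matching rule are illustrative only):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KeepAlivePolicy:
    """Hypothetical policy: namespaces and label pairs to keep alive."""
    namespaces: frozenset = frozenset()
    labels: tuple = ()  # (key, value) pairs; matching any one pair is enough


def should_keep_alive(namespace: str, labels: dict, policy: KeepAlivePolicy) -> bool:
    """Return True if a workload matches the keep-alive policy."""
    # A workload in a protected namespace is always kept.
    if namespace in policy.namespaces:
        return True
    # Otherwise keep it if any protected label pair is present on the workload.
    return any(labels.get(key) == value for key, value in policy.labels)


# Example policy: keep everything in "kubecost", plus anything labeled app=prometheus.
policy = KeepAlivePolicy(
    namespaces=frozenset({"kubecost"}),
    labels=(("app", "prometheus"),),
)
```

With this policy, a pod in the `kubecost` namespace or a pod labeled `app=prometheus` in any namespace would be kept; everything else would be eligible for turndown.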

AjayTripathy commented 2 years ago

Isn't there a standard "do not evict" label for the cluster autoscaler that we can leverage?

dwbrown2 commented 2 years ago

Interesting ideas!

mbolt35 commented 2 years ago

> Isn't there a standard "do not evict" label for the cluster autoscaler that we can leverage?

I don't think there is a "do not evict" label for the autoscaler, but there is a "safe-to-evict" annotation (cluster-autoscaler.kubernetes.io/safe-to-evict) that can be used to tell the autoscaler it can evict a pod if necessary.

This doesn't seem like a valid use case for turndown, which is designed to shut the cluster down so that it isn't being used. It sounds like the behavior they're looking for is: "I want my cluster to scale down to a set of workloads." This can be accomplished by marking the workloads they don't mind being scaled down with the safe-to-evict annotation. This doesn't require cluster-turndown at all.
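As a rough sketch of the annotation approach, the snippet below builds a patch body that sets the cluster autoscaler's `cluster-autoscaler.kubernetes.io/safe-to-evict` annotation on a workload's pod template. The helper name `safe_to_evict_patch` and the choice to patch a Deployment-style pod template are assumptions for illustration, not part of cluster-turndown. Setting the value to `"false"` on the workloads that should stay alive asks the autoscaler not to remove the nodes they run on; `"true"` marks a workload as fine to evict.

```python
import json

# Annotation key read by the Kubernetes cluster autoscaler.
SAFE_TO_EVICT = "cluster-autoscaler.kubernetes.io/safe-to-evict"


def safe_to_evict_patch(keep_alive: bool) -> dict:
    """Build a merge-patch body for a workload's pod template (hypothetical helper).

    keep_alive=True  -> "false": the autoscaler should not evict these pods.
    keep_alive=False -> "true":  the autoscaler may evict these pods.
    """
    value = "false" if keep_alive else "true"
    return {
        "spec": {
            "template": {
                "metadata": {"annotations": {SAFE_TO_EVICT: value}}
            }
        }
    }


# Patch body for a workload that may be scaled away.
evictable_patch = json.dumps(safe_to_evict_patch(keep_alive=False))
```

The resulting JSON could be supplied as a merge patch to each workload, for example through `kubectl patch` or a Kubernetes client library.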

mbolt35 commented 2 years ago

If we're only talking about clusters without autoscaling nodepools, it does seem like we've already written a lot of the foundational code that does this in the cluster-controller component (similar to the way one-click cluster sizing worked):

1. Cluster right-size using predefined workloads (annotations, labels, etc.). [This might require a small amount of work.]
2. Send the cluster spec to cluster-controller, which will automatically delete/resize nodepools.

This isn't on a schedule, and cluster-turndown doesn't have an implementation for pulling a cluster spec from cluster right-sizing. Either way, it still feels like a cluster-autoscaler solution with safe-to-evict annotations is the better way to go.