kubernetes / autoscaler

Autoscaling components for Kubernetes
Apache License 2.0

Cluster Autoscaler CAPI provider should support scaling to and from zero nodes #3150

Closed. elmiko closed this issue 2 years ago.

elmiko commented 4 years ago

As a user I would like the ability to have my MachineSets and MachineDeployments scale to and from zero replicas. I should be able to set a minimum size of 0 for a Machine[Set|Deployment] and have the autoscaler take the appropriate actions.

This issue is CAPI provider-specific, and will require some modifications to the individual CAPI providers before it can be merged into the autoscaler code.

elmiko commented 4 years ago

/area provider/cluster-api

seh commented 4 years ago

How will the autoscaler determine which labels and taints to expect on nodes for its scheduling simulation? I see the taints may be available in the kubeadm NodeRegistrationOptions type.

elmiko commented 4 years ago

@seh, if i understand your question correctly, this information is handled through the labels and taints on the MachineSets and MachineDeployments. when these resources are set to a minimum size of 0 and the autoscaler has removed all the Machines and Nodes, the MachineSets and MachineDeployments still carry labels and taints which are used during the scale-up process. the labels and taints will be applied to the new Node resources as they are created.
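To illustrate the mechanism described here, a minimal Go sketch of how an autoscaler could build a placeholder Node for its scheduling simulation from the labels and taints carried on the scalable resource when the group is at zero replicas. The `nodeGroupSpec` type and field names are hypothetical, not the provider's actual code:

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// nodeGroupSpec holds the pieces of a MachineSet/MachineDeployment that matter
// when there are zero Machines to copy a real Node from. Illustrative only,
// not the CAPI API types.
type nodeGroupSpec struct {
	Name   string
	Labels map[string]string
	Taints []corev1.Taint
}

// templateNode builds a placeholder Node the scheduling simulation can use when
// a node group is scaled to zero: the labels and taints come from the scalable
// resource rather than from an existing Node.
func templateNode(spec nodeGroupSpec) *corev1.Node {
	node := &corev1.Node{
		ObjectMeta: metav1.ObjectMeta{
			Name:   fmt.Sprintf("%s-template", spec.Name),
			Labels: map[string]string{},
		},
	}
	for k, v := range spec.Labels {
		node.Labels[k] = v
	}
	node.Spec.Taints = append(node.Spec.Taints, spec.Taints...)
	return node
}

func main() {
	n := templateNode(nodeGroupSpec{
		Name:   "md-gpu-workers", // hypothetical node group name
		Labels: map[string]string{"node-role.example.com/gpu": ""},
		Taints: []corev1.Taint{{Key: "gpu", Value: "true", Effect: corev1.TaintEffectNoSchedule}},
	})
	fmt.Println(n.Name, n.Labels, n.Spec.Taints)
}
```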

seh commented 4 years ago

Before asking that question, when I went looking at the newest MachineSet type definition, I didn't see anything there about taints. Drilling down further into MachineSpec, it's not there either.

The only place I could find them was in the kubeadm NodeRegistrationOptions type. That's why I asked where you'll find the taints. Did I miss a pertinent field here?

elmiko commented 4 years ago

Did I miss a pertinent field here?

no, i don't think you missed something, i think i may have missed something ;)

i have been working from a branch of the cluster-api code to test this behavior locally and with openshift. to make this work in our branch, we have the Taints persisted at the MachineSpec level. i think there will need to be some work done in the cluster-api project to expose this functionality, or at least a little deeper research.

there are other changes that will need to happen in CAPI as well, mainly around saving information about cpu/memory/gpu. your point about the taints is well placed though, i will add this to the list of changes.

seh commented 4 years ago

For the machine resources, I figured that we'd do something like dive down to figure out the cloud provider and machine/instance type, and then consult the catalogs available elsewhere within the cluster autoscaler. I'm most familiar with AWS, and for that provider there used to be a static (generated) catalog, but now we fetch it dynamically via the AWS API when the program starts. With that catalog, you can learn of the machine's promised capabilities.

Perhaps, though, in the interest of eliminating dependencies among providers, the Cluster API provider would be blind to that information, which would be an unfortunate loss.
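For reference, the catalog approach mentioned above boils down to a static lookup keyed by instance type. A rough Go sketch with illustrative names and values, not the AWS provider's real generated tables:

```go
package main

import "fmt"

// instanceInfo mirrors the kind of data a provider catalog promises for a
// machine type; the entries below are illustrative, not an authoritative list.
type instanceInfo struct {
	VCPU     int64
	MemoryMB int64
	GPU      int64
}

// catalog stands in for a static (generated) instance-type catalog; a dynamic
// variant would build the same map from a cloud API call at startup.
var catalog = map[string]instanceInfo{
	"m5.large":   {VCPU: 2, MemoryMB: 8192},
	"m5.xlarge":  {VCPU: 4, MemoryMB: 16384},
	"p3.2xlarge": {VCPU: 8, MemoryMB: 62464, GPU: 1},
}

// lookup returns the promised capabilities for an instance type, or an error
// if the catalog has fallen out of date.
func lookup(instanceType string) (instanceInfo, error) {
	info, ok := catalog[instanceType]
	if !ok {
		return instanceInfo{}, fmt.Errorf("unknown instance type %q", instanceType)
	}
	return info, nil
}

func main() {
	info, err := lookup("m5.xlarge")
	if err != nil {
		panic(err)
	}
	fmt.Printf("%+v\n", info)
}
```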

elmiko commented 4 years ago

for the machine/instance resources, the solution i am working from currently is that the individual providers on the CAPI side will populate annotations in the Machine[Set|Deployment] that describe the cpu, memory, gpu, etc.

the method i am currently using has lookup tables for each provider (contained within the provider code) to assist in creating the resource requirements. i think having these values be dynamically populated by the CAPI side of things would certainly be worth looking into. ultimately though, the idea would be for each provider to own their implementation of the resource requirements, with a group of standard annotations that the autoscaler can use to assist in creating the machines for that group.

the information does come from the CAPI providers though, not from the autoscaler providers.
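A hedged sketch of this annotation idea: parsing provider-populated capacity annotations on a Machine[Set|Deployment] into the resource list the autoscaler needs when the group is at zero replicas. The annotation keys are placeholders, since the actual names would be agreed on the CAPI side:

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// Placeholder annotation keys for illustration only; the real names would be
// settled in the CAPI enhancement.
const (
	cpuAnnotation    = "capacity.example.com/cpu"
	memoryAnnotation = "capacity.example.com/memory"
	gpuAnnotation    = "capacity.example.com/gpu"
)

// capacityFromAnnotations turns provider-populated annotations on a
// Machine[Set|Deployment] into the resource list the autoscaler would use to
// build a node template when the group has zero replicas.
func capacityFromAnnotations(annotations map[string]string) (corev1.ResourceList, error) {
	capacity := corev1.ResourceList{}
	for key, resourceName := range map[string]corev1.ResourceName{
		cpuAnnotation:    corev1.ResourceCPU,
		memoryAnnotation: corev1.ResourceMemory,
		gpuAnnotation:    "nvidia.com/gpu", // the GPU resource name is provider-specific
	} {
		value, ok := annotations[key]
		if !ok {
			continue
		}
		quantity, err := resource.ParseQuantity(value)
		if err != nil {
			return nil, fmt.Errorf("parsing %s=%q: %w", key, value, err)
		}
		capacity[resourceName] = quantity
	}
	return capacity, nil
}

func main() {
	capacity, err := capacityFromAnnotations(map[string]string{
		cpuAnnotation:    "4",
		memoryAnnotation: "16384Mi",
		gpuAnnotation:    "1",
	})
	if err != nil {
		panic(err)
	}
	fmt.Println(capacity)
}
```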

seh commented 4 years ago

Understood. So long as it's all accurate and not too hard to maintain, that sounds fine.

What we ran into with the AWS provider for the autoscaler was that the catalog would fall out of step, which required generating fresh code, releasing a new autoscaler version, and then deploying that new container image version into clusters. AWS was coming out with new instance types often enough that that whole process felt too onerous. It seems that these new instance types come in waves. It's hard to balance the threat of falling out of date with the threat of the catalog fetching and parsing failing at run time.

elmiko commented 4 years ago

that's an excellent point about the catalog falling out of step. if i understand the provider implementations through CAPI properly, and i might not ;) , we are using values for cpu, memory, etc, that the individual CAPI providers then turn into actual instance information at the cloud provider layer. so, in theory, this could be a call to the CAPI provider at creation time, e.g. "give me a Machine that has X cpu slices, Y ram, and Z gpus", then the CAPI provider could either use a lookup table if appropriate or make some dynamic call to the cloud provider api.

edit: added some context to the overloaded "provider" terms
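The "give me a Machine with X cpu, Y ram, Z gpus" flow could look roughly like the sketch below, where the list of offerings might come from a static table or a dynamic call to the cloud provider API; all names and values are illustrative:

```go
package main

import "fmt"

// requested describes the "give me a Machine with X cpu, Y ram, Z gpus"
// request discussed above; shapes and names are hypothetical.
type requested struct {
	CPU      int64
	MemoryMB int64
	GPU      int64
}

// offering is one machine/instance type a CAPI provider knows how to create.
type offering struct {
	Name     string
	CPU      int64
	MemoryMB int64
	GPU      int64
}

// pickInstanceType returns the first offering that satisfies the request. A
// CAPI provider might back the offerings with a static lookup table or a
// dynamic cloud API call; the caller does not need to know which.
func pickInstanceType(req requested, offerings []offering) (string, error) {
	for _, o := range offerings {
		if o.CPU >= req.CPU && o.MemoryMB >= req.MemoryMB && o.GPU >= req.GPU {
			return o.Name, nil
		}
	}
	return "", fmt.Errorf("no offering satisfies cpu=%d memMB=%d gpu=%d", req.CPU, req.MemoryMB, req.GPU)
}

func main() {
	offerings := []offering{ // illustrative values, ordered smallest first
		{Name: "small", CPU: 2, MemoryMB: 8192},
		{Name: "large", CPU: 8, MemoryMB: 32768, GPU: 1},
	}
	name, err := pickInstanceType(requested{CPU: 4, MemoryMB: 16384}, offerings)
	if err != nil {
		panic(err)
	}
	fmt.Println(name) // "large"
}
```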

seh commented 4 years ago

to make this work in our branch, we have the Taints persisted at the MachineSpec level. i think there will need to be some work done in the cluster-api project to expose this functionality, or at least a little deeper research.

Are there any open CAPI issues about this gap? Do you know if anyone is working on exposing the node taints and labels there? (Perhaps we can already get the labels.)

elmiko commented 4 years ago

Are there any open CAPI issues about this gap? Do you know if anyone is working on exposing the node taints and labels there? (Perhaps we can already get the labels.)

i do not think issues have been opened on the CAPI side yet; there will need to be some discussion there about passing information about the node sizes through the CAPI resources. i am working from a proof of concept that has this working for aws, gcp, and azure, in which we use annotations for passing this information.

ideally i would like to contribute these patches back to the CAPI project, and bring the associated changes here as well, but i think we need to have a discussion on the CAPI side about this as it will require changes to several repos and some agreement about the method for passing information.

and we haven't even touched on the taints yet ;)

seh commented 4 years ago

i think we need to have a discussion on the CAPI side about this as it will require changes to several repos and some agreement about the method for passing information.

Would you mind if I bring this up for discussion in the "cluster-api" Slack channel? I'd like to get a feel for how much work and resistance lies ahead, as I don't think we can adopt the cluster autoscaler with CAPI until we close this gap.

elmiko commented 4 years ago

Would you mind if I bring this up for discussion in the "cluster-api" Slack channel? I'd like to get a feel for how much work and resistance lies ahead, as I don't think we can adopt the cluster autoscaler with CAPI until we close this gap.

please do!

if you'd like, we can bring this up during the weekly meeting today as well?

elmiko commented 4 years ago

@seh just wanted to let you know that we talked about this at the CAPI meeting today, i don't think we have consensus yet but i didn't hear any hard objections. i think the next steps will be to do a little research around some other approaches to gather the cpu/mem/gpu requirements, and then create an enhancement proposal to discuss with the CAPI team.

CAPI meeting minutes 2020-06-10

seh commented 4 years ago

That's great to hear. I'm sorry I wasn't able to attend the meeting today. I do see the topic covered in the agenda/minutes, though, so thank you for bringing it up.

I don't know yet what I can do to help make progress on this front. I have experience with kubeadm and the cluster autoscaler, but little with CAPI and CAPA so far. If you'd like review or help with the KEP, please let me know.

elmiko commented 4 years ago

I don't know yet what I can do to help make progress on this front. I have experience with kubeadm and the cluster autoscaler, but little with CAPI and CAPA so far. If you'd like review or help with the KEP, please let me know.

i think the next steps will be to make a formal proposal to the CAPI group for getting this change into their releases, and then coordinating the autoscaler changes. i'm happy to CC you on any issues that come up around this, and perhaps we can work to get them merged. if you are interested in getting more involved with the CAPI provider code, i'm sure we could collaborate on getting the necessary changes in place.

seh commented 4 years ago

I brought up some of these questions in the "cluster-api" Slack channel. See kubernetes-sigs/cluster-api#2461 for an overlapping request.

fejta-bot commented 4 years ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.

/lifecycle stale

elmiko commented 4 years ago

/remove-lifecycle stale

unixfox commented 3 years ago

Hello,

Sorry for the noise, but I just wanted to say that I'm also interested in this issue, mostly for deploying temporary workloads like Minecraft servers, coding environments (like GitHub Codespaces), and more.

This would also further close the feature gap between the self-hosted autoscaler and the autoscaler offered by managed Kubernetes solutions like DigitalOcean. For instance, thanks to projects like machine-controller that implement cluster-api, it's possible to use our own autoscaler on DigitalOcean and even on unsupported cloud providers like Scaleway, Hetzner, Linode, and more.

elmiko commented 3 years ago

@unixfox just by way of an update, i have been working on a proof of concept for scaling from zero with capi. it's been going slower than i expected, but i feel we have good consensus about the initial implementation and with any luck :four_leaf_clover: i should have something to show in early january.

fejta-bot commented 3 years ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

unixfox commented 3 years ago

/remove-lifecycle stale

elmiko commented 3 years ago

thanks @unixfox , i am still working on a PoC for this. i have made good progress though, hopefully have a demo soon.

elmiko commented 3 years ago

for those following this issue, i have created this PR https://github.com/kubernetes-sigs/cluster-api/pull/4283 based on a proof of concept i am working on.

fejta-bot commented 3 years ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale

unixfox commented 3 years ago

/remove-lifecycle stale

elmiko commented 3 years ago

thanks for the bump @unixfox , i continue to hack away on this. the design has changed slightly since the first round of work on the enhancement. i need to update the enhancement and would like to give a demo at an upcoming cluster-api meeting.

k8s-triage-robot commented 3 years ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

- Mark this issue as fresh with /remove-lifecycle stale
- Mark this issue as rotten with /lifecycle rotten
- Close this issue with /close
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

unixfox commented 3 years ago

Not sure if I would mark this issue as fresh or not. I stopped having the need for cluster autoscaler with Cluster API, but it's still a cool feature that has a lot of potential when trying to use cluster autoscaler on "unsupported" cloud providers.

elmiko commented 3 years ago

i am still working towards this issue. we almost have agreement on the cluster-api enhancement, and i think it will merge in the next few weeks. then i will post a PR for the implementation.

@unixfox sorry to hear that we weren't able to deliver this feature in a time that would be helpful to you. i do appreciate your support though =)

/remove-lifecycle stale

k8s-triage-robot commented 2 years ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

- Mark this issue as fresh with /remove-lifecycle stale
- Mark this issue as rotten with /lifecycle rotten
- Close this issue with /close
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

elmiko commented 2 years ago

the upstream cluster-api community has approved the proposal for the scale-from-zero feature. i am in the process of writing a patch that will satisfy the proposal, and also updating the kubemark provider to work with scaling from zero. i imagine this work won't be done till january, hopefully we will have it in for the 1.24 release of the autoscaler.

/remove-lifecycle stale

davidspek commented 2 years ago

@elmiko Do you have a link to a PR we can follow?

elmiko commented 2 years ago

@DavidSpek i am hoping to have the PR ready next week, you can follow my progress on this branch for now https://github.com/elmiko/kubernetes-autoscaler/tree/capi-scale-from-zero

i have it working, but i need to do some cleanups around the dynamic nature of the client, and also add some unit tests. there is a complicated problem to solve wherein we need the client to become aware of the machine template types after it has started watching machinedeployments/machinesets, so that we can accurately set up the informers to watch the templates. i have the basic mechanism working on my branch, i'm just trying to make the dynamic client better now.
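For readers following along, the informer problem described here could be approached roughly as below: read the infrastructureRef off an unstructured MachineDeployment, resolve the referenced template type to a resource (real code would use a discovery-backed RESTMapper), and lazily register an informer for it with a dynamic shared informer factory. This is a sketch of the idea, not the code on the linked branch:

```go
package main

import (
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/api/meta"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/dynamic/dynamicinformer"
	"k8s.io/client-go/tools/clientcmd"
)

// ensureTemplateInformer inspects a MachineDeployment we are already watching,
// discovers the infrastructure template type it references, and lazily starts
// an informer for that type. Real code would use a RESTMapper instead of
// UnsafeGuessKindToResource and would track which informers already exist.
func ensureTemplateInformer(factory dynamicinformer.DynamicSharedInformerFactory, md *unstructured.Unstructured, stop <-chan struct{}) (schema.GroupVersionResource, error) {
	apiVersion, _, err := unstructured.NestedString(md.Object, "spec", "template", "spec", "infrastructureRef", "apiVersion")
	if err != nil {
		return schema.GroupVersionResource{}, err
	}
	kind, _, err := unstructured.NestedString(md.Object, "spec", "template", "spec", "infrastructureRef", "kind")
	if err != nil {
		return schema.GroupVersionResource{}, err
	}
	gvk := schema.FromAPIVersionAndKind(apiVersion, kind)
	gvr, _ := meta.UnsafeGuessKindToResource(gvk)

	// ForResource registers the informer with the factory; Start is a no-op for
	// informers that are already running, so calling it again is safe.
	factory.ForResource(gvr)
	factory.Start(stop)
	return gvr, nil
}

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client, err := dynamic.NewForConfig(config)
	if err != nil {
		panic(err)
	}
	factory := dynamicinformer.NewDynamicSharedInformerFactory(client, 10*time.Minute)
	_ = factory // ensureTemplateInformer would be called from the MachineDeployment event handlers
	fmt.Println("informer factory ready")
}
```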

davidspek commented 2 years ago

@elmiko Thanks for the info. I hope to have some time to test your changes soon. Do you maybe have a link to the Cluster API docs for infrastructure providers to support scale from 0? I haven’t been able to find that myself.

elmiko commented 2 years ago

@DavidSpek my hope is that the enhancement[0] has enough details for a provider to implement scale from zero. if you find that there is detail lacking, please ping me as i would like to improve that doc =)

[0] https://github.com/kubernetes-sigs/cluster-api/blob/main/docs/proposals/20210310-opt-in-autoscaling-from-zero.md

davidspek commented 2 years ago

@elmiko Thanks for the doc, I think that’ll likely answer most of my questions. Has that proposal already been accepted? Or more importantly, can this already be implemented in infrastructure providers without needing to change anything in the cluster api core library?

elmiko commented 2 years ago

@DavidSpek yes it has been accepted, and no it should not require any changes in the core cluster-api.

i was able to implement scale from zero in the kubemark provider without modifying the core, you can see my PR here https://github.com/kubernetes-sigs/cluster-api-provider-kubemark/pull/30

davidspek commented 2 years ago

@elmiko Awesome, thank you very much for all the info and quick responses.

k8s-triage-robot commented 2 years ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

- Mark this issue as fresh with /remove-lifecycle stale
- Mark this issue as rotten with /lifecycle rotten
- Close this issue with /close
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

elmiko commented 2 years ago

The PR for this is currently under review: #4840

/remove-lifecycle stale