metal3-io / metal3-docs

Architecture documentation that describes the components being built under Metal³.
http://metal3.io
Apache License 2.0

proposal: add auto-scaling for MachineSets #83

Closed. mhrivnak closed this issue 3 years ago.

mhrivnak commented 4 years ago

In many clusters, it is desirable for the size of a MachineSet to always equal the number of matching BareMetalHosts. In such a scenario, the cluster owner wants all of their hardware to be provisioned and turned into Nodes, and they want to remove excess Machines in case they remove hosts from their cluster.

Rather than make some external process manage the size of MachineSets as BareMetalHosts come and go, we could create a small controller that (optionally) automatically ensures a MachineSet has a size equal to the number of matching BareMetalHosts.

The controller would be an additional Controller in this project. It would watch MachineSets as its primary resource, and if they have a particular annotation, ensure that their size equals the number of matching BareMetalHosts. It would watch BareMetalHosts as a secondary resource.
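
For concreteness, here is a very rough sketch of what that reconcile logic could look like, assuming controller-runtime. The annotation name, import paths, and host-matching logic are placeholders for illustration, not decided interfaces:

```go
// Sketch only: import paths vary by release, the annotation name is a
// placeholder, and "matching" hosts are hand-waved as "same namespace".
package controllers

import (
	"context"

	bmh "github.com/metal3-io/baremetal-operator/apis/metal3.io/v1alpha1"
	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// autoScaleAnnotation is a placeholder name; the real annotation would need
// to be agreed on as part of this proposal.
const autoScaleAnnotation = "metal3.io/autoscale-to-hosts"

// MachineSetScaler resizes annotated MachineSets to match the number of
// matching BareMetalHosts.
type MachineSetScaler struct {
	client.Client
}

func (r *MachineSetScaler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	var ms clusterv1.MachineSet
	if err := r.Get(ctx, req.NamespacedName, &ms); err != nil {
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}

	// Only MachineSets that opt in via the annotation are touched.
	if _, ok := ms.Annotations[autoScaleAnnotation]; !ok {
		return ctrl.Result{}, nil
	}

	// Count the BareMetalHosts this MachineSet could consume. A real
	// controller would reuse the host-selection rules the actuator applies.
	var hosts bmh.BareMetalHostList
	if err := r.List(ctx, &hosts, client.InNamespace(ms.Namespace)); err != nil {
		return ctrl.Result{}, err
	}
	desired := int32(len(hosts.Items))

	// Resize the MachineSet only when it is out of sync.
	if ms.Spec.Replicas == nil || *ms.Spec.Replicas != desired {
		ms.Spec.Replicas = &desired
		if err := r.Update(ctx, &ms); err != nil {
			return ctrl.Result{}, err
		}
	}
	return ctrl.Result{}, nil
}
```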

Thoughts?

dhellmann commented 4 years ago

This makes sense to me. I'm not sure if there are realistic use cases for having inventory in a cluster that isn't being consumed by the cluster. I can't really think of good reasons for doing that off the top of my head.

andybraren commented 4 years ago

I can imagine this enabling some great UX improvements. 👍

IIRC @dhellmann you once suggested the possibility of using a few Available/Ready (non-Provisioned) BMHs to create a brand new cluster using the first cluster as a sort of... bootstrap cluster? That might be easier than going through the usual install process and setting up a bootstrap node, and could be relatively common in (non-Edge) multi-cluster environments where nodes are roughly collocated. Maybe. 🤷‍♂️

This proposal doesn’t really preclude that flow I suppose. Some BMHs might just have to be deprovisioned before turning into a new cluster, which I’d expect to be a valid path regardless.

andybraren commented 4 years ago

If it ends up being the case that this autoscaling behavior is desired more often than not, would it make sense for it to be on by default and the annotation would turn it off instead?

dhellmann commented 4 years ago

> I can imagine this enabling some great UX improvements. 👍
>
> IIRC @dhellmann you once suggested the possibility of using a few Available/Ready (non-Provisioned) BMHs to create a brand new cluster using the first cluster as a sort of... bootstrap cluster? That might be easier than going through the usual install process and setting up a bootstrap node, and could be relatively common in (non-Edge) multi-cluster environments where nodes are roughly collocated. Maybe. 🤷‍♂️

The OpenShift installer doesn't really support that today, but it could be made to work. And the v1alpha2 work being done in metal3 already supports this flow for standard kubernetes clusters using a newer machine API.

> This proposal doesn’t really preclude that flow I suppose. Some BMHs might just have to be deprovisioned before turning into a new cluster, which I’d expect to be a valid path regardless.

Yeah, I think this proposal is asking us to go all-in on the idea that there is no unused inventory in a cluster.

zaneb commented 4 years ago

> In such a scenario, the cluster owner wants all of their hardware to be provisioned and turned into Nodes,

I'm not completely convinced by this - in the OpenStack world operators generally complain about the fact that all of the hardware is always provisioned and in use. There's a real cost (in terms of electrical power consumption) to running servers that are not needed. Currently the cluster-autoscaler does not integrate with the cluster-api, but when it does it seems to me that that's what you would want managing the MachineSet size.

One bare metal-specific scenario that this doesn't account for: in the simple case where you have only one cluster, it would be advantageous to be able to keep all of the Hosts provisioned and only toggle the power as you bring them in and out of the cluster. My first impression, though, is that this would need to be handled at a level below the Machine API.

I could buy that in a hyperconverged storage scenario you might want to keep all of the available Hosts in the cluster all of the time. I wonder if that could be better handled by rook (or whatever hyperconverged storage operator) tweaking the cluster-autoscaler parameters appropriately though, rather than writing a competing autoscaler.

> and they want to remove excess Machines in case they remove hosts from their cluster.

This is more understandable, although if there are insufficient Hosts available I don't think anything bad happens; you just get some Machines hanging around that can never turn into Nodes. I don't know whether or not the cluster-autoscaler will handle this case for you (i.e. notice that nothing bad is happening with the current number of Nodes, yet the MachineSet size is larger, therefore contract the MachineSet to match).

mhrivnak commented 4 years ago

Powering down hardware when not needed is a different story from deprovisioning hardware when not needed. Provisioning is expensive and time-consuming. If we apply a cluster-autoscaler to a bare metal cluster, once the autoscaler decides it needs more capacity, it could easily be 30+ minutes (worse in many cases) before the new capacity finishes provisioning and becomes available. Perhaps that's a constraint someone would be willing to live with, but we haven't received that request yet AFAIK. It seems like scale-by-provisioning with that level of latency would be a better fit for workloads that are time-of-day specific; if you can anticipate when demand will increase, you can proactively begin re-provisioning dark hardware (like the thermostat in my house that turns on the heat ~30 minutes before I wake up).

If we really wanted to pursue load-based cluster autoscaling with bare metal, I think we would be much better served looking at being able to suspend or hibernate systems rather than deprovision them.

In the meantime, we do have a multi-cluster use case where inventory is managed at a level above clusters. We're either going to build logic into that tool to scale MachineSets up and down as it adds and removes inventory in a specific cluster, or put that logic into the provider running on the cluster. I think doing it in the provider makes more sense and would enable more reuse. Since it's optional and opt-in (you have to annotate a MachineSet to get the behavior), there's no harm for someone who wants to scale their MachineSets another way.

zaneb commented 4 years ago

It feels like we might be missing a concept like a BareMetalHostSet - where each Host in the set would be provisioned with the configuration defined in the MachineSet, but not powered on until it is associated with a Machine. In a standalone cluster, you'd typically use something like what is proposed here, to make sure that all matching Hosts are always in the HostSet; in more specialised deployments or a multi-cluster environment you'd have a scheduler + reservation system that would assign Hosts to HostSets according to actual + projected demand (just need somebody to come up with an AI angle here ;).

I think we should try to avoid needing a baremetal-specific cluster-autoscaler cloud provider to implement these kinds of use cases.
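
Purely to make the HostSet idea above easier to picture, here is a hypothetical sketch of what such a resource might look like as Go API types; none of the names or fields are proposed interfaces, just illustration:

```go
// Hypothetical only: a possible shape for the BareMetalHostSet idea above.
// Nothing here is a designed or agreed API; type and field names are invented.
package v1alpha1

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// BareMetalHostSetSpec keeps a group of hosts pre-provisioned with a
// MachineSet's configuration, powered off until a Machine claims them.
type BareMetalHostSetSpec struct {
	// HostSelector picks the BareMetalHosts that belong to this set.
	HostSelector metav1.LabelSelector `json:"hostSelector"`

	// MachineSetRef names the MachineSet whose provisioning configuration
	// (image, user data) the member hosts are pre-provisioned with.
	MachineSetRef corev1.LocalObjectReference `json:"machineSetRef"`

	// PowerOffUntilClaimed keeps members powered off until they are
	// associated with a Machine.
	PowerOffUntilClaimed bool `json:"powerOffUntilClaimed,omitempty"`
}

// BareMetalHostSet is the (hypothetical) top-level resource.
type BareMetalHostSet struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec BareMetalHostSetSpec `json:"spec,omitempty"`
}
```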

dhellmann commented 4 years ago

Aren't at least some of the settings for the host time-sensitive? I'm thinking about the certs used for the host to identify itself and register with the cluster. Those have a limited lifetime, right? If we pre-provision a host, then power it off, when it boots again we might have to do more than power it on to make it usable.

zaneb commented 4 years ago

Good question. If there is stuff that is specific to a particular Machine passed in the userdata, then probably the best we can hope for is to be able to rebuild the host in Ironic to update the config-drive, but I assume that still involves rebooting into ironic-python-agent and back again, so it'd be roughly as slow as provisioning (IIUC it's mainly having to test that much RAM on startup that makes things so slow?).

mhrivnak commented 4 years ago

I can see something like that being valuable in some cases, but we're getting into use cases that go well beyond the scope of this request. If we're interested in pursuing the ability to pre-provision hosts, adjust cluster size based on load, or power down hosts for energy conservation, let's open issues for those use cases and discuss them there.

Many users will just want to provision whatever hardware they tell the cluster about, and that's the use case I'm trying to address. Rather than make it a two-step process of 1) add or remove a BareMetalHost and 2) increment or decrement the corresponding MachineSet (an inherently imperative operation BTW), we can reduce that to one step by letting the user declare with an annotation that they want their MachineSet size to match what's available.

Are there objections to that? It's opt-in, requiring an annotation to be placed on the MachineSet, so default behavior is unchanged. The code is going to be written; if not here, then other tools that want to add and remove BareMetalHosts will need to implement it. For example, multi-cluster tooling that's coming together will need this behavior. I'd rather do it here so we can provide a consistent behavior and let one implementation be re-used. I'm also happy to implement it as long as nobody objects.
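
To make the one-step flow concrete, here is a hedged sketch of the watch wiring (same placeholder annotation as in the earlier sketch, assuming a recent controller-runtime; the Watches/handler signatures differ between releases). Once the MachineSet is annotated, every BareMetalHost add or remove re-queues the annotated MachineSets, so there is no second manual step:

```go
// Sketch of the watch wiring; autoScaleAnnotation and MachineSetScaler are
// the placeholders from the earlier sketch. Import paths are indicative only.
package controllers

import (
	"context"

	bmh "github.com/metal3-io/baremetal-operator/apis/metal3.io/v1alpha1"
	"k8s.io/apimachinery/pkg/types"
	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/handler"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"
)

func (r *MachineSetScaler) SetupWithManager(mgr ctrl.Manager) error {
	return ctrl.NewControllerManagedBy(mgr).
		// MachineSets are the primary resource.
		For(&clusterv1.MachineSet{}).
		// BareMetalHosts are the secondary resource: any host change
		// re-queues the annotated MachineSets in the same namespace.
		Watches(&bmh.BareMetalHost{},
			handler.EnqueueRequestsFromMapFunc(r.machineSetsForHost)).
		Complete(r)
}

func (r *MachineSetScaler) machineSetsForHost(ctx context.Context, obj client.Object) []reconcile.Request {
	var sets clusterv1.MachineSetList
	if err := r.List(ctx, &sets, client.InNamespace(obj.GetNamespace())); err != nil {
		return nil
	}
	var reqs []reconcile.Request
	for _, ms := range sets.Items {
		// Only opted-in MachineSets get re-reconciled.
		if _, ok := ms.Annotations[autoScaleAnnotation]; ok {
			reqs = append(reqs, reconcile.Request{
				NamespacedName: types.NamespacedName{Namespace: ms.Namespace, Name: ms.Name},
			})
		}
	}
	return reqs
}
```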

zaneb commented 4 years ago

It seems like we're not near to figuring out the shape of the solution for those more complex use cases, so I agree we shouldn't block this.

stbenjam commented 4 years ago

/kind feature

metal3-io-bot commented 4 years ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues will close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

/lifecycle stale

dhellmann commented 4 years ago

/remove-lifecycle stale

metal3-io-bot commented 4 years ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues will close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

/lifecycle stale

metal3-io-bot commented 3 years ago

Stale issues close after 30d of inactivity. Reopen the issue with /reopen. Mark the issue as fresh with /remove-lifecycle stale.

/close

metal3-io-bot commented 3 years ago

@metal3-io-bot: Closing this issue.

In response to [this](https://github.com/metal3-io/metal3-docs/issues/83#issuecomment-717999757):

> Stale issues close after 30d of inactivity. Reopen the issue with `/reopen`. Mark the issue as fresh with `/remove-lifecycle stale`.
>
> /close

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.