Open tmmorin opened 1 year ago
/cc @hardys
/cc @furkatgofurov7
If the tool used to manage Metal3Machine and Metal3MachineTemplate resources does garbage collection, it is possible that it would remove a Metal3MachineTemplate before the Metal3Machine using it is removed. The result is that nodeReuse can't reliably be used.
I am unsure what tool is being used, but this is a breaking point of the node-reuse workflow in your use case. The feature was implemented as part of a "bigger" feature where we have the following use case:
Are you using the scale-in feature, by any chance, and would that help?
If the tool used to manage Metal3Machine and Metal3MachineTemplate resources does garbage collection, it is possible that it would remove a Metal3MachineTemplate before the Metal3Machine using it is removed. The result is that nodeReuse can't reliably be used.
I am unsure what tool is being used, [...]
In our case, a Helm chart defines the CAPI/Metal3 manifests (https://gitlab.com/sylva-projects/sylva-elements/helm-charts/sylva-capi-cluster), and it is instantiated by a FluxCD HelmRelease.
But my understanding is that the issue discussed here is quite generic: any tool doing garbage collection would have the same issue (e.g. kustomize used with a GitOps tool like FluxCD or Argo, or a CRD/operator generating CAPI/Metal3 manifests for a cluster, etc.). Also, I would tend to think that any tool not doing garbage collection on unused resource templates would be problematic: garbage collection is needed in order not to leak resources in the long run.
but this is a breaking point of the node-reuse workflow in your use case. The feature was implemented as part of a "bigger" feature where we have the following use case:
- We have a node where we want to keep the data on the disk available without it being cleaned, and later reuse the same node, since there is no point in keeping the disk of the node intact and not reusing it. So, in that sense, automated cleaning and node reuse are used together.
Yes, this is how we intend to use nodeReuse: along with disabling automated cleaning. The use case is explicitly to preserve PV data.
- The node-reuse feature is heavily dependent on the scale-in feature of the control plane/MD, where machines are removed one by one before new ones are created (delete-create method), so that the machine controller has a chance to set the label on the host and filter on it during the next provisioning.
Are you using the scale-in feature, by any chance, and would that help?
Well, yes, the intent definitely is to use maxSurge: 0 for this purpose (we worked with RKE2 bootstrap provider devs @richardcase @alexander-demicev to have this control plane support maxSurge).
Using maxSurge zero helps, yes, in the sense that it is necessary for nodeReuse to make any sense, but it does not address the problem raised in this issue: we need a way to preserve the Metal3MachineTemplates for nodeReuse to be effective.
/triage accepted /help
@Rozzii: This request has been marked as needing help from a contributor.
Please ensure the request meets the requirements listed here.
If this request no longer meets these requirements, the label can be removed by commenting with the /remove-help command.
@lentzi90 @kashifest @zaneb
Could finalizers or ownerRefs be used to block the deletion of a Metal3MachineTemplate until no Metal3Machine remains that refers to it via a cluster.x-k8s.io/cloned-from-name annotation?
I think this could be doable (finalizers, not ownerRefs). However, I have a few things to note first.
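To make the finalizer idea concrete, here is a minimal sketch, assuming a hypothetical templateFinalizer name and the standard CAPI cluster.x-k8s.io/cloned-from-name annotation; it is not the actual CAPM3 implementation, just an illustration of holding back template removal while a Metal3Machine still references it:

```go
package sketch

import (
	"context"

	infrav1 "github.com/metal3-io/cluster-api-provider-metal3/api/v1beta1"
	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/controller/controllerutil"
)

// templateFinalizer is a hypothetical finalizer name used only for this sketch.
const templateFinalizer = "metal3machinetemplate.infrastructure.cluster.x-k8s.io"

// reconcileTemplateDelete clears the finalizer on a Metal3MachineTemplate only
// once no Metal3Machine in the same namespace still points at it through the
// CAPI "cloned-from-name" annotation.
func reconcileTemplateDelete(ctx context.Context, c client.Client, tmpl *infrav1.Metal3MachineTemplate) error {
	if !controllerutil.ContainsFinalizer(tmpl, templateFinalizer) {
		return nil
	}

	machines := &infrav1.Metal3MachineList{}
	if err := c.List(ctx, machines, client.InNamespace(tmpl.Namespace)); err != nil {
		return err
	}

	for _, m := range machines.Items {
		if m.Annotations[clusterv1.TemplateClonedFromNameAnnotation] == tmpl.Name {
			// A Metal3Machine still references this template: keep blocking
			// deletion so nodeReuse can still be read from spec.nodeReuse.
			return nil
		}
	}

	controllerutil.RemoveFinalizer(tmpl, templateFinalizer)
	return c.Update(ctx, tmpl)
}
```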
Could finalizers or ownerRefs be used to block the deletion of a Metal3MachineTemplate until no Metal3Machine remains that refers to it via a cluster.x-k8s.io/cloned-from-name annotation?
I think this could be doable (finalizers, not ownerRefs).
:+1:
However, I have a few things to note first.
1. The infrastructure provider (CAPM3) is not normally interacting with the InfraMachineTemplate (M3MT). The fact that CAPM3 checks nodeReuse from it is an exception. Normally it is just CAPI that reads the InfraMachineTemplate and creates InfraMachines from it. This indicates to me that we need to consider carefully before we introduce more "unusual" behavior. Perhaps we should also consider removing the check we have for nodeReuse.
I had the same reaction.
I'm actually wondering if a solution would not be to ensure that the Metal3Machine keeps track of the nodeReuse intention expressed in the template, for instance by having a spec.nodeReuse field (?).
2. You talk about GC. That would to me indicate that the templates are no longer used, but in fact they are, so this seems like the wrong terminology. I understand the issue though: you want to delete the old template at the same time that you create the new one. In the "manual" case I would expect users to keep the old M3MT for at least one cycle so that it is easy to roll back in case of issues. I.e. if we start with m3mt-1 and roll to m3mt-2 I would expect m3mt-1 to be left until I start rolling to m3mt-3.
Yes, this is the idea (or until m3mt-1 is not used by any Machine).
My use of the "garbage collection" term was only about deleting m3mt-1 once it stopped being used.
Would this be possible? If not, how do you handle rollbacks? What if the rollout gets stuck?
I would definitely not propose to delete m3mt-1 before the rollout with m3mt-2 is done.
3. Have you considered using a ClusterClass? I think this could help with the issue but we need to implement it first ([Implement the ClusterClass API #1267](https://github.com/metal3-io/cluster-api-provider-metal3/issues/1267)). It could also introduce more issues though... With a ClusterClass you would also need to switch from one version to the next and I would always expect to have the previous ClusterClass available at least during the transition, since I would not want to switch all clusters at the same time. In other words, this issue seems generic and I would not expect to be able to solve it completely in CAPM3.
I guess this would be one option.
The tooling we built for Sylva was designed with the short-term need to fully support baremetal scenarios with Metal3 in mind, with the fact that no work had been started on ClusterClass support in Metal3, and with the feeling that the kind of manifest customization that ClusterClass offers would require us to write runtime extensions for many, many things. We thus opted not to try ClusterClass.
I'm actually wondering if a solution would not be to ensure that the Metal3Machine keeps track of the nodeReuse intention expressed in the template, for instance by having a spec.nodeReuse field (?).
This sounds worth investigating. It would be good to hear from those that designed the current API if this was considered already, and if it was, why it was not implemented this way.
I would definitely not propose to delete m3mt-1 before the rollout with m3mt-2 is done.
I'm confused. I thought this is what you already do and why you created the issue? If you can avoid doing this then what is the issue?
I would definitely not propose to delete m3mt-1 before the rollout with m3mt-2 is done.
I'm confused. I thought this is what you already do and why you created the issue? If you can avoid doing this then what is the issue?
Sorry for being confusing. I mixed up what I would like (to be able to delete m3mt-1) with what we are constrained to do today (we have to preserve it), and some confusion comes from the distinction between "an API request is made to delete the object" and "the object is actually removed" (all finalizers are cleared).
Our goal would be the following: on an upgrade, create m3mt-2 from which new m3m will be built, and delete m3mt-1 (*); we're assuming here that we have a tool (for instance a GitOps tool, or Helm) that is able to produce the earlier versions of all resources, and if a rollback is wanted, it will be handled by that tool, which will produce a Metal3MachineTemplate with the same specs as m3mt-1.
(*): what does "delete m3mt-1" do? That would depend on how Metal3 evolves to allow that...
Sorry again for making this more complex than it has to be ;)
Makes sense! Thanks for explaining! I wonder if it would be worth creating an issue in CAPI regarding the finalizer. :thinking: That would be the most logical place to handle it. (Adding the nodeReuse field to Metal3Machines is still worth considering here)
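As a sketch of the field idea mentioned above: the excerpt below shows what a hypothetical spec.nodeReuse field on the Metal3Machine could look like. It is not part of the current CAPM3 API (where nodeReuse lives on the Metal3MachineTemplate); it only illustrates copying the intent onto the machine at clone time so that deletion no longer depends on the template still existing.

```go
package sketch

// Hypothetical excerpt only: a nodeReuse field carried by the Metal3Machine
// itself. This field does NOT exist in the current CAPM3 API.
type Metal3MachineSpec struct {
	// ... existing fields elided ...

	// NodeReuse records, at creation time, whether the BareMetalHost backing
	// this machine should be kept and reused when the machine is deleted.
	// +optional
	NodeReuse bool `json:"nodeReuse,omitempty"`
}
```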
Yes, this is how we intend to use nodeReuse: along with disabling automated cleaning. The use case is explicitly to preserve PV data.
Interesting, how are you using the disable-automated-cleaning feature then? AFAICT it also uses the M3MT to sync the value with the M3M.
The infrastructure provider (CAPM3) is not normally interacting with the InfraMachineTemplate (M3MT). The fact that CAPM3 checks nodeReuse from it is an exception. Normally it is just CAPI that reads the InfraMachineTemplate and creates InfraMachines from it. This indicates to me that we need to consider carefully before we introduce more "unusual" behavior. Perhaps we should also consider removing the check we have for nodeReuse.
I don't agree with that; then what is the use of https://github.com/metal3-io/cluster-api-provider-metal3/blob/321b813dd3684950282b3b6bb5e0382b9c043d9a/baremetal/metal3machinetemplate_manager.go#L90-L100 in the CAPM3 code? I mean, we already have that kind of unusual behavior, as you describe, elsewhere in the code base.
I'm actually wondering if a solution would not be to ensure that the Metal3Machine keeps track of the nodeReuse intention expressed in the template, for instance by having a spec.nodeReuse field (?).
This sounds worth investigating. It would be good to hear from those that designed the current API if this was considered already, and if it was, why it was not implemented this way.
I don't recall the exact specifics, but I believe the M3MT was considered the source of truth since it is the object that is created first and cloned to create the M3M afterward, meaning the intention was to have the user's intent (the nodeReuse field) set right at the higher-level object.
The infrastructure provider (CAPM3) is not normally interacting with the InfraMachineTemplate (M3MT). The fact that CAPM3 checks nodeReuse from it is an exception. Normally it is just CAPI that reads the InfraMachineTemplate and creates InfraMachines from it. This indicates to me that we need to consider carefully before we introduce more "unusual" behavior. Perhaps we should also consider removing the check we have for nodeReuse.
I don't agree with that; then what is the use of [...] in the CAPM3 code? I mean, we already have that kind of unusual behavior, as you describe, elsewhere in the code base.
To be honest, to me this feels like abusing the API. We are treating the M3MT more like a Deployment, which it is not intended to be. Good that you brought it up though! I think we should consider carefully if this is something that should be changed in the future.
I'm actually wondering if a solution would not be to ensure that the Metal3Machine keeps track of the nodeReuse intention expressed in the template, for instance by having a spec.nodeReuse field (?).
This sounds worth investigating. It would be good to hear from those that designed the current API if this was considered already, and if it was, why it was not implemented this way.
I don't recall the exact specifics, but I believe the M3MT was considered the source of truth since it is the object that is created first and cloned to create the M3M afterward, meaning the intention was to have the user's intent (the nodeReuse field) set right at the higher-level object.
Still sounds strange to me since the "normal flow" would anyway take the source of truth (M3MT) and use it when creating the M3M. The only reason to go back to the template would be if the value changes on either the M3M or M3MT. These kinds of changes are generally not allowed though (instead a new M3MT would be needed), so it is strange that we have implemented it this way.
Yes, this is how we intend to use nodeReuse: along with disabling automated cleaning. The use case is explicitly to preserve PV data.
Interesting, how are you using the disable-automated-cleaning feature then? AFAICT it also uses the M3MT to sync the value with the M3M.
Well, we are only at a stage where we want to ensure that all the pieces are aligned to allow using nodeReuse and disabling automated cleaning. I hadn't identified that automatedCleaning was currently implemented with the same reliance on the M3MT. If so, then it's worth retitling this issue.
(For automated cleaning, my understanding is that we can also benefit from it by setting it directly in the BMH resource, although this comes with some implications.)
Still sounds strange to me since the "normal flow" would anyway take the source of truth (M3MT) and use it when creating the M3M. The only reason to go back to the template would be if the value changes on either the M3M or M3MT.
That was a design decision agreed on by the community in the past, and of course it might not be ideal; it is hard to consider all use cases in advance, and node reuse had to take a lot of them into account due to upgrade scenarios.
These kinds of changes are generally not allowed though (instead a new M3MT would be needed), so it is strange that we have implemented it this way.
Strange, but why? That is the case with all metal3-objectX-template objects in the project, where we patch them in one way or another.
(For automated cleaning, my understanding is that we can also benefit from it by setting it directly in the BMH resource, although this comes with some implications.)
Yes, that is right. For that feature, you can set it directly in the BMH and it should take precedence over other options.
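For illustration, a minimal sketch of setting this directly on the BareMetalHost from Go, using the baremetal-operator API types; the helper name and the patch flow are assumptions made for the sketch, not code from either project:

```go
package sketch

import (
	"context"

	bmov1alpha1 "github.com/metal3-io/baremetal-operator/apis/metal3.io/v1alpha1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// disableCleaning sets automatedCleaningMode directly on a BareMetalHost,
// which, as noted above, takes precedence over the value coming from the
// Metal3MachineTemplate/Metal3Machine.
func disableCleaning(ctx context.Context, c client.Client, key client.ObjectKey) error {
	bmh := &bmov1alpha1.BareMetalHost{}
	if err := c.Get(ctx, key, bmh); err != nil {
		return err
	}
	patch := client.MergeFrom(bmh.DeepCopy())
	bmh.Spec.AutomatedCleaningMode = bmov1alpha1.CleaningModeDisabled
	return c.Patch(ctx, bmh, patch)
}
```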
Still sounds strange to me since the "normal flow" would anyway take the source of truth (M3MT) and use it when creating the M3M. The only reason to go back to the template would be if the value changes on either the M3M or M3MT.
That was a design decision agreed on by the community in the past, and of course it might not be ideal; it is hard to consider all use cases in advance, and node reuse had to take a lot of them into account due to upgrade scenarios.
I understand, and I don't suggest changing it without thinking things through. All I'm saying is that there is a risk when we do things in "weird" ways. I think it would be worth considering alternatives, especially since there were unforeseen consequences that came from this design. Perhaps we could find a much better long term implementation now that we know more :slightly_smiling_face:
These kinds of changes are generally not allowed though (instead a new M3MT would be needed), so it is strange that we have implemented it this way.
Strange, but why? That is the case with all metal3-objectX-template objects in the project, where we patch them in one way or another.
In general, changes cannot be applied without a rolling update to the Machines. This is why the general rule is that the templates are immutable and should be updated by adding a new template and rolling over to it. However, CAPI does allow exceptions, so we are not directly breaking the contract here. It does complicate things though, and makes it much harder to reason about the state of the system, which can easily lead to more bugs.
Ref: https://cluster-api.sigs.k8s.io/tasks/updating-machine-templates
In general, changes cannot be applied without a rolling update to the Machines. This is why the general rule is that the templates are immutable and should be updated by adding a new template and rolling over to it. However, CAPI does allow exceptions, so we are not directly breaking the contract here. It does complicate things though, and makes it much harder to reason about the state of the system, which can easily lead to more bugs.
Yes.
Note that https://cluster-api.sigs.k8s.io/tasks/updating-machine-templates gives a kind of framework for the exceptions to this loose rule:
Some infrastructure providers may, at their discretion, choose to support in-place modifications of certain infrastructure machine template fields. This may be useful if an infrastructure provider is able to make changes to running instances/machines, such as updating allocated memory or CPU capacity. In such cases, however, Cluster API will not trigger a rolling update.
I think the idea is that the things that are relevant to make mutable in the template are the things that the infra provider can apply to existing machines without a rolling update (like memory or CPU allocation for a virtual machine).
Let me push an update here.
In the context of Sylva, while working on rolling updates, we realized that:
The conclusion for us is that we're going to ensure that all these resources (infra machine template and bootstrap provider config resources) are not deleted when we update the CAPI+providers resources.
Thorough inventory work would be required to see how the different infra providers behave, and Kubeadm... but overall, with this kind of MachineSet controller behavior in mind, it seems like the wise path is to assume that XXXMachineTemplates need to be around for quite a while, and maybe not to try a "spot fix" here in Metal3 that holds off the actual m3mt deletion with a finalizer.
Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale.
Stale issues will close after an additional 30d of inactivity. If this issue is safe to close now please do so with /close.
/lifecycle stale
/remove-lifecycle stale
This is going back and forth between stale and active, so I will just label it as frozen. Feel free to continue the discussion; when an implementation PR is pushed for this ticket I will remove the label. /lifecycle-frozen
Context: controller needs the Metal3MachineTemplate to determine nodeReuse on Metal3Machine deletion
Today, when a Metal3Machine is deleted, the Metal3Machine controller tries to fetch the Metal3MachineTemplate that was used to create it, and if it finds it, uses it to determine whether node reuse is desired (reading Metal3MachineTemplate.spec.nodeReuse).
https://github.com/metal3-io/cluster-api-provider-metal3/blob/9e5cf69e8e225801867df9940ca0f711b02cdd20/baremetal/metal3machine_manager.go#L588-L606
Implications on how cluster manifests are managed
If the tool used to manage Metal3Machine and Metal3MachineTemplate resources does garbage collection, it is possible that it would remove a Metal3MachineTemplate before the Metal3Machine using it is removed. The result is that nodeReuse can't reliably be used.
We've hit this issue in the context of the Sylva project, where we use a Helm chart to produce both Metal3Machine and Metal3MachineTemplate resources (along with other CAPI resources), and we use dynamic names for Metal3MachineTemplates to compensate for their immutability and to ensure that a rolling upgrade is triggered when the content of the template changes. What happens is that, on a change of Metal3MachineTemplate content, the old Metal3MachineTemplate is removed prior to the rolling upgrade, so by the time a Metal3Machine is deleted the template no longer exists, and nodeReuse is considered disabled.
Discussion
We can have a workaround by disabling automatic garbage collection and adding our own GC tool, aware of this use of Metal3MachineTemplates by Metal3Machines, to do smarter garbage collection after a rolling upgrade.
This is however not elegant, requires additional tooling, and Metal3 users need to know about this issue to be able to avoid it.
Would it be possible to improve the controller behavior to avoid this?
In particular: could a "nodeReuse" annotation set directly on a Metal3Machine at creation time be used if the controller can't find the Metal3MachineTemplate? Same for automatedCleaningMode.
(Added on 2023-11-28)
During the discussion around the initial description of this issue, it came up that there is the same design and possible code evolution question about automatedCleaningMode.