Open jdef opened 1 year ago
other recent adoption fixes #7661 #7653
cc @sbueringer
Thanks for reporting this @jdef! The recent ownerReference fixes were based on the objects collected by `clusterctl move`. I think we need to consider whether this lock should be moved to a new cluster, or whether this issue should be solved independently.
FYI I'm also talking to the velero team asking them to restore ownerRefs from the backup instead of dropping them, which would solve the real root cause of all these issues. It would really help if they start getting reports from end users with the same ask.
@fabriziopandini (while you're here :smile:) do you think the mutex lock should be added to the objectGraph for move etc.?
Just to avoid misunderstanding, with cluster mutex are you referring to the init lock ConfigMap that CABPK uses to ensure only one control plane instance runs `kubeadm init`? In that case, if I'm not wrong, the ConfigMap already has an ownerRef to the cluster when created, and the fact that it is included in the restore somehow proves it.
The problem is that velero intentionally drops owner references when restoring data (in fact velero only partially restores data), and this IMO is not ideal for the following reasons:
All those problems could be avoided if instead velero restores (rebuilds) the ownerReferences existing in the original objects
That's what I meant by fixing the root cause vs dealing with the issues caused by it.
Going back to this specific issue, I personally would prefer to avoid the complexity of implementing, in the CAPI controllers, the logic to rebuild the ownerRef if it is missing, given that the mutex ConfigMap is short-lived.
yes, the init lock configmap created by CABPK
our process for backup/restore is (loosely):
(a) pause CAPI cluster reconciliation (via `spec.paused`)
(b) back things up
(c) create new ns in new mgmt cluster
(d) restore paused CAPI objects to new ns
(e) unpause things
we've generally had success w/ this approach, other than ownership not being properly restored. recent CAPI releases have mostly fixed that, w/ the exception noted in this bug report.
i'm not opposed to a velero-based fix, though i'm left wondering who's going to be left holding the hot potato...
Given that we resolved all the other findings that we had with ownerRefs, I would be in favor of resolving this one as well. I think for Cluster API users it's not ideal to fix all but one occurrence of the problem.
I might be misreading the code in `control_plane_init_mutex.go`, but it looks to me like the fix is relatively simple.
> Given that we resolved all the other findings that we had with ownerRefs, I would be in favor of resolving this one as well. I think for Cluster API users it's not ideal to fix all but one occurrence of the problem.

Agreed in general, just wondering why this isn't included in `clusterctl move` and whether it should be.
Could it be that the lock ConfigMap only exists temporarily?
Not sure but it looks like `Lock` creates the ConfigMap and `Unlock` deletes it. So when the lock is currently not held by anyone the ConfigMap shouldn't be there.
So I think to actually move it with `clusterctl move` you have to hit the right moment / the lock must be in a deadlock.
AFAICT `Unlock` is only called if the machine named in the lock isn't found; or am I missing something obvious?
🤦♂️ thanks
aside from the Unlock scenario I mentioned earlier... quickly skimming, the unlock in `handleClusterNotInitialized` is invoked conditionally (if an error happens while the lock is held). if no errors occur in this helper func after `Lock` is invoked, then `Unlock` is not called here.
in `kubeadm_controller.go:Reconcile` - possible that `Unlock` is never reached (because the kubeadm config has already been populated ahead of time w/ join config)? more specifically, i'm looking at the preceding `switch` statement and possible early returns. our internal controllers (we don't use `clusterctl` or the newer cluster templating magic) always populate `JoinConfiguration`.
our cluster spinup process consistently results in an indefinitely held lock once the cluster is up. so i wonder if we're spinning up resources w/ configs outside the well-travelled runtime path?
EDIT
in other words, if the intent is that the lock is normally intended to be "short term" (outside of deadlock scenarios) then perhaps there's a bad assumption somewhere in the reconciliation path here w/ respect to order-of-operations. NOTE: even if that were the case, and the intention is that the lock is short-lived, it still seems like there's a chance that one could pause cluster reconciliation, backup-then-restore, and still end up w/ a short-term lock that requires proper adoption (but i realize both that the chances are slim and that this may not be a super-interesting scenario for most folks).
maybe someone w/ knowledge of the original intent here (w/ respect to lock lifecycle) can comment in order to clarify?
> our cluster spinup process consistently results in an indefinitely held lock once the cluster is up. so i wonder if we're spinning up resources w/ configs outside the well-travelled runtime path?

This doesn't sound good - what version of CAPI are you using? Is there a way to reproduce? How do you ordinarily unlock it if the lock is held indefinitely?
Looking at the code I would assume that whenever a node is joined (2nd control plane node or workers) we hit this code: https://github.com/kubernetes-sigs/cluster-api/blob/f2ba999757aa76b446943a9ffea14ecf8c7523e8/bootstrap/kubeadm/internal/controllers/kubeadmconfig_controller.go#L298-L304
> our cluster spinup process consistently results in an indefinitely held lock once the cluster is up. so i wonder if we're spinning up resources w/ configs outside the well-travelled runtime path?

> This doesn't sound good - what version of CAPI are you using? Is there a way to reproduce? How do you ordinarily unlock it if the lock is held indefinitely?
1.2.7, as stated in the desc
EDIT
repro: not in the "open" - our codebase is not OSS and legal would not be very happy w/ me for sharing code. that said, internally, we can repro w/ every cluster spinup we successfully execute.
EDIT 2
ordinary unlock: apparently this doesn't happen, and the lock survives indefinitely
> Looking at the code I would assume that whenever a node is joined (2nd control plane node or workers) we hit this code:
our test clusters are singletons (single-node clusters). lemme double-check that the locks are also held indefinitely on our prod, multi-node clusters
@sbueringer good insight - our multi-machine clusters do not have an indefinitely-held lock, only the singleton clusters
I don't want to block if the fix is small.
But as far as I'm concerned we fixed ownerRef management across the code base to tackle the use case "someone accidentally deletes an ownerRef CAPI relies on". In this specific case, this is not necessary because the object is ephemeral, CAPI will just work no matter if the ownerRef exists or not (the information isn't used by the system), and the object garbage collection is ensured by Unlock.
velero use case is different because they intentionally drop all the owner refs in a single operation, and this behavior is problematic not only because of the magnitude/concurrency of occurrences of the issue but also because it impacts not only CAPI but all the controllers existing in the cluster.
This is why I'm advocating for more people to raise this concern with the velero team in order to fix the root cause of the problem; otherwise, everyone else will be forced to apply patches for this behavior, and this approach - everyone fixing their own problem - is usually much worse in terms of ownership, robustness, complexity, etc.
/triage accepted
Definitely +1 on bringing it up with the Velero team.
If the cluster stays locked (could happen for various reasons, e.g. init takes a long time / doesn't go through) the ConfigMap/lock will exist permanently. I think the ConfigMap is only ephemeral on the happy path.
It seems to me that there should be an issue in combination with `clusterctl move`. After `clusterctl move` (assuming the ownerRef to the lock got lost either accidentally or through Velero) we are losing the ConfigMap and thus the lock, and interesting things could happen.
/help
@killianmuldoon: This request has been marked as needing help from a contributor.
/priority important-longterm
What steps did you take and what happened:
(a) velero-backup a namespace w/ CAPI-managed clusters/resources
(b) delete the namespace
(c) velero-restore the namespace w/ the CAPI-managed clusters/resources
(d) observe that the configmap mutex is not properly adopted by the newly restored Cluster object
What did you expect to happen: proper adoption of the configmap by the Cluster so that subsequent "uninstallation" of the Cluster results in proper GC of the mutex configmap
Anything else you would like to add: a bunch of other adoption issues were fixed recently, looks like this was an oversight
see pkg `bootstrap/kubeadm/internal/locking`: the `setMetadata` func sets initial ownership, but thereafter ownership lingers unchanged, indefinitely.

Environment:
- Cluster API version: 1.2.7
- Kubernetes version (use `kubectl version`):
- OS (e.g. from `/etc/os-release`):

/kind bug