Open jdef opened 1 year ago
other recent adoption fixes #7661 #7653
cc @sbueringer
Thanks for reporting this @jdef! The recent ownerReference fixes were based on the objects collected by `clusterctl move`. I think we need to consider whether this lock should be moved to a new cluster, or whether this issue should be solved independently.
FYI I'm also talking to the velero team asking them to restore ownerRefs from the backup instead of dropping them, which would solve the real root cause of all these issues. It would really help if they start getting reports from end users with the same ask.
@fabriziopandini (while you're here :smile:) do you think the mutex lock should be added to the objectGraph for move etc.?
Just to avoid misunderstanding, with cluster mutex are you referring to the init lock ConfigMap that CABPK uses to ensure only one control plane instance runs `kubeadm init`? In that case, if I'm not wrong, the ConfigMap already has an ownerRef to the cluster when created, and the fact that it is included in the restore somehow proves it.
The problem is that velero intentionally drops owner references when restoring data (in fact velero only partially restores data), and this IMO is not ideal for the following reasons:
All those problems could be avoided if instead velero restores (rebuilds) the ownerReferences existing in the original objects
That's what I meant by fixing the root cause vs dealing with the issues caused by it.
Going back to this specific issue, I personally would prefer to avoid the complexity of implementing, in the CAPI controllers, the logic to rebuild the ownerRef if it is missing, given that the mutex ConfigMap is short-lived.
yes, the init lock configmap created by CABPK
our process for backup/restore is (loosely):
(a) pause CAPI cluster reconciliation (via `spec.paused`)
(b) back things up
(c) create new ns in new mgmt cluster
(d) restore paused CAPI objects to new ns
(e) unpause things
we've generally had success w/ this approach, other than ownership not being properly restored. recent CAPI releases have mostly fixed that, w/ the exception noted in this bug report.
i'm not opposed to a velero-based fix, though i'm left wondering who's going to be left holding the hot potato...
Given that we resolved all the other findings that we had with ownerRefs, I would be in favor of resolving this one as well. I think for Cluster API users it's not ideal to fix all but one occurrence of the problem.
I might be misreading the code in `control_plane_init_mutex.go`, but it looks to me like the fix is relatively simple.
> Given that we resolved all the other findings that we had with ownerRefs, I would be in favor of resolving this one as well. I think for Cluster API users it's not ideal to fix all but one occurrence of the problem.

Agreed in general, just wondering why this isn't included in `clusterctl move` and whether it should be.
Could it be that the lock ConfigMap only exists temporarily?
Not sure but it looks like `Lock` creates the ConfigMap and `Unlock` deletes it. So when the lock is currently not held by anyone the ConfigMap shouldn't be there.
So I think to actually move it with `clusterctl move` you have to hit the right moment / the lock must be in a deadlock.
AFAICT `Unlock` is only called if the machine named in the lock isn't found; or am I missing something obvious?
🤦♂️ thanks
aside from the Unlock scenario I mentioned earlier... quickly skimming, the unlock in `handleClusterNotInitialized` is invoked conditionally (if an error happens while the lock is held). if no errors occur in this helper func after `Lock` is invoked, then `Unlock` is not called here.
in `kubeadm_controller.go:Reconcile` - possible that `Unlock` is never reached (because the kubeadm config has already been populated ahead of time w/ join config)? more specifically, i'm looking at the preceding `switch` statement and possible early returns. our internal controllers (we don't use `clusterctl` or the newer cluster templating magic) always populate `JoinConfiguration`.
our cluster spinup process consistently results in an indefinitely held lock once the cluster is up. so i wonder if we're spinning up resources w/ configs outside the well-travelled runtime path?
EDIT
in other words, if the intent is that the lock is normally intended to be "short term" (outside of deadlock scenarios) then perhaps there's a bad assumption somewhere in the reconciliation path here w/ respect to order-of-operations. NOTE: even if that were the case, and the intention is that the lock is short-lived, it still seems like there's a chance that one could pause cluster reconciliation, backup-then-restore, and still end up w/ a short-term lock that requires proper adoption (but i realize both that the chances are slim and that this may not be a super-interesting scenario for most folks).
maybe someone w/ knowledge of the original intent here (w/ respect to lock lifecycle) can comment in order to clarify?
> our cluster spinup process consistently results in an indefinitely held lock once the cluster is up. so i wonder if we're spinning up resources w/ configs outside the well-travelled runtime path?

This doesn't sound good - what version of CAPI are you using? Is there a way to reproduce? How do you ordinarily unlock it if the lock is held indefinitely?
Looking at the code I would assume that whenever a node is joined (2nd control plane node or workers) we hit this code: https://github.com/kubernetes-sigs/cluster-api/blob/f2ba999757aa76b446943a9ffea14ecf8c7523e8/bootstrap/kubeadm/internal/controllers/kubeadmconfig_controller.go#L298-L304
> our cluster spinup process consistently results in an indefinitely held lock once the cluster is up. so i wonder if we're spinning up resources w/ configs outside the well-travelled runtime path?

> This doesn't sound good - what version of CAPI are you using? Is there a way to reproduce? How do you ordinarily unlock it if the lock is held indefinitely?
1.2.7, as stated in the desc
EDIT
repro: not in the "open" - our codebase is not OSS and legal would not be very happy w/ me for sharing code. that said, internally, we can repro w/ every cluster spinup we successfully execute.
EDIT 2
ordinary unlock: apparently this doesn't happen, and the lock survives indefinitely
> Looking at the code I would assume that whenever a node is joined (2nd control plane node or workers) we hit this code:
our test clusters are singletons (single-node clusters). lemme double-check that the locks are also held indefinitely on our prod, multi-node clusters
@sbueringer good insight - our multi-machine clusters do not have an indefinitely-held lock, only the singleton clusters
I don't want to block if the fix is small.
But as far as I'm concerned we fixed ownerRef management across the code base to tackle the use case "someone accidentally deletes an ownerRef CAPI relies on". In this specific case, this is not necessary because the object is ephemeral, CAPI will just work no matter if the ownerRef exists or not (the information isn't used by the system), and the object garbage collection is ensured by Unlock.
velero use case is different because they intentionally drop all the owner refs in a single operation, and this behavior is problematic not only because of the magnitude/concurrency of occurrences of the issue but also because it impacts not only CAPI but all the controllers existing in the cluster.
This is why I'm advocating for more people to raise this concern with the velero team in order to fix the root cause of the problem; otherwise, everyone else will be forced to apply patches for this behavior, and this approach - everyone fixing their own problem - is usually much worse in terms of ownership, robustness, complexity, etc.
/triage accepted
Definitely +1 on bringing it up with the Velero team.
If the cluster stays locked (could happen for various reasons, e.g. init takes a long time / doesn't go through) the ConfigMap/lock will exist permanently. I think the ConfigMap is only ephemeral on the happy path.
It seems to me that there should be an issue in combination with `clusterctl move`. After `clusterctl move` (assuming the ownerRef to the lock got lost either accidentally or through Velero) we are losing the ConfigMap and thus the lock, and interesting things could happen.
/help
@killianmuldoon: This request has been marked as needing help from a contributor.
/priority important-longterm
What steps did you take and what happened:
(a) velero-backup a namespace w/ CAPI-managed clusters/resources
(b) delete the namespace
(c) velero-restore the namespace w/ the CAPI-managed clusters/resources
(d) observe that the configmap mutex is not properly adopted by the newly restored Cluster object
What did you expect to happen: proper adoption of the configmap by the Cluster so that subsequent "uninstallation" of the Cluster results in proper GC of the mutex configmap
Anything else you would like to add: a bunch of other adoption issues were fixed recently, looks like this was an oversight
see pkg `bootstrap/kubeadm/internal/locking`: the `setMetadata` func sets initial ownership, but thereafter ownership lingers unchanged, indefinitely.

Environment:
- Cluster API version: 1.2.7
- Kubernetes version (use `kubectl version`):
- OS (e.g. from `/etc/os-release`):

/kind bug