Open Oats87 opened 10 months ago
This issue is currently awaiting triage.
If CAPI contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance. The triage/accepted label can be added by org members by writing /triage accepted in a comment.
I think the current recommended way to do this is to use the paused annotation - what are the limitations of this that make it a less than ideal solution?
> I think the current recommended way to do this is to use the paused annotation - what are the limitations of this that make it a less than ideal solution?

I'm assuming the limitation is the requirement to set the paused annotation on all objects rather than, e.g. just the Cluster object?
@Oats87 You say:

> it is desirable to be able to run the conversion webhook server while "paving" the objects into place

Can you expand on that?
> I think the current recommended way to do this is to use the paused annotation - what are the limitations of this that make it a less than ideal solution?

While using the paused annotation is certainly a solution, "in the trenches" it can be pretty difficult to guarantee the paused annotation is set on the object during restoration.
There are a few cases where a user may be trying to pave objects back into place; off the top of my head:
I believe Velero has a method to inject JSON patches into the objects being restored, which would allow for adding the annotation, but I think you'd have to manually unpause all objects after performing the restoration.
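As a sketch of that JSON-patch approach (this assumes Velero's resource modifiers feature, available in newer Velero releases; the group/resource selection here is illustrative):

```yaml
# ConfigMap data passed via `velero restore create --resource-modifier-configmap ...`
version: v1
resourceModifierRules:
- conditions:
    groupResource: machinesets.cluster.x-k8s.io
  patches:
  - operation: add
    path: "/metadata/annotations/cluster.x-k8s.io~1paused"
    value: "true"
```

The `~1` is JSON Pointer escaping for the `/` inside the annotation key. As noted above, every object patched this way would still need to be manually unpaused once the restore completes.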
> Can you expand on that?
Sure. In my specific case we need to be able to perform an object-level restoration (etcd restoration is not possible for us in managed Kubernetes providers such as EKS/AKS/GKE), which mandates the need to restore objects into place. I say "pave" because we're taking a fresh cluster and restoring, i.e. paving, objects into it.

If the object's apiVersion is at the "pivot" version of the CRD, i.e. the stored version, the conversion webhook is not necessary; if the object being restored requires conversion, the conversion webhook server must be running in order to perform the restoration.
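The stored-version rule above can be sketched as a tiny predicate (the helper name is hypothetical; the rule itself is just the comparison described above):

```go
package main

import "fmt"

// needsConversion reports whether restoring an object at objVersion requires
// the conversion webhook to be reachable, given the CRD's stored ("pivot")
// version: only objects already at the stored version can skip conversion.
func needsConversion(objVersion, storedVersion string) bool {
	return objVersion != storedVersion
}

func main() {
	// An older-version backup needs the webhook; a stored-version backup does not.
	fmt.Println(needsConversion("v1alpha3", "v1beta1")) // true
	fmt.Println(needsConversion("v1beta1", "v1beta1")) // false
}
```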
What happens if you run the CAPI controllers with 0 controller replicas?
> What happens if you run the CAPI controllers with 0 controller replicas?
Then you wouldn't be running the conversion webhooks either because they're both under the same manager.
Oh sorry. What I meant was to run the CAPI controllers with 0 concurrency (you still run a Pod but without any controller workers)
> Oh sorry. What I meant was to run the CAPI controllers with 0 concurrency (you still run a Pod but without any controller workers)
Testing v1.4.4, I see that it defaults to worker count=1 if `--<controller>-concurrency=0`, i.e.:

```yaml
<snip>
spec:
  containers:
  - args:
    - --leader-elect
    - --metrics-bind-addr=localhost:8080
    - --feature-gates=MachinePool=false,ClusterResourceSet=false,ClusterTopology=false,RuntimeSDK=false,LazyRestmapper=false
    - --cluster-concurrency=0
    command:
    - /manager
    env:
<snip>
```
results in:

```
I0912 17:12:11.918417 1 controller.go:228] "Starting workers" controller="cluster" controllerGroup="cluster.x-k8s.io" controllerKind="Cluster" worker count=1
```
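That log line matches controller-runtime's defaulting: a non-positive MaxConcurrentReconciles is bumped to 1 before workers start, which is why `--cluster-concurrency=0` still yields one worker. A minimal sketch of that defaulting rule (not the actual controller-runtime source):

```go
package main

import "fmt"

// effectiveWorkerCount mirrors the defaulting behavior observed above:
// a non-positive concurrency setting still produces one worker.
func effectiveWorkerCount(maxConcurrentReconciles int) int {
	if maxConcurrentReconciles <= 0 {
		return 1
	}
	return maxConcurrentReconciles
}

func main() {
	fmt.Println(effectiveWorkerCount(0)) // 1, matching "worker count=1"
}
```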
So the best workaround as of now is probably to set the watch-filter value to some selector that doesn't match anything (like you listed above), which should be a supported use case of that flag based on its description:

```go
fs.StringVar(&watchFilterValue, "watch-filter", "",
	fmt.Sprintf("Label value that the controller watches to reconcile cluster-api objects. Label key is always %s. If unspecified, the controller watches for all cluster-api objects.", clusterv1.WatchLabel))
```
Basically, "If you want to reconcile nothing, set the filter to match nothing"
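The flag's documented semantics can be sketched as a label-match predicate (the helper name is hypothetical; the label key is CAPI's `clusterv1.WatchLabel`, i.e. `cluster.x-k8s.io/watch-filter`):

```go
package main

import "fmt"

// WatchLabel is the label key CAPI controllers filter on.
const WatchLabel = "cluster.x-k8s.io/watch-filter"

// shouldReconcile sketches the flag's semantics: an empty filter watches
// everything; otherwise only objects labeled with the filter value match.
func shouldReconcile(labels map[string]string, watchFilterValue string) bool {
	if watchFilterValue == "" {
		return true
	}
	return labels[WatchLabel] == watchFilterValue
}

func main() {
	// With a filter value no object carries, nothing gets reconciled.
	fmt.Println(shouldReconcile(map[string]string{}, "match-nothing")) // false
}
```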
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:
- After a period of inactivity, lifecycle/stale is applied
- After a further period of inactivity once lifecycle/stale was applied, lifecycle/rotten is applied
- After a further period of inactivity once lifecycle/rotten was applied, the issue is closed

You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:
- After a period of inactivity, lifecycle/stale is applied
- After a further period of inactivity once lifecycle/stale was applied, lifecycle/rotten is applied
- After a further period of inactivity once lifecycle/rotten was applied, the issue is closed

You can:
- Mark this issue as fresh with /remove-lifecycle rotten
- Close this issue with /close

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten
/lifecycle frozen
/priority backlog
/triage accepted
Looking for ways to improve how we recover from disasters is always valuable. However, I'm debating whether a long-term solution for this issue should be implemented in controller-runtime - thus standardizing this across all controllers - or whether it should be implemented in CAP* by skipping setupReconcilers.
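The CAP*-side option could look like gating reconciler setup behind a flag while still starting the webhook server - a rough sketch in which the flag and function names are hypothetical, not CAPI's actual main.go:

```go
package main

import (
	"flag"
	"fmt"
)

// startComponents returns the components the manager would start.
// Webhooks (including conversion) always run; reconcilers are optional.
func startComponents(skipReconcilers bool) []string {
	components := []string{"webhooks"}
	if !skipReconcilers {
		components = append(components, "reconcilers")
	}
	return components
}

func main() {
	// Hypothetical flag: serve webhooks but start no controller workers.
	skip := flag.Bool("skip-reconcilers", false, "serve webhooks but start no controllers")
	flag.Parse()
	fmt.Println(startComponents(*skip))
}
```

With such a flag, an operator could restore objects while conversion stays available, then restart the manager normally once the restore completes.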
What would you like to be added (User Story)?
As an operator, I would like to be able to manipulate cluster-api objects in certain situations without having them immediately reconciled by the capi-controller-manager.

Detailed Description
Today, the conversion webhooks are nested within the capi-controller-manager. This proves problematic in scenarios where the management cluster is being restored from an object-level backup using a tool like https://github.com/rancher/backup-restore-operator or https://velero.io/, as the conversion webhook on the CRD will fail if the webhook endpoint is not available.

During these restoration operations, it is desirable to be able to run the conversion webhook server while "paving" the objects into place, but not have immediate reconciliation actions take place on them until the restoration is complete. I am thinking, for example, of the case where a MachineSet has child Machines: if the MachineSet object is restored while controllers are running, it could spawn new Machine objects that have not yet been restored.

There are a few workarounds to this, namely:
- setting the cluster.x-k8s.io/paused=true annotation on the objects
- running the controllers with 0 concurrency/replicas (not tested but theoretically this works?)

but all of them are less than ideal solutions for a "robust" DR strategy.
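For reference, the first workaround is just an annotation on each restored object; e.g. on a Cluster (the name here is hypothetical):

```yaml
apiVersion: cluster.x-k8s.io/v1beta1
kind: Cluster
metadata:
  name: my-cluster   # hypothetical
  annotations:
    cluster.x-k8s.io/paused: "true"
```

The difficulty discussed in this thread is guaranteeing that this annotation is present on every object at the moment it is paved back into place.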
Anything else you would like to add?
No response
Label(s) to be applied
/kind feature
/area provider/core