kubernetes-sigs / cluster-api-provider-kubevirt

Cluster API Provider for KubeVirt
Apache License 2.0

Remove vmi/vm watch and reduce default sync time to 90 seconds #216

Closed davidvossel closed 1 year ago

davidvossel commented 1 year ago

Issues #100 and #78 have one thing in common: they both involve problems derived from tracking VM/VMIs on external infra.

So far, we've been treating external infra as an advanced use case, not the default one. The result is that external infra has been treated as a second-class citizen relative to the use case where the capi components and the VM/VMIs are running on the same k8s cluster.

I'd like to return to a single code path that satisfies both the use case where the capk/capi components run on the same cluster as the VM/VMIs (centralized infra use case) and the use case where the controller components run on a separate cluster from the VM/VMIs (external infra use case).

To achieve this, we can't assume that the VM/VMI objects are even registered on the same cluster as the capi/capk controllers, which means we can't watch these objects by default using the default in-cluster config. I propose we return to depending on regularly syncing the KubeVirtMachine and KubeVirtCluster objects in order to pick up VM/VMI changes (basically polling). For polling to be responsive enough to pick up things like IP changes after a VM reboot, I think we should lower the default polling interval to 60 seconds.

Reduce default sync time to 90 seconds for reconcile loops
k8s-ci-robot commented 1 year ago

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: davidvossel

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

- ~~[OWNERS](https://github.com/kubernetes-sigs/cluster-api-provider-kubevirt/blob/main/OWNERS)~~ [davidvossel]

Approvers can indicate their approval by writing `/approve` in a comment. Approvers can cancel approval by writing `/approve cancel` in a comment.
davidvossel commented 1 year ago

/ok-to-test

coveralls commented 1 year ago

Pull Request Test Coverage Report for Build 3856719842


Totals Coverage Status
Change from base Build 3599569808: 1.2%
Covered Lines: 952
Relevant Lines: 1808

💛 - Coveralls
qinqon commented 1 year ago

I think we can do both, so we get responsiveness and we don't miss anything. What do you think?

davidvossel commented 1 year ago

> I think we can do both, so we get responsiveness and we don't miss anything. What do you think?

How would you do this?

qinqon commented 1 year ago

> > I think we can do both, so we get responsiveness and we don't miss anything. What do you think?
>
> How would you do this?

You have a pair of "knobs":

- the VM/VMI watches, which give responsiveness
- the periodic resync interval, which ensures we don't miss events

I think with both you can have both features.

davidvossel commented 1 year ago

> I think with both you can have both features.

The problem is that we need the capk controller to be able to function when the VM/VMI objects are not registered in the infra cluster. The watches fail when there are no VM/VMI CRDs registered.

I could dynamically detect whether VM/VMIs are present at launch and only watch if they are, but I'd rather have a single code path to test rather than multiple (one when the CRDs are available, one when they are not). So just using a reduced sync period with no watches seems to satisfy that requirement (at the cost of efficiency, which might be okay for us at our current scale).

qinqon commented 1 year ago

I think with both you can have both features.

The problem is that we need the capk controller to be able to function when the VM/VMI objects are not registered in the infra cluster. The watches fail when there are no VM/VMI crds registered.

I could dynamically detect of vm/vmis are present at launch, and only watch if they are, but i'd rather have a single code path to test rather than multiple (one when crds are available, one when they are not). So just using a reduced sync period with no watches seems to satisfy that requirement (at the cost of efficiency, which might be okay for us at our current scale)

An alternative is to run the polling for external clusters and, from its results, enqueue requests to the controller so they enter the Reconcile function; this way we don't have to remove the watchers.

qinqon commented 1 year ago

/lgtm

Let's hope we don't hit the control plane too much.