kubernetes-sigs / cluster-api-provider-vsphere

Apache License 2.0

Ensure reconcileLoadBalancedEndpoint Function Gets Called Despite Setting of ControlPlaneEndpoint #2872

Closed. silvery1622 closed this issue 2 weeks ago.

silvery1622 commented 6 months ago

/kind bug

What steps did you take and what happened: The current behavior of CAPV bypasses reconcileLoadBalancedEndpoint if the ControlPlaneEndpoint is set in either the Cluster or the VSphereCluster. However, undesirable scenarios may arise, such as the accidental deletion of the VirtualMachineService backing the cluster's control plane service, which leaves the ControlPlaneEndpoint unusable.

What did you expect to happen: When reconciling the VSphereCluster, CAPV should execute reconcileLoadBalancedEndpoint so that it can create the VirtualMachineService if necessary.

Anything else you would like to add: Related code: https://github.com/kubernetes-sigs/cluster-api-provider-vsphere/blob/main/controllers/vmware/vspherecluster_reconciler.go#L203
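
A minimal, self-contained sketch of the behavior described above (names and types are simplified stand-ins for the linked CAPV code, not the actual implementation):

```go
// Illustrative sketch only: names and types are simplified stand-ins for the
// CAPV code linked above, not the actual implementation.
package main

import "fmt"

type endpoint struct {
	Host string
	Port int32
}

func (e endpoint) IsZero() bool { return e.Host == "" && e.Port == 0 }

// shouldReconcileLoadBalancedEndpoint mirrors the reported behavior: as soon as
// a ControlPlaneEndpoint is set on either the Cluster or the VSphereCluster,
// reconcileLoadBalancedEndpoint is skipped entirely, so an accidentally deleted
// VirtualMachineService is never recreated.
func shouldReconcileLoadBalancedEndpoint(clusterEP, vsphereClusterEP endpoint) bool {
	return clusterEP.IsZero() && vsphereClusterEP.IsZero()
}

func main() {
	// The endpoint is already set, but the backing VirtualMachineService was
	// deleted: the current logic still returns false and never recreates it.
	fmt.Println(shouldReconcileLoadBalancedEndpoint(
		endpoint{Host: "192.168.0.10", Port: 6443}, endpoint{}))
}
```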

Environment:

sbueringer commented 6 months ago

Sounds good! Feel free to open a PR

zhanggbj commented 6 months ago

Good finding! For the record, the current workaround to recover the cluster:

- Manually create a VirtualMachineService and get the new accessible LoadBalancer IP address.
- Update VSphereCluster.Spec.ControlPlaneEndpoint with that IP address.
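
For illustration, a hedged sketch of these two steps using the dynamic client so no vm-operator/CAPV Go types are assumed; in practice this would usually be done with kubectl. The namespace, object names, API groups/versions, and port are placeholders, and the recreated VirtualMachineService manifest (not shown) should mirror the one CAPV originally created:

```go
// Hedged sketch of the manual workaround; not a supported recovery tool.
// Namespace, names, API group/version, and port are placeholders.
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/dynamic"
	ctrl "sigs.k8s.io/controller-runtime"
)

func main() {
	ctx := context.Background()
	dc := dynamic.NewForConfigOrDie(ctrl.GetConfigOrDie())

	// Step 1 (after applying the recreated VirtualMachineService manifest,
	// elided here): read back the LoadBalancer IP assigned to it.
	vmSvcGVR := schema.GroupVersionResource{
		Group: "vmoperator.vmware.com", Version: "v1alpha2", Resource: "virtualmachineservices"}
	svc, err := dc.Resource(vmSvcGVR).Namespace("my-ns").
		Get(ctx, "my-cluster-control-plane-service", metav1.GetOptions{})
	if err != nil {
		panic(err)
	}
	ingress, _, _ := unstructured.NestedSlice(svc.Object, "status", "loadBalancer", "ingress")
	if len(ingress) == 0 {
		panic("LoadBalancer IP not assigned yet")
	}
	ip, _, _ := unstructured.NestedString(ingress[0].(map[string]interface{}), "ip")
	fmt.Println("new LoadBalancer IP:", ip)

	// Step 2: point the VSphereCluster control plane endpoint at that IP.
	vsphereClusterGVR := schema.GroupVersionResource{
		Group: "vmware.infrastructure.cluster.x-k8s.io", Version: "v1beta1", Resource: "vsphereclusters"}
	patch := []byte(fmt.Sprintf(`{"spec":{"controlPlaneEndpoint":{"host":%q,"port":6443}}}`, ip))
	if _, err := dc.Resource(vsphereClusterGVR).Namespace("my-ns").
		Patch(ctx, "my-cluster", types.MergePatchType, patch, metav1.PatchOptions{}); err != nil {
		panic(err)
	}
}
```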

sbueringer commented 6 months ago

Wait, doesn't this change the controlplane endpoint? Does this actually recover the cluster? Aren't all our kubelets using this endpoint?

zhanggbj commented 6 months ago

Sorry, please ignore this... it works in the local env because the LoadBalancer IP doesn't change...

silvery1622 commented 6 months ago

After discussion with @zhanggbj, we feel we'd better block the deletion instead of recreating it.

sbueringer commented 6 months ago

Yup makes sense. I'm wondering how exactly :)

As this is a CRD owned by another controller, I'm not sure how we can/should block the deletion.

(we should probably talk to the vm-operator folks, IIRC the CRD comes from there)

The tricky part is how to distinguish between an intended and an accidental delete.

chrischdi commented 6 months ago

A different approach would be to make this issue easier to discover, e.g. by surfacing a condition on the VSphereCluster object saying that the VirtualMachineService no longer exists, and for this first iteration still require fixing it manually.
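
A rough sketch of what such a condition could look like, using the Cluster API conditions helpers; the condition type and reason names are made up for illustration and are not existing CAPV constants:

```go
// Illustrative sketch: condition type and reason names are made up, not
// existing CAPV constants.
package controllers

import (
	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
	"sigs.k8s.io/cluster-api/util/conditions"
)

const (
	// Hypothetical condition and reason for this scenario.
	LoadBalancerReadyCondition          clusterv1.ConditionType = "LoadBalancerReady"
	VirtualMachineServiceNotFoundReason                         = "VirtualMachineServiceNotFound"
)

// markVirtualMachineServiceMissing would be called from the VSphereCluster
// reconcile loop when the expected VirtualMachineService no longer exists,
// making the problem visible on the object while still requiring manual fixing.
func markVirtualMachineServiceMissing(vsphereCluster conditions.Setter) {
	conditions.MarkFalse(vsphereCluster,
		LoadBalancerReadyCondition,
		VirtualMachineServiceNotFoundReason,
		clusterv1.ConditionSeverityError,
		"control plane VirtualMachineService no longer exists; manual recovery required")
}
```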

silvery1622 commented 5 months ago

From what I understand, the vm-operator isn't currently set up to handle VirtualMachineService deletions in its webhook: https://github.com/vmware-tanzu/vm-operator/blob/main/webhooks/virtualmachineservice/v1alpha2/validation/virtualmachineservice_validator.go#L103 Since CAPV is already in charge of creating the VirtualMachineService for the TKG control plane, what if we also let CAPV safeguard against accidental deletions via a webhook registered in a ValidatingWebhookConfiguration?
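
To make the proposal concrete, here is a hedged sketch of such a validator as a controller-runtime admission handler inside CAPV. Everything here is illustrative, including the assumption that the VirtualMachineService shares its name with the VSphereCluster; distinguishing intended from accidental deletes is exactly the open question raised elsewhere in this thread, so this version only allows the delete while the owning VSphereCluster is itself being deleted.

```go
// Illustrative sketch, not an existing CAPV webhook.
package webhooks

import (
	"context"
	"fmt"
	"net/http"

	admissionv1 "k8s.io/api/admission/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/types"
	vmwarev1 "sigs.k8s.io/cluster-api-provider-vsphere/apis/vmware/v1beta1"
	ctrlclient "sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/webhook/admission"
)

// virtualMachineServiceDeletionValidator rejects DELETE requests for the
// control plane VirtualMachineService while its VSphereCluster still exists.
type virtualMachineServiceDeletionValidator struct {
	Client ctrlclient.Client
}

func (v *virtualMachineServiceDeletionValidator) Handle(ctx context.Context, req admission.Request) admission.Response {
	if req.Operation != admissionv1.Delete {
		return admission.Allowed("")
	}

	// Convention assumed for this sketch: CAPV names the VirtualMachineService
	// after the cluster, so the object name doubles as the VSphereCluster name.
	vsphereCluster := &vmwarev1.VSphereCluster{}
	key := types.NamespacedName{Namespace: req.Namespace, Name: req.Name}
	if err := v.Client.Get(ctx, key, vsphereCluster); err != nil {
		if apierrors.IsNotFound(err) {
			return admission.Allowed("owning VSphereCluster is gone")
		}
		return admission.Errored(http.StatusInternalServerError, err)
	}

	if !vsphereCluster.DeletionTimestamp.IsZero() {
		return admission.Allowed("VSphereCluster is being deleted")
	}
	return admission.Denied(fmt.Sprintf(
		"VirtualMachineService %s/%s backs the control plane endpoint of VSphereCluster %s; refusing to delete it",
		req.Namespace, req.Name, vsphereCluster.Name))
}
```

Registering this handler would still need a ValidatingWebhookConfiguration pointing at CAPV's webhook server, plus the coordination with the vm-operator team discussed below.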

Also, @chrischdi’s idea about using a condition inside VSphereCluster objects could work well in parallel, giving us an extra layer of visibility into this issue.

sbueringer commented 5 months ago

The tricky part is how to distinguish between an intended and an accidental delete.

(independent of where we implement the webhook)

zhanggbj commented 5 months ago

Considering that the cluster has a dependency on the VirtualMachineService, and that the VSphereCluster controller creates the VirtualMachineService while it's vm-operator that manages it, I'm thinking on the CAPV side:

  1. Add a finalizer to the VirtualMachineService when the VSphereCluster controller creates it.
  2. Remove the finalizer when deleting VSphereCluster.
  3. We should reach out to the vm-operator team to inform them about this finalizer.

Regarding the ValidatingWebhook, I think it's more appropriate for the controller that manages the CR to implement one; in this case that's vm-operator. However, the validation needs cluster context to decide whether to prevent the deletion, and vm-operator is a separate controller, so I think it's more appropriate to add the finalizer from the CAPV side.
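
A minimal sketch of the finalizer idea, with a made-up finalizer name and the VirtualMachineService passed as a generic client.Object so no vm-operator Go types are assumed; as the following comments point out, this only delays removal of the API object and does not stop vm-operator from deprovisioning the underlying load balancer:

```go
// Illustrative sketch: the finalizer name is made up and this is not existing
// CAPV code.
package controllers

import (
	"context"

	ctrlclient "sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/controller/controllerutil"
)

// Hypothetical finalizer owned by CAPV.
const vmServiceFinalizer = "vspherecluster.vmware.infrastructure.cluster.x-k8s.io/virtualmachineservice"

// ensureVMServiceFinalizer corresponds to step 1: add the finalizer when the
// VSphereCluster controller creates the VirtualMachineService.
func ensureVMServiceFinalizer(ctx context.Context, c ctrlclient.Client, vmService ctrlclient.Object) error {
	if controllerutil.AddFinalizer(vmService, vmServiceFinalizer) {
		return c.Update(ctx, vmService)
	}
	return nil
}

// releaseVMServiceFinalizer corresponds to step 2: drop the finalizer while the
// VSphereCluster itself is being deleted, letting vm-operator finish the delete.
func releaseVMServiceFinalizer(ctx context.Context, c ctrlclient.Client, vmService ctrlclient.Object) error {
	if controllerutil.RemoveFinalizer(vmService, vmServiceFinalizer) {
		return c.Update(ctx, vmService)
	}
	return nil
}
```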

sbueringer commented 5 months ago

I think the idea of finalizers is to signal "oh, there is some cleanup work for me to do here", not to block other controllers from doing their deletion.

zhanggbj commented 5 months ago

Yes, I agree, that should indeed be the intention of a finalizer. However, I think it could be beneficial in this scenario to prevent the deletion of the VirtualMachineService until the VSphereCluster is deleted.

Another option is adding a ValidationWebhook within CAPV to reject delete requests for VirtualMachineService, as previously discussed.

I don't have strong opinions on either of these options. It appears to me that both options are trying to interfere with a CR managed by a separate controller, and both would require coordination with the vm-operator team.

chrischdi commented 5 months ago

Just adding a finalizer could still lead to users deleting the VirtualMachineService: they can still use kubectl delete --force --grace-period=0 to delete the object, or kubectl delete plus removing the finalizer.

Also: blocking the final deletion of the object with a finalizer may just mean "the object is still around", but the VirtualMachineService may already be deprovisioned, or we may have other side effects.

I think the scope of this issue for CAPV ("the VirtualMachineService is gone" without CAPV having deleted it) should be to handle all scenarios well:

sbueringer commented 5 months ago

Also: blocking the final deletion of the object with a finalizer may just mean "the object is still around", but the VirtualMachineService may already be deprovisioned, or we may have other side effects.

Yup, that's the point I was trying to make. We can keep the object around, but vm-operator will still go through with the deletion. And it would be a strange dependency if vm-operator skipped deletion just because our finalizer is present.

If vm-operator itself provides some sort of hook to block deletion, that's a different discussion (e.g. like CAPI has a before cluster delete hook)

zhanggbj commented 5 months ago

Now I get it. The key point is that adding a finalizer cannot prevent vm-operator from deleting the underlying resources, so it's not a viable option to safeguard the VirtualMachineService from unintended deletions.

About the other option mentioned above, if I understand it correctly: with a pre-deletion-hook approach, CAPV adds some annotation, the vm-operator recognizes it and pauses the reconcile/delete until CAPV removes the annotation. Obviously this needs discussion with the vm-operator team.
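
A tiny sketch of what such a contract could look like; the annotation key is hypothetical (loosely modeled on Cluster API's pre-drain/pre-terminate hook annotations) and the whole mechanism would need agreement with the vm-operator team:

```go
// Hypothetical contract, not an existing vm-operator or CAPV feature.
package hooks

// preDeleteHookAnnotation is a made-up annotation key that CAPV would set on
// the VirtualMachineService it creates.
const preDeleteHookAnnotation = "vspherecluster.vmware.infrastructure.cluster.x-k8s.io/pre-delete-hook"

// deletionHeld is the check a cooperating controller (vm-operator) would run
// before deprovisioning: as long as CAPV keeps the annotation, the
// reconcile/delete stays paused.
func deletionHeld(annotations map[string]string) bool {
	_, held := annotations[preDeleteHookAnnotation]
	return held
}
```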

sbueringer commented 5 months ago

Yup, that would be the difference.

k8s-triage-robot commented 2 months ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot commented 1 month ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot commented 2 weeks ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-ci-robot commented 2 weeks ago

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to [this](https://github.com/kubernetes-sigs/cluster-api-provider-vsphere/issues/2872#issuecomment-2351630243):

> The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
>
> This bot triages issues according to the following rules:
> - After 90d of inactivity, `lifecycle/stale` is applied
> - After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
> - After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed
>
> You can:
> - Reopen this issue with `/reopen`
> - Mark this issue as fresh with `/remove-lifecycle rotten`
> - Offer to help out with [Issue Triage][1]
>
> Please send feedback to sig-contributor-experience at [kubernetes/community](https://github.com/kubernetes/community).
>
> /close not-planned
>
> [1]: https://www.kubernetes.dev/docs/guide/issue-triage/

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository.