kubernetes-sigs / cluster-api-provider-vsphere


Long-running test: VSphereMachine is gone, but Machine is marked as "Running" #1660

Closed: jayunit100 closed this issue 1 year ago

jayunit100 commented 1 year ago

/kind bug

Note: This might, thus, be a cluster-api bug instead of a CAPV bug, but given that the VSphereMachine disappeared, maybe CAPV could do something to prevent this bug from occurring. I'm happy to close this issue and file it in upstream CAPI if folks think it would be better there.

What steps did you take and what happened:

I ran a test for 50 days, where I periodically used an iptables rule to cut off connectivity from the Workload Cluster to the Management Cluster, i.e.

 # drop all traffic arriving on eth1 from the management-cluster address
 sudo iptables -A INPUT -i eth1 -s 10.180.83.152 -j DROP
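
To restore connectivity between test windows, the same rule can be deleted again (a sketch mirroring the rule above):

 # remove the DROP rule, restoring traffic from the management cluster
 sudo iptables -D INPUT -i eth1 -s 10.180.83.152 -j DROP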

Workarounds?

I tried to manually delete the machine, but that seemed not to work; see "Workaround?" below for details.

Details

This can be seen in the logs below:

kubo@ubuntu-2004:~$ kubectl --context tkg-mgmt-vc-admin@tkg-mgmt-vc get vspheremachine -A
NAMESPACE    NAME                                CLUSTER         READY   PROVIDERID
default      tkg-vc-antrea-control-plane-67x6p   tkg-vc-antrea   true    vsphere://421e327c-d5ab-f0e0-1660-b8008fe1272d
tkg-system   tkg-mgmt-vc-control-plane-wh6dd     tkg-mgmt-vc     true    vsphere://421e5076-ebf4-9df7-13c0-c6bb8b7e2122
tkg-system   tkg-mgmt-vc-worker-ph8l6            tkg-mgmt-vc     true    vsphere://421e54bc-c5a9-7db4-e9ed-62c83e1f0ef0

(Note that there is only one node, the CP node, in the tkg-vc-antrea workload cluster.)

Now, looking at the Machines below, we'll see there is a ghost Machine marked as "Running" even though it is non-existent (tkg-vc-antrea-md-0-b79b98c6d-8sfjs):

kubo@ubuntu-2004:~$ kubectl --context tkg-mgmt-vc-admin@tkg-mgmt-vc get machine -A
NAMESPACE    NAME                                 CLUSTER         NODENAME                             PROVIDERID                                       PHASE     AGE   VERSION
default      tkg-vc-antrea-control-plane-ltd7f    tkg-vc-antrea   tkg-vc-antrea-control-plane-ltd7f    vsphere://421e327c-d5ab-f0e0-1660-b8008fe1272d   Running   51d   v1.23.8+vmware.2
default      tkg-vc-antrea-md-0-b79b98c6d-8sfjs   tkg-vc-antrea   tkg-vc-antrea-md-0-b79b98c6d-8sfjs   vsphere://421eae53-7022-92d2-4808-4f5550a94bfa   Running   51d   v1.23.8+vmware.2
default      tkg-vc-antrea-md-0-b79b98c6d-phmxc   tkg-vc-antrea                                                                                         Pending   75m   v1.23.8+vmware.2
tkg-system   tkg-mgmt-vc-control-plane-g7szv      tkg-mgmt-vc     tkg-mgmt-vc-control-plane-g7szv      vsphere://421e5076-ebf4-9df7-13c0-c6bb8b7e2122   Running   51d   v1.23.8+vmware.2
tkg-system   tkg-mgmt-vc-md-0-7bd8b547f4-m6pt4    tkg-mgmt-vc     tkg-mgmt-vc-md-0-7bd8b547f4-m6pt4    vsphere://421e54bc-c5a9-7db4-e9ed-62c83e1f0ef0   Running   51d   v1.23.8+vmware.2

Notes

I see no evidence in the logs that capv-controller-manager is attempting to do anything with the above machine in this MachineDeployment; i.e., kubectl --context tkg-mgmt-vc-admin@tkg-mgmt-vc logs -f capv-controller-manager-bcd4d7496-zl4gx -n capv-system | grep tkg-vc-antrea-md-0- turns up empty.
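
The same grep against capi-controller-manager is what surfaces the error shown further below. A sketch, assuming the upstream CAPI default deployment name and namespace:

# check the CAPI controller logs for any mention of the ghost MachineDeployment's machines
kubectl --context tkg-mgmt-vc-admin@tkg-mgmt-vc logs -n capi-system \
  deploy/capi-controller-manager | grep tkg-vc-antrea-md-0-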

Version

1.3.1

cluster-api-vsphere-controller:v1.3.1_vmware.1 

Workaround?

I tried to delete the ghost machine, tkg-vc-antrea-md-0-b79b98c6d-8sfjs; however, it didn't go away. It has a finalizer on it: machine.cluster.x-k8s.io. So I figured maybe CAPI is the one that owns that finalizer, and I looked in capi-controller-manager... What I found was that it seemed unhappy (log below)...
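
For reference, the stuck finalizer can be confirmed directly; a sketch using the context and Machine name from above:

# print the finalizers still present on the ghost Machine
kubectl --context tkg-mgmt-vc-admin@tkg-mgmt-vc get machine tkg-vc-antrea-md-0-b79b98c6d-8sfjs \
  -o jsonpath='{.metadata.finalizers}'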

E1022 18:02:21.589293       1 controller.go:317] controller/machine "msg"="Reconciler error" "error"="error creating client and cache for remote cluster: error creating dynamic rest mapper for remote cluster \"default/tkg-vc-antrea\": Get \"https://10.180.81.244:6443/api?timeout=10s\": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)" "name"="tkg-vc-antrea-md-0-b79b98c6d-rbpgs" "namespace"="default" "reconciler group"="cluster.x-k8s.io" "reconciler kind"="Machine"

So, my hypothesis here is that you can't delete Machines whose IPs are unreachable, because the finalizer logic in CAPI fails. When there's a netsplit between CAPI and the nodes it is monitoring, this prevents healthy cleanup of those nodes after the fact.

jayunit100 commented 1 year ago

This seems to be associated with a netsplit: some of the nodes on my mgmt cluster (the workers that run capi-controller-manager) permanently lost connectivity to the API server (kube-vip) on the Workload Clusters.

Seems like resource reconciliation should still work, though, even if such netsplits happen, as long as the capv-controller-manager can still talk to vSphere...
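
One way to check the vSphere side independently of the controllers is govc. A sketch, assuming GOVC_URL/GOVC_USERNAME/GOVC_PASSWORD point at the same vCenter; the VM name is illustrative:

# look up the VM backing the ghost Machine; no output means it really is gone from vSphere
govc vm.info tkg-vc-antrea-md-0-b79b98c6d-8sfjs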

I drew a diagram of the netsplit after doing more investigation:

[netsplit diagram]

jayunit100 commented 1 year ago

I'm assuming that draining any mgmt worker VMs that are unable to connect to the WL apiserver is a workaround; will test, as sketched below...
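
A sketch of that drain; the node name is illustrative, and the flags are the usual ones for nodes running controller pods:

# evict the controller pods off the partitioned mgmt worker so they
# reschedule onto a node that can still reach the workload API server
kubectl --context tkg-mgmt-vc-admin@tkg-mgmt-vc drain tkg-mgmt-vc-md-0-7bd8b547f4-m6pt4 \
  --ignore-daemonsets --delete-emptydir-data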

jayunit100 commented 1 year ago

Ah, "Deleting a kubernetes node associated with a machine is not allowed" after I bounced the capi-controller to a healthy, connected node... (just a note to self)... will see how to resolve that later...

jayunit100 commented 1 year ago

After trying to recover, the state looked like this (kubectlmgmt below is presumably a shell alias for kubectl against the tkg-mgmt-vc management-cluster context):


kubo@ubuntu-2004:~$ kubectlmgmt get vspheremachine -o wide
NAME                                CLUSTER         READY   PROVIDERID                                       MACHINE
tkg-vc-antrea-control-plane-67x6p   tkg-vc-antrea   true    vsphere://421e327c-d5ab-f0e0-1660-b8008fe1272d   tkg-vc-antrea-control-plane-ltd7f
tkg-vc-antrea-worker-8w6wx          tkg-vc-antrea
tkg-vc-antrea-worker-lksrl          tkg-vc-antrea
tkg-vc-antrea-worker-vw5n9          tkg-vc-antrea   true    vsphere://421e616a-0212-5a90-efb7-588b201f9006   tkg-vc-antrea-md-0-b79b98c6d-6kb74
kubo@ubuntu-2004:~$ kubectlmgmt get machine -o wide
NAME                                 CLUSTER         NODENAME                             PROVIDERID                                       PHASE          AGE     VERSION
tkg-vc-antrea-control-plane-ltd7f    tkg-vc-antrea   tkg-vc-antrea-control-plane-ltd7f    vsphere://421e327c-d5ab-f0e0-1660-b8008fe1272d   Running        53d     v1.23.8+vmware.2
tkg-vc-antrea-md-0-b79b98c6d-448ml   tkg-vc-antrea                                                                                         Provisioning   8m57s   v1.23.8+vmware.2
tkg-vc-antrea-md-0-b79b98c6d-6kb74   tkg-vc-antrea   tkg-vc-antrea-md-0-b79b98c6d-6kb74   vsphere://421e616a-0212-5a90-efb7-588b201f9006   Running        8m27s   v1.23.8+vmware.2
jayunit100 commented 1 year ago

Ultimately, I was able to fix everything:

kubo@ubuntu-2004:~$ kubectlmgmt get vspheremachine -o wide
NAME                                CLUSTER         READY   PROVIDERID                                       MACHINE
tkg-vc-antrea-control-plane-67x6p   tkg-vc-antrea   true    vsphere://421e327c-d5ab-f0e0-1660-b8008fe1272d   tkg-vc-antrea-control-plane-ltd7f
tkg-vc-antrea-worker-v2vnw          tkg-vc-antrea   true    vsphere://421e1267-7ed9-11a7-e2ba-6c13b964fc2f   tkg-vc-antrea-md-0-b79b98c6d-l5czh
tkg-vc-antrea-worker-vw5n9          tkg-vc-antrea   true    vsphere://421e616a-0212-5a90-efb7-588b201f9006   tkg-vc-antrea-md-0-b79b98c6d-6kb74
kubo@ubuntu-2004:~$
kubo@ubuntu-2004:~$
kubo@ubuntu-2004:~$ kubectlmgmt get vspheremachine -o wide -A
NAMESPACE    NAME                                CLUSTER         READY   PROVIDERID                                       MACHINE
default      tkg-vc-antrea-control-plane-67x6p   tkg-vc-antrea   true    vsphere://421e327c-d5ab-f0e0-1660-b8008fe1272d   tkg-vc-antrea-control-plane-ltd7f
default      tkg-vc-antrea-worker-v2vnw          tkg-vc-antrea   true    vsphere://421e1267-7ed9-11a7-e2ba-6c13b964fc2f   tkg-vc-antrea-md-0-b79b98c6d-l5czh
default      tkg-vc-antrea-worker-vw5n9          tkg-vc-antrea   true    vsphere://421e616a-0212-5a90-efb7-588b201f9006   tkg-vc-antrea-md-0-b79b98c6d-6kb74
tkg-system   tkg-mgmt-vc-control-plane-wh6dd     tkg-mgmt-vc     true    vsphere://421e5076-ebf4-9df7-13c0-c6bb8b7e2122   tkg-mgmt-vc-control-plane-g7szv
tkg-system   tkg-mgmt-vc-worker-ph8l6            tkg-mgmt-vc     true    vsphere://421e54bc-c5a9-7db4-e9ed-62c83e1f0ef0   tkg-mgmt-vc-md-0-7bd8b547f4-m6pt4

by just carefully deleting finalizers and the "bounce the pod to a connected node" trick above.
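
A sketch of one way to remove a stuck finalizer (stripping finalizers skips normal cleanup, so only do it once the backing VM is confirmed gone from vSphere):

# remove all finalizers from the ghost Machine so the API server can finish deleting it
kubectl --context tkg-mgmt-vc-admin@tkg-mgmt-vc patch machine tkg-vc-antrea-md-0-b79b98c6d-8sfjs \
  --type=merge -p '{"metadata":{"finalizers":null}}'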

I'll file a follow-on issue to have capi-controller-manager proactively fail in the event of netsplits.

k8s-triage-robot commented 1 year ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

- Mark this issue or PR as fresh with /remove-lifecycle stale
- Mark this issue or PR as rotten with /lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot commented 1 year ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

- Mark this issue as fresh with /remove-lifecycle rotten
- Close this issue with /close
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot commented 1 year ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

- Reopen this issue with /reopen
- Mark this issue as fresh with /remove-lifecycle rotten
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-ci-robot commented 1 year ago

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to [this](https://github.com/kubernetes-sigs/cluster-api-provider-vsphere/issues/1660#issuecomment-1480462363):

> The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
>
> This bot triages issues according to the following rules:
>
> - After 90d of inactivity, `lifecycle/stale` is applied
> - After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
> - After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed
>
> You can:
>
> - Reopen this issue with `/reopen`
> - Mark this issue as fresh with `/remove-lifecycle rotten`
> - Offer to help out with [Issue Triage](https://www.kubernetes.dev/docs/guide/issue-triage/)
>
> Please send feedback to sig-contributor-experience at [kubernetes/community](https://github.com/kubernetes/community).
>
> /close not-planned

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.