Closed: jayunit100 closed this issue 1 year ago
This seems to be associated w/ a netsplit between some of the nodes on my mgmt cluster (the workers that run capi-controller-manager) losing connectivity permanently to the APIServer (kube-vip) on the Workload clusters.
Seems like resource reconciliation should still work, though, even if such netsplits happen, as long as the capv-controller-manager can still talk to vSphere instances...
Drew a diagram of the netsplit here after doing more investigation...
Am assuming that draining any (mgmt worker VMs) that are unable to connect to the (WL apiserver) is a workaround; will test...
ah, "Deleting a kubernetes node associated with a machine is not allowed" after i bounced the capi-controller to a healthy, connected node.... (just a note to self)... will see how to resolve that later....
After trying to
```
NAME                                CLUSTER         READY   PROVIDERID                                       MACHINE
tkg-vc-antrea-control-plane-67x6p   tkg-vc-antrea   true    vsphere://421e327c-d5ab-f0e0-1660-b8008fe1272d   tkg-vc-antrea-control-plane-ltd7f
tkg-vc-antrea-worker-8w6wx          tkg-vc-antrea
tkg-vc-antrea-worker-lksrl          tkg-vc-antrea
tkg-vc-antrea-worker-vw5n9          tkg-vc-antrea   true    vsphere://421e616a-0212-5a90-efb7-588b201f9006   tkg-vc-antrea-md-0-b79b98c6d-6kb74
```
```
kubo@ubuntu-2004:~$ kubectlmgmt get machine -o wide
NAME                                 CLUSTER         NODENAME                             PROVIDERID                                       PHASE          AGE     VERSION
tkg-vc-antrea-control-plane-ltd7f    tkg-vc-antrea   tkg-vc-antrea-control-plane-ltd7f    vsphere://421e327c-d5ab-f0e0-1660-b8008fe1272d   Running        53d     v1.23.8+vmware.2
tkg-vc-antrea-md-0-b79b98c6d-448ml   tkg-vc-antrea                                                                                         Provisioning   8m57s   v1.23.8+vmware.2
tkg-vc-antrea-md-0-b79b98c6d-6kb74   tkg-vc-antrea   tkg-vc-antrea-md-0-b79b98c6d-6kb74   vsphere://421e616a-0212-5a90-efb7-588b201f9006   Running        8m27s   v1.23.8+vmware.2
```
Ultimately, I was able to fix everything back:
```
kubo@ubuntu-2004:~$ kubectlmgmt get vspheremachine -o wide
NAME                                CLUSTER         READY   PROVIDERID                                       MACHINE
tkg-vc-antrea-control-plane-67x6p   tkg-vc-antrea   true    vsphere://421e327c-d5ab-f0e0-1660-b8008fe1272d   tkg-vc-antrea-control-plane-ltd7f
tkg-vc-antrea-worker-v2vnw          tkg-vc-antrea   true    vsphere://421e1267-7ed9-11a7-e2ba-6c13b964fc2f   tkg-vc-antrea-md-0-b79b98c6d-l5czh
tkg-vc-antrea-worker-vw5n9          tkg-vc-antrea   true    vsphere://421e616a-0212-5a90-efb7-588b201f9006   tkg-vc-antrea-md-0-b79b98c6d-6kb74
kubo@ubuntu-2004:~$ kubectlmgmt get vspheremachine -o wide -A
NAMESPACE    NAME                                CLUSTER         READY   PROVIDERID                                       MACHINE
default      tkg-vc-antrea-control-plane-67x6p   tkg-vc-antrea   true    vsphere://421e327c-d5ab-f0e0-1660-b8008fe1272d   tkg-vc-antrea-control-plane-ltd7f
default      tkg-vc-antrea-worker-v2vnw          tkg-vc-antrea   true    vsphere://421e1267-7ed9-11a7-e2ba-6c13b964fc2f   tkg-vc-antrea-md-0-b79b98c6d-l5czh
default      tkg-vc-antrea-worker-vw5n9          tkg-vc-antrea   true    vsphere://421e616a-0212-5a90-efb7-588b201f9006   tkg-vc-antrea-md-0-b79b98c6d-6kb74
tkg-system   tkg-mgmt-vc-control-plane-wh6dd     tkg-mgmt-vc     true    vsphere://421e5076-ebf4-9df7-13c0-c6bb8b7e2122   tkg-mgmt-vc-control-plane-g7szv
tkg-system   tkg-mgmt-vc-worker-ph8l6            tkg-mgmt-vc     true    vsphere://421e54bc-c5a9-7db4-e9ed-62c83e1f0ef0   tkg-mgmt-vc-md-0-7bd8b547f4-m6pt4
```
This was done by just carefully deleting finalizers, plus the above "bounce the pod to a connected node" trick.
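The finalizer deletion was along these lines (a sketch; this force-clears all finalizers and skips the controller's normal teardown, so use with care):

```
# JSON merge patch: setting finalizers to null removes them,
# letting the stuck Machine object be garbage collected
kubectl patch machine tkg-vc-antrea-md-0-b79b98c6d-8sfjs --type merge -p '{"metadata":{"finalizers":null}}'
```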
I'll file a follow-on issue to have capi-controller-manager proactively fail in the event of netsplits.
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

- Mark this issue or PR as fresh with /remove-lifecycle stale
- Mark this issue or PR as rotten with /lifecycle rotten
- Close this issue or PR with /close

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

- Mark this issue as fresh with /remove-lifecycle rotten
- Close this issue with /close

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

- Reopen this issue with /reopen
- Mark this issue as fresh with /remove-lifecycle rotten

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned
@k8s-triage-robot: Closing this issue, marking it as "Not Planned".
/kind bug
Note: This might, thus, be a cluster-api bug instead of a CAPV bug, but I figure the fact that the VSphereMachine disappeared means that maybe CAPV could do something to prevent this bug from occurring. I'm happy to close this issue and file it in upstream CAPI if folks think it would be better there.
What steps did you take and what happened:
I ran a test for 50 days, where I periodically used an iptables rule to cut off connectivity from a Workload Cluster to a Management Cluster, i.e. something like the sketch below (the exact rule used isn't preserved in this issue).
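Illustrative only; the VIP address and port here are made up, not the ones from the actual test:

```
# Simulate the netsplit: drop traffic to the other cluster's
# kube-vip apiserver VIP
iptables -A OUTPUT -d 10.0.0.100 -p tcp --dport 6443 -j DROP
# Delete the rule later to heal the netsplit
iptables -D OUTPUT -d 10.0.0.100 -p tcp --dport 6443 -j DROP
```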
Workarounds?
I tried to manually delete the machine, but that seemed to not work.
Details
This can be seen in the logs below. (Note that there is only one node, the CP node, in the tkg-vc-antrea workload cluster.) Now, looking at the "machines" below, we'll see there's a ghost machine in the Running phase which is non-existent (tkg-vc-antrea-md-0-b79b98c6d-8sfjs)....
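One way to spot such ghosts (a sketch; the context names here are made up, assuming admin access to both clusters) is to diff the node names that Machines claim to back against the nodes that actually exist:

```
# Node names the mgmt cluster's Machines claim to back
kubectl --context mgmt get machines -o jsonpath='{range .items[*]}{.status.nodeRef.name}{"\n"}{end}' | grep . | sort > machine-nodes.txt
# Nodes that actually exist in the workload cluster
kubectl --context workload get nodes -o name | sed 's|^node/||' | sort > real-nodes.txt
# Machines whose backing node is gone are the ghosts
comm -23 machine-nodes.txt real-nodes.txt
```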
Notes
I see no evidence in the logs that capv-controller-manager is attempting to do anything with the above machine in this machine deployment, i.e.
```
kubectl --context tkg-mgmt-vc-admin@tkg-mgmt-vc logs -f capv-controller-manager-bcd4d7496-zl4gx -n capv-system | grep tkg-vc-antrea-md-0-
```
turns up empty.
Version
1.3.1
Workaround?
I tried to delete the ghost machine, tkg-vc-antrea-md-0-b79b98c6d-8sfjs; however, it didn't go away. It has a finalizer on it: machine.cluster.x-k8s.io. So I figured maybe CAPI is the one who owns that finalizer, and I looked in capi-controller-manager... What I found was that it seemed unhappy...
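A quick way to confirm what's blocking the deletion (same ghost machine as above):

```
# Prints the finalizer(s) holding the object, e.g. ["machine.cluster.x-k8s.io"]
kubectl get machine tkg-vc-antrea-md-0-b79b98c6d-8sfjs -o jsonpath='{.metadata.finalizers}'
```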
So, my hypothesis here is that you can't delete machines whose IPs are unreachable, because the finalizer logic in CAPI becomes unhappy. When there's a netsplit between CAPI and the nodes it is monitoring, this prevents healthy cleanup of those nodes after the fact.