Closed randomvariable closed 3 years ago
/assign @fabriziopandini /milestone v0.4
@randomvariable is it possible to have KCP logs for this use case? Also, which version are you running?
I think this relates to https://github.com/kubernetes-sigs/cluster-api/issues/4365 . We should be able to go further with the remediation if the machine is not part of the etcd cluster.
Logs:
State of vSphereMachines:
NAMESPACE   NAME                                        PROVIDERID                                       PHASE
ysb-ns4     ysb-ns4-c4-control-plane-fvf4s                                                               Provisioning
ysb-ns4     ysb-ns4-c4-control-plane-qv2pk              vsphere://423ae9c3-4181-966b-6ee6-ada51ef4e020   Running
ysb-ns4     ysb-ns4-c4-control-plane-srtxw              vsphere://423a7ce1-124a-a9cd-245a-c830e2c19b9e   Running
ysb-ns4     ysb-ns4-c4-workers-8hnw4-5889fb6fb6-hmk6x   vsphere://423a36c3-c147-6b7d-6a96-686b426bc0a1   Running
ysb-ns4     ysb-ns4-c4-workers-8hnw4-5889fb6fb6-sdxrh   vsphere://423ad140-23cf-2384-986c-3c672b10789a   Running
ysb-ns4     ysb-ns4-c4-workers-8hnw4-5889fb6fb6-tjszm   vsphere://423af22f-124e-11ec-b358-b5bcfbd87c6f   Running
fvf4s is the one of interest, stuck in Provisioning. It's inaccessible via VMware vCenter due to a networking glitch that occurred at the moment it powered on. GuestInfo reports no IP addresses, and the machine is sitting at 47 MHz with 200 MB memory usage, compared to the other CP nodes running at 2.1 GHz.
Describe of kcp:
Name: ysb-ns4-c4-control-plane
Namespace: ysb-ns4
Labels: cluster.x-k8s.io/cluster-name=ysb-ns4-c4
Annotations: controlplane.cluster.x-k8s.io/skip-coredns:
             controlplane.cluster.x-k8s.io/skip-kube-proxy:
API Version: controlplane.cluster.x-k8s.io/v1alpha3
Kind: KubeadmControlPlane
Metadata:
  Creation Timestamp: 2021-04-08T08:39:35Z
  Finalizers:
    kubeadm.controlplane.cluster.x-k8s.io
  Generation: 3
  Managed Fields:
    ...
    Manager: manager
    Operation: Update
    Time: 2021-04-13T18:46:10Z
  Owner References:
    API Version: cluster.x-k8s.io/v1alpha3
    Block Owner Deletion: true
    Controller: true
    Kind: Cluster
    Name: ysb-ns4-c4
    UID: a3a517f8-61c7-4e5c-9562-8957300476b9
  Resource Version: 5590968
  Self Link: /apis/controlplane.cluster.x-k8s.io/v1alpha3/namespaces/ysb-ns4/kubeadmcontrolplanes/ysb-ns4-c4-control-plane
  UID: 35d42989-0636-47e9-ab08-57d69e76cf75
Spec:
  Infrastructure Template:
    API Version: infrastructure.cluster.vmware.com/v1alpha3
    Kind: WCPMachineTemplate
    Name: ysb-ns4-c4-control-plane-w756n
    Namespace: ysb-ns4
  Kubeadm Config Spec:
    Cluster Configuration:
      API Server:
        ...
  Replicas: 3
  Version: v1.19.7+vmware.1
Status:
  Conditions:
    Last Transition Time: 2021-04-13T18:46:12Z
    Message: Scaling up control plane to 5 replicas (actual 3)
    Reason: ScalingUp
    Severity: Warning
    Status: False
    Type: Ready
    Last Transition Time: 2021-04-08T08:47:11Z
    Status: True
    Type: Available
    Last Transition Time: 2021-04-08T08:39:43Z
    Status: True
    Type: CertificatesAvailable
    Last Transition Time: 2021-04-13T19:15:58Z
    Status: True
    Type: ControlPlaneComponentsHealthy
    Last Transition Time: 2021-04-13T16:24:18Z
    Status: True
    Type: EtcdClusterHealthyCondition
    Last Transition Time: 2021-04-13T18:24:02Z
    Message: Node failed to report startup in 2h0m0s
    Reason: NodeStartupTimeout @ Machine/ysb-ns4-c4-control-plane-fvf4s
    Severity: Warning
    Status: False
    Type: MachinesReady
    Last Transition Time: 2021-04-13T18:46:12Z
    Message: Scaling up control plane to 5 replicas (actual 3)
    Reason: ScalingUp
    Severity: Warning
    Status: False
    Type: Resized
  Initialized: true
  Observed Generation: 1
  Ready: true
  Ready Replicas: 2
  Replicas: 3
  Selector: cluster.x-k8s.io/cluster-name=ysb-ns4-c4,cluster.x-k8s.io/control-plane
  Unavailable Replicas: 1
  Updated Replicas: 3
Events:
  Type Reason Age From Message
  ---- ------ ---- ---- -------
  Warning ControlPlaneUnhealthy 33m kubeadm-control-plane-controller Waiting for control plane to pass preflight checks to continue reconciliation: [machine ysb-ns4-c4-control-plane-fvf4s reports APIServerPodHealthy condition is unknown (Failed to get the node which is hosting this component), machine ysb-ns4-c4-control-plane-fvf4s reports ControllerManagerPodHealthy condition is unknown (Failed to get the node which is hosting this component), machine ysb-ns4-c4-control-plane-fvf4s reports SchedulerPodHealthy condition is unknown (Failed to get the node which is hosting this component), machine ysb-ns4-c4-control-plane-fvf4s reports EtcdPodHealthy condition is unknown (Failed to get the node which is hosting this component), machine ysb-ns4-c4-control-plane-fvf4s reports EtcdMemberHealthy condition is unknown (Failed to get the node which is hosting the etcd member)]
Relevant CAPI logs:
I0413 19:22:22.837171 1 machinehealthcheck_controller.go:109] controllers/MachineHealthCheck "msg"="Reconciling" "machinehealthcheck"="ysb-ns4-c4-control-plane" "namespace"="ysb-ns4"
I0413 19:22:22.837286 1 machinehealthcheck_controller.go:109] controllers/MachineHealthCheck "msg"="Reconciling" "machinehealthcheck"="ysb-ns4-c4-workers-8hnw4" "namespace"="ysb-ns4"
I0413 19:22:40.170856 1 machinehealthcheck_controller.go:387] controllers/MachineHealthCheck "msg"="Target has failed health check, marking for remediation" "message"="Node failed to report startup in 2h0m0s" "reason"="NodeStartupTimeout" "target"="ysb-ns4/ysb-ns4-c4-control-plane/ysb-ns4-c4-control-plane-fvf4s/"
I0413 19:22:40.590411 1 machine_controller_phases.go:278] controllers/Machine "msg"="Infrastructure provider is not ready, requeuing" "machine"="ysb-ns4-c4-control-plane-fvf4s" "namespace"="ysb-ns4"
I0413 19:22:40.590491 1 machine_controller_noderef.go:42] controllers/Machine "msg"="Cannot reconcile Machine's Node, no valid ProviderID yet" "machine"="ysb-ns4-c4-control-plane-fvf4s" "namespace"="ysb-ns4"
I0413 19:22:40.689253 1 machinehealthcheck_controller.go:109] controllers/MachineHealthCheck "msg"="Reconciling" "machinehealthcheck"="ysb-ns4-c4-control-plane" "namespace"="ysb-ns4"
I0413 19:22:41.040122 1 machinehealthcheck_controller.go:109] controllers/MachineHealthCheck "msg"="Reconciling" "machinehealthcheck"="ysb-ns4-c4-workers-8hnw4" "namespace"="ysb-ns4"
I0413 19:22:41.141253 1 machinehealthcheck_controller.go:387] controllers/MachineHealthCheck "msg"="Target has failed health check, marking for remediation" "message"="Node failed to report startup in 2h0m0s" "reason"="NodeStartupTimeout" "target"="ysb-ns4/ysb-ns4-c4-control-plane/ysb-ns4-c4-control-plane-fvf4s/"
I0413 19:22:41.630961 1 machine_controller_phases.go:278] controllers/Machine "msg"="Infrastructure provider is not ready, requeuing" "machine"="ysb-ns4-c4-control-plane-fvf4s" "namespace"="ysb-ns4"
I0413 19:22:41.631026 1 machine_controller_noderef.go:42] controllers/Machine "msg"="Cannot reconcile Machine's Node, no valid ProviderID yet" "machine"="ysb-ns4-c4-control-plane-fvf4s" "namespace"="ysb-ns4"
I0413 19:22:41.953713 1 machinehealthcheck_controller.go:109] controllers/MachineHealthCheck "msg"="Reconciling" "machinehealthcheck"="ysb-ns4-c4-control-plane" "namespace"="ysb-ns4"
I0413 19:22:42.023488 1 machinehealthcheck_controller.go:387] controllers/MachineHealthCheck "msg"="Target has failed health check, marking for remediation" "message"="Node failed to report startup in 2h0m0s" "reason"="NodeStartupTimeout" "target"="ysb-ns4/ysb-ns4-c4-control-plane/ysb-ns4-c4-control-plane-fvf4s/"
I0413 19:22:42.217046 1 machinehealthcheck_controller.go:109] controllers/MachineHealthCheck "msg"="Reconciling" "machinehealthcheck"="ysb-ns4-c4-control-plane" "namespace"="ysb-ns4"
I0413 19:22:42.241332 1 machinehealthcheck_controller.go:387] controllers/MachineHealthCheck "msg"="Target has failed health check, marking for remediation" "message"="Node failed to report startup in 2h0m0s" "reason"="NodeStartupTimeout" "target"="ysb-ns4/ysb-ns4-c4-control-plane/ysb-ns4-c4-control-plane-fvf4s/"
I0413 19:22:45.904532 1 machinehealthcheck_controller.go:109] controllers/MachineHealthCheck "msg"="Reconciling" "machinehealthcheck"="ysb-ns4-c4-control-plane" "namespace"="ysb-ns4"
I0413 19:22:45.906450 1 machinehealthcheck_controller.go:387] controllers/MachineHealthCheck "msg"="Target has failed health check, marking for remediation" "message"="Node failed to report startup in 2h0m0s" "reason"="NodeStartupTimeout" "target"="ysb-ns4/ysb-ns4-c4-control-plane/ysb-ns4-c4-control-plane-fvf4s/"
I0413 19:22:46.014713 1 machine_controller_phases.go:278] controllers/Machine "msg"="Infrastructure provider is not ready, requeuing" "machine"="ysb-ns4-c4-control-plane-fvf4s" "namespace"="ysb-ns4"
I0413 19:22:46.015680 1 machine_controller_noderef.go:42] controllers/Machine "msg"="Cannot reconcile Machine's Node, no valid ProviderID yet" "machine"="ysb-ns4-c4-control-plane-fvf4s" "namespace"="ysb-ns4"
I0413 19:22:46.085143 1 machinehealthcheck_controller.go:109] controllers/MachineHealthCheck "msg"="Reconciling" "machinehealthcheck"="ysb-ns4-c4-control-plane" "namespace"="ysb-ns4"
I0413 19:22:46.087401 1 machinehealthcheck_controller.go:387] controllers/MachineHealthCheck "msg"="Target has failed health check, marking for remediation" "message"="Node failed to report startup in 2h0m0s" "reason"="NodeStartupTimeout" "target"="ysb-ns4/ysb-ns4-c4-control-plane/ysb-ns4-c4-control-plane-fvf4s/"
I0413 19:22:46.295597 1 machine_controller_phases.go:278] controllers/Machine "msg"="Infrastructure provider is not ready, requeuing" "machine"="ysb-ns4-c4-control-plane-fvf4s" "namespace"="ysb-ns4"
I0413 19:22:46.295659 1 machine_controller_noderef.go:42] controllers/Machine "msg"="Cannot reconcile Machine's Node, no valid ProviderID yet" "machine"="ysb-ns4-c4-control-plane-fvf4s" "namespace"="ysb-ns4"
I0413 19:22:53.393989 1 machinehealthcheck_controller.go:109] controllers/MachineHealthCheck "msg"="Reconciling" "machinehealthcheck"="ysb-ns4-c4-control-plane" "namespace"="ysb-ns4"
I0413 19:22:53.397133 1 machinehealthcheck_controller.go:387] controllers/MachineHealthCheck "msg"="Target has failed health check, marking for remediation" "message"="Node failed to report startup in 2h0m0s" "reason"="NodeStartupTimeout" "target"="ysb-ns4/ysb-ns4-c4-control-plane/ysb-ns4-c4-control-plane-fvf4s/"
I0413 19:22:53.613260 1 machinehealthcheck_controller.go:109] controllers/MachineHealthCheck "msg"="Reconciling" "machinehealthcheck"="ysb-ns4-c4-control-plane" "namespace"="ysb-ns4"
I0413 19:22:53.614590 1 machinehealthcheck_controller.go:387] controllers/MachineHealthCheck "msg"="Target has failed health check, marking for remediation" "message"="Node failed to report startup in 2h0m0s" "reason"="NodeStartupTimeout" "target"="ysb-ns4/ysb-ns4-c4-control-plane/ysb-ns4-c4-control-plane-fvf4s/"
I0413 19:23:10.643572 1 machine_controller_phases.go:278] controllers/Machine "msg"="Infrastructure provider is not ready, requeuing" "machine"="ysb-ns4-c4-control-plane-fvf4s" "namespace"="ysb-ns4"
I0413 19:23:10.643629 1 machine_controller_noderef.go:42] controllers/Machine "msg"="Cannot reconcile Machine's Node, no valid ProviderID yet" "machine"="ysb-ns4-c4-control-plane-fvf4s" "namespace"="ysb-ns4"
I0413 19:23:40.666989 1 machine_controller_phases.go:278] controllers/Machine "msg"="Infrastructure provider is not ready, requeuing" "machine"="ysb-ns4-c4-control-plane-fvf4s" "namespace"="ysb-ns4"
I0413 19:23:40.667155 1 machine_controller_noderef.go:42] controllers/Machine "msg"="Cannot reconcile Machine's Node, no valid ProviderID yet" "machine"="ysb-ns4-c4-control-plane-fvf4s" "namespace"="ysb-ns4"
I0413 19:24:10.695606 1 machine_controller_phases.go:278] controllers/Machine "msg"="Infrastructure provider is not ready, requeuing" "machine"="ysb-ns4-c4-control-plane-fvf4s" "namespace"="ysb-ns4"
I0413 19:24:10.695729 1 machine_controller_noderef.go:42] controllers/Machine "msg"="Cannot reconcile Machine's Node, no valid ProviderID yet" "machine"="ysb-ns4-c4-control-plane-fvf4s" "namespace"="ysb-ns4"
I0413 19:24:40.540926 1 machinehealthcheck_controller.go:109] controllers/MachineHealthCheck "msg"="Reconciling" "machinehealthcheck"="ysb-ns4-c4-workers-8hnw4" "namespace"="ysb-ns4"
I0413 19:24:40.713318 1 machine_controller_phases.go:278] controllers/Machine "msg"="Infrastructure provider is not ready, requeuing" "machine"="ysb-ns4-c4-control-plane-fvf4s" "namespace"="ysb-ns4"
I0413 19:24:40.713380 1 machine_controller_noderef.go:42] controllers/Machine "msg"="Cannot reconcile Machine's Node, no valid ProviderID yet" "machine"="ysb-ns4-c4-control-plane-fvf4s" "namespace"="ysb-ns4"
I0413 19:25:10.729967 1 machine_controller_phases.go:278] controllers/Machine "msg"="Infrastructure provider is not ready, requeuing" "machine"="ysb-ns4-c4-control-plane-fvf4s" "namespace"="ysb-ns4"
I0413 19:25:10.730034 1 machine_controller_noderef.go:42] controllers/Machine "msg"="Cannot reconcile Machine's Node, no valid ProviderID yet" "machine"="ysb-ns4-c4-control-plane-fvf4s" "namespace"="ysb-ns4"
I0413 19:25:40.755536 1 machine_controller_phases.go:278] controllers/Machine "msg"="Infrastructure provider is not ready, requeuing" "machine"="ysb-ns4-c4-control-plane-fvf4s" "namespace"="ysb-ns4"
I0413 19:25:40.755583 1 machine_controller_noderef.go:42] controllers/Machine "msg"="Cannot reconcile Machine's Node, no valid ProviderID yet" "machine"="ysb-ns4-c4-control-plane-fvf4s" "namespace"="ysb-ns4"
I0413 19:25:52.161793 1 machinehealthcheck_controller.go:109] controllers/MachineHealthCheck "msg"="Reconciling" "machinehealthcheck"="ysb-ns4-c4-workers-8hnw4" "namespace"="ysb-ns4"
I0413 19:25:52.304255 1 machinehealthcheck_controller.go:109] controllers/MachineHealthCheck "msg"="Reconciling" "machinehealthcheck"="ysb-ns4-c4-workers-8hnw4" "namespace"="ysb-ns4"
I0413 19:26:00.856301 1 machinehealthcheck_controller.go:109] controllers/MachineHealthCheck "msg"="Reconciling" "machinehealthcheck"="ysb-ns4-c4-control-plane" "namespace"="ysb-ns4"
I0413 19:26:00.858921 1 machinehealthcheck_controller.go:387] controllers/MachineHealthCheck "msg"="Target has failed health check, marking for remediation" "message"="Node failed to report startup in 2h0m0s" "reason"="NodeStartupTimeout" "target"="ysb-ns4/ysb-ns4-c4-control-plane/ysb-ns4-c4-control-plane-fvf4s/"
I0413 19:26:10.784607 1 machine_controller_phases.go:278] controllers/Machine "msg"="Infrastructure provider is not ready, requeuing" "machine"="ysb-ns4-c4-control-plane-fvf4s" "namespace"="ysb-ns4"
I0413 19:26:10.784677 1 machine_controller_noderef.go:42] controllers/Machine "msg"="Cannot reconcile Machine's Node, no valid ProviderID yet" "machine"="ysb-ns4-c4-control-plane-fvf4s" "namespace"="ysb-ns4"
I0413 19:26:13.858604 1 machinehealthcheck_controller.go:109] controllers/MachineHealthCheck "msg"="Reconciling" "machinehealthcheck"="ysb-ns4-c4-control-plane" "namespace"="ysb-ns4"
I0413 19:26:13.860010 1 machinehealthcheck_controller.go:387] controllers/MachineHealthCheck "msg"="Target has failed health check, marking for remediation" "message"="Node failed to report startup in 2h0m0s" "reason"="NodeStartupTimeout" "target"="ysb-ns4/ysb-ns4-c4-control-plane/ysb-ns4-c4-control-plane-fvf4s/"
I0413 19:26:40.811901 1 machine_controller_phases.go:278] controllers/Machine "msg"="Infrastructure provider is not ready, requeuing" "machine"="ysb-ns4-c4-control-plane-fvf4s" "namespace"="ysb-ns4"
I0413 19:26:40.811976 1 machine_controller_noderef.go:42] controllers/Machine "msg"="Cannot reconcile Machine's Node, no valid ProviderID yet" "machine"="ysb-ns4-c4-control-plane-fvf4s" "namespace"="ysb-ns4"
And I don't have the machine YAML, but when I was looking at it over Zoom it had the false condition described in the OP.
I believe this is CAPI 0.3.12, but will need to check.
/lifecycle active
Detailed Description
Given the following scenario:
In this circumstance, after the NodeStartupTimeout, MHC will mark the 3rd CP node as unhealthy, but KCP will not remediate the machine because there are only 2 healthy etcd members. KCP sets MachineOwnerRemediatedCondition to false on the machine with the message "KCP can't remediate this machine because this could result in etcd loosing quorum".
However, in this case it is safe to delete the 3rd CP node because it never joined the etcd cluster, so removing it does not change etcd membership. The remediation logic could be enhanced to match etcd members with machines, and then decide to delete the 3rd machine.
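A rough sketch of the check proposed above, in Go. This is not the KCP implementation. The `machine` type and the `canSafelyRemediate` function are hypothetical names for illustration, and the sketch assumes etcd member names equal node names (as kubeadm sets them up). The idea is that a machine with no matching etcd member can always be deleted, and quorum is only evaluated for machines that are actual members.

```go
package main

import "fmt"

// machine is a hypothetical, simplified stand-in for a control plane Machine.
type machine struct {
	Name     string
	NodeName string // empty if the Machine never got a Node
}

// canSafelyRemediate reports whether deleting target keeps etcd quorum.
// etcdMembers lists the current etcd member names (assumed equal to node names);
// healthy marks which of those members are healthy.
func canSafelyRemediate(target machine, etcdMembers []string, healthy map[string]bool) bool {
	isMember := false
	for _, m := range etcdMembers {
		if target.NodeName != "" && m == target.NodeName {
			isMember = true
		}
	}
	if !isMember {
		// The machine never joined etcd, so deleting it leaves membership unchanged.
		return true
	}

	// The machine is a member: count healthy members left after its removal.
	remaining := len(etcdMembers) - 1
	healthyRemaining := 0
	for _, m := range etcdMembers {
		if m != target.NodeName && healthy[m] {
			healthyRemaining++
		}
	}
	quorum := remaining/2 + 1
	return healthyRemaining >= quorum
}

func main() {
	// Scenario from this issue: two healthy members, third machine has no Node.
	members := []string{"ysb-ns4-c4-control-plane-qv2pk", "ysb-ns4-c4-control-plane-srtxw"}
	healthy := map[string]bool{members[0]: true, members[1]: true}
	stuck := machine{Name: "ysb-ns4-c4-control-plane-fvf4s"}
	fmt.Println("remediate stuck machine:", canSafelyRemediate(stuck, members, healthy))
}
```

With this split, the quorum guard still protects real etcd members, while a machine stuck in Provisioning is no longer blocked by it.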
Anything else you would like to add:
Redacted logs to follow.
/kind bug