kubernetes-sigs / cluster-api

Home for Cluster API, a subproject of sig-cluster-lifecycle
https://cluster-api.sigs.k8s.io
Apache License 2.0

KCP remediation when desired replicas are 3 and 3rd node fails. #4472

Closed randomvariable closed 3 years ago

randomvariable commented 3 years ago

Detailed Description

Given the following scenario:

In this circumstance, after the NodeStartupTimeout, MHC marks the 3rd CP node as unhealthy, but KCP does not remediate the machine because there are only 2 healthy etcd members. KCP sets the MachineOwnerRemediatedCondition to false on the machine, with the message "KCP can't remediate this machine because this could result in etcd loosing quorum".

However, in this case it is safe to delete the 3rd CP node because it never joined the etcd cluster. The remediation logic could be enhanced to match etcd members against machines, and then decide to delete the 3rd machine.
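To make the proposal concrete, here is a minimal sketch of the decision described above. It is not KCP's actual code: the `Machine` type, `quorum`, `can_remediate`, and the node names are all hypothetical, and it assumes the set of current etcd member names is already known.

```python
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass
class Machine:
    name: str
    node_name: Optional[str]  # None when the machine never registered a node
    healthy: bool


def quorum(member_count: int) -> int:
    """Votes an etcd cluster of the given size needs to keep quorum."""
    return member_count // 2 + 1


def can_remediate(target: Machine, machines: List[Machine], etcd_members: Set[str]) -> bool:
    """Return True if deleting `target` cannot cost the etcd cluster its quorum."""
    # A machine that never joined etcd holds no vote, so deleting it
    # leaves the etcd membership untouched.
    if target.node_name is None or target.node_name not in etcd_members:
        return True
    # Otherwise the target is a member: check that the healthy members left
    # behind still form a quorum of the reduced cluster.
    healthy_remaining = sum(
        1
        for m in machines
        if m is not target and m.healthy and m.node_name in etcd_members
    )
    return healthy_remaining >= quorum(len(etcd_members) - 1)


# The scenario from this issue: two healthy members, third machine never joined.
# Node names are made up for illustration.
qv2pk = Machine("ysb-ns4-c4-control-plane-qv2pk", "node-qv2pk", True)
srtxw = Machine("ysb-ns4-c4-control-plane-srtxw", "node-srtxw", True)
fvf4s = Machine("ysb-ns4-c4-control-plane-fvf4s", None, False)
members = {"node-qv2pk", "node-srtxw"}
safe_to_delete = can_remediate(fvf4s, [qv2pk, srtxw, fvf4s], members)
```

With these inputs `safe_to_delete` is `True`, because the stuck machine is not an etcd member and removing it changes nothing about quorum.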

Anything else you would like to add:

Redacted logs to follow.


/kind bug

vincepri commented 3 years ago

/assign @fabriziopandini
/milestone v0.4

fabriziopandini commented 3 years ago

@randomvariable is it possible to have KCP logs for this use case? Also, which version are you running?

maelk commented 3 years ago

I think this relates to https://github.com/kubernetes-sigs/cluster-api/issues/4365 . We should be able to go further with the remediation if the machine is not part of the etcd cluster.

randomvariable commented 3 years ago

Logs:

State of vSphereMachines:

NAMESPACE   NAME                                        PROVIDERID                                       PHASE
ysb-ns4     ysb-ns4-c4-control-plane-fvf4s                                                               Provisioning
ysb-ns4     ysb-ns4-c4-control-plane-qv2pk              vsphere://423ae9c3-4181-966b-6ee6-ada51ef4e020   Running
ysb-ns4     ysb-ns4-c4-control-plane-srtxw              vsphere://423a7ce1-124a-a9cd-245a-c830e2c19b9e   Running
ysb-ns4     ysb-ns4-c4-workers-8hnw4-5889fb6fb6-hmk6x   vsphere://423a36c3-c147-6b7d-6a96-686b426bc0a1   Running
ysb-ns4     ysb-ns4-c4-workers-8hnw4-5889fb6fb6-sdxrh   vsphere://423ad140-23cf-2384-986c-3c672b10789a   Running
ysb-ns4     ysb-ns4-c4-workers-8hnw4-5889fb6fb6-tjszm   vsphere://423af22f-124e-11ec-b358-b5bcfbd87c6f   Running

fvf4s is the machine of interest, stuck in Provisioning. It's inaccessible via VMware vCenter due to a networking glitch that occurred at the moment it powered on. GuestInfo reports no IP addresses, and the machine is sitting at 47 MHz with 200 MB memory usage, compared to the other CP nodes running at 2.1 GHz.

kubectl describe output for the KCP:

Name:         ysb-ns4-c4-control-plane
Namespace:    ysb-ns4
Labels:       cluster.x-k8s.io/cluster-name=ysb-ns4-c4
Annotations:  controlplane.cluster.x-k8s.io/skip-coredns: 
              controlplane.cluster.x-k8s.io/skip-kube-proxy: 
API Version:  controlplane.cluster.x-k8s.io/v1alpha3
Kind:         KubeadmControlPlane
Metadata:
  Creation Timestamp:  2021-04-08T08:39:35Z
  Finalizers:
    kubeadm.controlplane.cluster.x-k8s.io
  Generation:  3
  Managed Fields:
  ...
    Manager:    manager
    Operation:  Update
    Time:       2021-04-13T18:46:10Z
  Owner References:
    API Version:           cluster.x-k8s.io/v1alpha3
    Block Owner Deletion:  true
    Controller:            true
    Kind:                  Cluster
    Name:                  ysb-ns4-c4
    UID:                   a3a517f8-61c7-4e5c-9562-8957300476b9
  Resource Version:        5590968
  Self Link:               /apis/controlplane.cluster.x-k8s.io/v1alpha3/namespaces/ysb-ns4/kubeadmcontrolplanes/ysb-ns4-c4-control-plane
  UID:                     35d42989-0636-47e9-ab08-57d69e76cf75
Spec:
  Infrastructure Template:
    API Version:  infrastructure.cluster.vmware.com/v1alpha3
    Kind:         WCPMachineTemplate
    Name:         ysb-ns4-c4-control-plane-w756n
    Namespace:    ysb-ns4
  Kubeadm Config Spec:
    Cluster Configuration:
      API Server:
 ...
  Replicas:     3
  Version:      v1.19.7+vmware.1
Status:
  Conditions:
    Last Transition Time:  2021-04-13T18:46:12Z
    Message:               Scaling up control plane to 5 replicas (actual 3)
    Reason:                ScalingUp
    Severity:              Warning
    Status:                False
    Type:                  Ready
    Last Transition Time:  2021-04-08T08:47:11Z
    Status:                True
    Type:                  Available
    Last Transition Time:  2021-04-08T08:39:43Z
    Status:                True
    Type:                  CertificatesAvailable
    Last Transition Time:  2021-04-13T19:15:58Z
    Status:                True
    Type:                  ControlPlaneComponentsHealthy
    Last Transition Time:  2021-04-13T16:24:18Z
    Status:                True
    Type:                  EtcdClusterHealthyCondition
    Last Transition Time:  2021-04-13T18:24:02Z
    Message:               Node failed to report startup in 2h0m0s
    Reason:                NodeStartupTimeout @ Machine/ysb-ns4-c4-control-plane-fvf4s
    Severity:              Warning
    Status:                False
    Type:                  MachinesReady
    Last Transition Time:  2021-04-13T18:46:12Z
    Message:               Scaling up control plane to 5 replicas (actual 3)
    Reason:                ScalingUp
    Severity:              Warning
    Status:                False
    Type:                  Resized
  Initialized:             true
  Observed Generation:     1
  Ready:                   true
  Ready Replicas:          2
  Replicas:                3
  Selector:                cluster.x-k8s.io/cluster-name=ysb-ns4-c4,cluster.x-k8s.io/control-plane
  Unavailable Replicas:    1
  Updated Replicas:        3
Events:
  Type     Reason                 Age   From                              Message
  ----     ------                 ----  ----                              -------
  Warning  ControlPlaneUnhealthy  33m   kubeadm-control-plane-controller  Waiting for control plane to pass preflight checks to continue reconciliation: [machine ysb-ns4-c4-control-plane-fvf4s reports APIServerPodHealthy condition is unknown (Failed to get the node which is hosting this component), machine ysb-ns4-c4-control-plane-fvf4s reports ControllerManagerPodHealthy condition is unknown (Failed to get the node which is hosting this component), machine ysb-ns4-c4-control-plane-fvf4s reports SchedulerPodHealthy condition is unknown (Failed to get the node which is hosting this component), machine ysb-ns4-c4-control-plane-fvf4s reports EtcdPodHealthy condition is unknown (Failed to get the node which is hosting this component), machine ysb-ns4-c4-control-plane-fvf4s reports EtcdMemberHealthy condition is unknown (Failed to get the node which is hosting the etcd member)]

Relevant CAPI logs:

I0413 19:22:22.837171       1 machinehealthcheck_controller.go:109] controllers/MachineHealthCheck "msg"="Reconciling" "machinehealthcheck"="ysb-ns4-c4-control-plane" "namespace"="ysb-ns4" 
I0413 19:22:22.837286       1 machinehealthcheck_controller.go:109] controllers/MachineHealthCheck "msg"="Reconciling" "machinehealthcheck"="ysb-ns4-c4-workers-8hnw4" "namespace"="ysb-ns4" 
I0413 19:22:40.170856       1 machinehealthcheck_controller.go:387] controllers/MachineHealthCheck "msg"="Target has failed health check, marking for remediation"  "message"="Node failed to report startup in 2h0m0s" "reason"="NodeStartupTimeout" "target"="ysb-ns4/ysb-ns4-c4-control-plane/ysb-ns4-c4-control-plane-fvf4s/"
I0413 19:22:40.590411       1 machine_controller_phases.go:278] controllers/Machine "msg"="Infrastructure provider is not ready, requeuing" "machine"="ysb-ns4-c4-control-plane-fvf4s" "namespace"="ysb-ns4" 
I0413 19:22:40.590491       1 machine_controller_noderef.go:42] controllers/Machine "msg"="Cannot reconcile Machine's Node, no valid ProviderID yet" "machine"="ysb-ns4-c4-control-plane-fvf4s" "namespace"="ysb-ns4" 
I0413 19:22:40.689253       1 machinehealthcheck_controller.go:109] controllers/MachineHealthCheck "msg"="Reconciling" "machinehealthcheck"="ysb-ns4-c4-control-plane" "namespace"="ysb-ns4" 
I0413 19:22:41.040122       1 machinehealthcheck_controller.go:109] controllers/MachineHealthCheck "msg"="Reconciling" "machinehealthcheck"="ysb-ns4-c4-workers-8hnw4" "namespace"="ysb-ns4" 
I0413 19:22:41.141253       1 machinehealthcheck_controller.go:387] controllers/MachineHealthCheck "msg"="Target has failed health check, marking for remediation"  "message"="Node failed to report startup in 2h0m0s" "reason"="NodeStartupTimeout" "target"="ysb-ns4/ysb-ns4-c4-control-plane/ysb-ns4-c4-control-plane-fvf4s/"
I0413 19:22:41.630961       1 machine_controller_phases.go:278] controllers/Machine "msg"="Infrastructure provider is not ready, requeuing" "machine"="ysb-ns4-c4-control-plane-fvf4s" "namespace"="ysb-ns4" 
I0413 19:22:41.631026       1 machine_controller_noderef.go:42] controllers/Machine "msg"="Cannot reconcile Machine's Node, no valid ProviderID yet" "machine"="ysb-ns4-c4-control-plane-fvf4s" "namespace"="ysb-ns4" 
I0413 19:22:41.953713       1 machinehealthcheck_controller.go:109] controllers/MachineHealthCheck "msg"="Reconciling" "machinehealthcheck"="ysb-ns4-c4-control-plane" "namespace"="ysb-ns4" 
I0413 19:22:42.023488       1 machinehealthcheck_controller.go:387] controllers/MachineHealthCheck "msg"="Target has failed health check, marking for remediation"  "message"="Node failed to report startup in 2h0m0s" "reason"="NodeStartupTimeout" "target"="ysb-ns4/ysb-ns4-c4-control-plane/ysb-ns4-c4-control-plane-fvf4s/"
I0413 19:22:42.217046       1 machinehealthcheck_controller.go:109] controllers/MachineHealthCheck "msg"="Reconciling" "machinehealthcheck"="ysb-ns4-c4-control-plane" "namespace"="ysb-ns4" 
I0413 19:22:42.241332       1 machinehealthcheck_controller.go:387] controllers/MachineHealthCheck "msg"="Target has failed health check, marking for remediation"  "message"="Node failed to report startup in 2h0m0s" "reason"="NodeStartupTimeout" "target"="ysb-ns4/ysb-ns4-c4-control-plane/ysb-ns4-c4-control-plane-fvf4s/"
I0413 19:22:45.904532       1 machinehealthcheck_controller.go:109] controllers/MachineHealthCheck "msg"="Reconciling" "machinehealthcheck"="ysb-ns4-c4-control-plane" "namespace"="ysb-ns4" 
I0413 19:22:45.906450       1 machinehealthcheck_controller.go:387] controllers/MachineHealthCheck "msg"="Target has failed health check, marking for remediation"  "message"="Node failed to report startup in 2h0m0s" "reason"="NodeStartupTimeout" "target"="ysb-ns4/ysb-ns4-c4-control-plane/ysb-ns4-c4-control-plane-fvf4s/"
I0413 19:22:46.014713       1 machine_controller_phases.go:278] controllers/Machine "msg"="Infrastructure provider is not ready, requeuing" "machine"="ysb-ns4-c4-control-plane-fvf4s" "namespace"="ysb-ns4" 
I0413 19:22:46.015680       1 machine_controller_noderef.go:42] controllers/Machine "msg"="Cannot reconcile Machine's Node, no valid ProviderID yet" "machine"="ysb-ns4-c4-control-plane-fvf4s" "namespace"="ysb-ns4" 
I0413 19:22:46.085143       1 machinehealthcheck_controller.go:109] controllers/MachineHealthCheck "msg"="Reconciling" "machinehealthcheck"="ysb-ns4-c4-control-plane" "namespace"="ysb-ns4" 
I0413 19:22:46.087401       1 machinehealthcheck_controller.go:387] controllers/MachineHealthCheck "msg"="Target has failed health check, marking for remediation"  "message"="Node failed to report startup in 2h0m0s" "reason"="NodeStartupTimeout" "target"="ysb-ns4/ysb-ns4-c4-control-plane/ysb-ns4-c4-control-plane-fvf4s/"
I0413 19:22:46.295597       1 machine_controller_phases.go:278] controllers/Machine "msg"="Infrastructure provider is not ready, requeuing" "machine"="ysb-ns4-c4-control-plane-fvf4s" "namespace"="ysb-ns4" 
I0413 19:22:46.295659       1 machine_controller_noderef.go:42] controllers/Machine "msg"="Cannot reconcile Machine's Node, no valid ProviderID yet" "machine"="ysb-ns4-c4-control-plane-fvf4s" "namespace"="ysb-ns4" 
I0413 19:22:53.393989       1 machinehealthcheck_controller.go:109] controllers/MachineHealthCheck "msg"="Reconciling" "machinehealthcheck"="ysb-ns4-c4-control-plane" "namespace"="ysb-ns4" 
I0413 19:22:53.397133       1 machinehealthcheck_controller.go:387] controllers/MachineHealthCheck "msg"="Target has failed health check, marking for remediation"  "message"="Node failed to report startup in 2h0m0s" "reason"="NodeStartupTimeout" "target"="ysb-ns4/ysb-ns4-c4-control-plane/ysb-ns4-c4-control-plane-fvf4s/"
I0413 19:22:53.613260       1 machinehealthcheck_controller.go:109] controllers/MachineHealthCheck "msg"="Reconciling" "machinehealthcheck"="ysb-ns4-c4-control-plane" "namespace"="ysb-ns4" 
I0413 19:22:53.614590       1 machinehealthcheck_controller.go:387] controllers/MachineHealthCheck "msg"="Target has failed health check, marking for remediation"  "message"="Node failed to report startup in 2h0m0s" "reason"="NodeStartupTimeout" "target"="ysb-ns4/ysb-ns4-c4-control-plane/ysb-ns4-c4-control-plane-fvf4s/"
I0413 19:23:10.643572       1 machine_controller_phases.go:278] controllers/Machine "msg"="Infrastructure provider is not ready, requeuing" "machine"="ysb-ns4-c4-control-plane-fvf4s" "namespace"="ysb-ns4" 
I0413 19:23:10.643629       1 machine_controller_noderef.go:42] controllers/Machine "msg"="Cannot reconcile Machine's Node, no valid ProviderID yet" "machine"="ysb-ns4-c4-control-plane-fvf4s" "namespace"="ysb-ns4" 
I0413 19:23:40.666989       1 machine_controller_phases.go:278] controllers/Machine "msg"="Infrastructure provider is not ready, requeuing" "machine"="ysb-ns4-c4-control-plane-fvf4s" "namespace"="ysb-ns4" 
I0413 19:23:40.667155       1 machine_controller_noderef.go:42] controllers/Machine "msg"="Cannot reconcile Machine's Node, no valid ProviderID yet" "machine"="ysb-ns4-c4-control-plane-fvf4s" "namespace"="ysb-ns4" 
I0413 19:24:10.695606       1 machine_controller_phases.go:278] controllers/Machine "msg"="Infrastructure provider is not ready, requeuing" "machine"="ysb-ns4-c4-control-plane-fvf4s" "namespace"="ysb-ns4" 
I0413 19:24:10.695729       1 machine_controller_noderef.go:42] controllers/Machine "msg"="Cannot reconcile Machine's Node, no valid ProviderID yet" "machine"="ysb-ns4-c4-control-plane-fvf4s" "namespace"="ysb-ns4" 
I0413 19:24:40.540926       1 machinehealthcheck_controller.go:109] controllers/MachineHealthCheck "msg"="Reconciling" "machinehealthcheck"="ysb-ns4-c4-workers-8hnw4" "namespace"="ysb-ns4" 
I0413 19:24:40.713318       1 machine_controller_phases.go:278] controllers/Machine "msg"="Infrastructure provider is not ready, requeuing" "machine"="ysb-ns4-c4-control-plane-fvf4s" "namespace"="ysb-ns4" 
I0413 19:24:40.713380       1 machine_controller_noderef.go:42] controllers/Machine "msg"="Cannot reconcile Machine's Node, no valid ProviderID yet" "machine"="ysb-ns4-c4-control-plane-fvf4s" "namespace"="ysb-ns4" 
I0413 19:25:10.729967       1 machine_controller_phases.go:278] controllers/Machine "msg"="Infrastructure provider is not ready, requeuing" "machine"="ysb-ns4-c4-control-plane-fvf4s" "namespace"="ysb-ns4" 
I0413 19:25:10.730034       1 machine_controller_noderef.go:42] controllers/Machine "msg"="Cannot reconcile Machine's Node, no valid ProviderID yet" "machine"="ysb-ns4-c4-control-plane-fvf4s" "namespace"="ysb-ns4" 
I0413 19:25:40.755536       1 machine_controller_phases.go:278] controllers/Machine "msg"="Infrastructure provider is not ready, requeuing" "machine"="ysb-ns4-c4-control-plane-fvf4s" "namespace"="ysb-ns4" 
I0413 19:25:40.755583       1 machine_controller_noderef.go:42] controllers/Machine "msg"="Cannot reconcile Machine's Node, no valid ProviderID yet" "machine"="ysb-ns4-c4-control-plane-fvf4s" "namespace"="ysb-ns4" 
I0413 19:25:52.161793       1 machinehealthcheck_controller.go:109] controllers/MachineHealthCheck "msg"="Reconciling" "machinehealthcheck"="ysb-ns4-c4-workers-8hnw4" "namespace"="ysb-ns4" 
I0413 19:25:52.304255       1 machinehealthcheck_controller.go:109] controllers/MachineHealthCheck "msg"="Reconciling" "machinehealthcheck"="ysb-ns4-c4-workers-8hnw4" "namespace"="ysb-ns4" 
I0413 19:26:00.856301       1 machinehealthcheck_controller.go:109] controllers/MachineHealthCheck "msg"="Reconciling" "machinehealthcheck"="ysb-ns4-c4-control-plane" "namespace"="ysb-ns4" 
I0413 19:26:00.858921       1 machinehealthcheck_controller.go:387] controllers/MachineHealthCheck "msg"="Target has failed health check, marking for remediation"  "message"="Node failed to report startup in 2h0m0s" "reason"="NodeStartupTimeout" "target"="ysb-ns4/ysb-ns4-c4-control-plane/ysb-ns4-c4-control-plane-fvf4s/"
I0413 19:26:10.784607       1 machine_controller_phases.go:278] controllers/Machine "msg"="Infrastructure provider is not ready, requeuing" "machine"="ysb-ns4-c4-control-plane-fvf4s" "namespace"="ysb-ns4" 
I0413 19:26:10.784677       1 machine_controller_noderef.go:42] controllers/Machine "msg"="Cannot reconcile Machine's Node, no valid ProviderID yet" "machine"="ysb-ns4-c4-control-plane-fvf4s" "namespace"="ysb-ns4" 
I0413 19:26:13.858604       1 machinehealthcheck_controller.go:109] controllers/MachineHealthCheck "msg"="Reconciling" "machinehealthcheck"="ysb-ns4-c4-control-plane" "namespace"="ysb-ns4" 
I0413 19:26:13.860010       1 machinehealthcheck_controller.go:387] controllers/MachineHealthCheck "msg"="Target has failed health check, marking for remediation"  "message"="Node failed to report startup in 2h0m0s" "reason"="NodeStartupTimeout" "target"="ysb-ns4/ysb-ns4-c4-control-plane/ysb-ns4-c4-control-plane-fvf4s/"
I0413 19:26:40.811901       1 machine_controller_phases.go:278] controllers/Machine "msg"="Infrastructure provider is not ready, requeuing" "machine"="ysb-ns4-c4-control-plane-fvf4s" "namespace"="ysb-ns4" 
I0413 19:26:40.811976       1 machine_controller_noderef.go:42] controllers/Machine "msg"="Cannot reconcile Machine's Node, no valid ProviderID yet" "machine"="ysb-ns4-c4-control-plane-fvf4s" "namespace"="ysb-ns4" 

And I don't have the machine YAML, but when I was looking at it over Zoom it had the false condition described in the OP.
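For anyone triaging similar logs, a small helper can tally which messages dominate. This is only a sketch, assuming the `"key"="value"` layout seen in the log lines above; `tally_messages` is a hypothetical name, not part of CAPI.

```python
import re
from collections import Counter

# Matches "key"="value" pairs as they appear in the log lines above.
PAIR = re.compile(r'"([^"]+)"="([^"]*)"')


def tally_messages(lines):
    """Count how often each "msg" value occurs across the given log lines."""
    counts = Counter()
    for line in lines:
        fields = dict(PAIR.findall(line))
        if "msg" in fields:
            counts[fields["msg"]] += 1
    return counts


# Two lines copied from the logs above.
sample = [
    'I0413 19:22:22.837171       1 machinehealthcheck_controller.go:109] controllers/MachineHealthCheck "msg"="Reconciling" "machinehealthcheck"="ysb-ns4-c4-control-plane" "namespace"="ysb-ns4" ',
    'I0413 19:22:40.590491       1 machine_controller_noderef.go:42] controllers/Machine "msg"="Cannot reconcile Machine\'s Node, no valid ProviderID yet" "machine"="ysb-ns4-c4-control-plane-fvf4s" "namespace"="ysb-ns4" ',
]
summary = tally_messages(sample)
```

Feeding it the full log excerpt would show the MHC marking the machine for remediation over and over while the Machine controller keeps requeuing on the missing ProviderID.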

randomvariable commented 3 years ago

I believe this is CAPI 0.3.12, but will need to check.

fabriziopandini commented 3 years ago

/lifecycle active