openshift / cluster-etcd-operator

Operator to manage the lifecycle of the etcd members of an OpenShift cluster
Apache License 2.0

Quorum check occasionally fails when removing one master #1240

Closed lance5890 closed 6 months ago

lance5890 commented 7 months ago
  1. We found that etcd occasionally also rolls out a new revision when we remove one master from a 3-master cluster. The logs are as follows:

    2024-04-07T04:06:26.674157868-04:00 stderr F I0407 08:06:26.674067       1 event.go:285] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-etcd-operator", Name:"etcd-operator", UID:"253380d4-7d65-496f-8214-ab89f7878550", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Warning' reason: 'MasterNodeRemoved' Observed removal of master node node3
    2024-04-07T04:06:26.748289563-04:00 stderr F I0407 08:06:26.748226       1 quorumguardcleanupcontroller.go:133] 3/2 guard pods ready. Waiting until all new guard pods are ready
    2024-04-07T04:06:26.762266273-04:00 stderr F I0407 08:06:26.761362       1 quorumguardcleanupcontroller.go:133] 3/2 guard pods ready. Waiting until all new guard pods are ready
    2024-04-07T04:06:26.762266273-04:00 stderr F E0407 08:06:26.761749       1 base_controller.go:272] EtcdEndpointsController reconciliation failed: EtcdEndpointsController can't evaluate whether quorum is safe: CheckSafeToScaleCluster 3 nodes are required, but only 2 are available
    2024-04-07T04:06:26.769422768-04:00 stderr F I0407 08:06:26.768748       1 status_controller.go:211] clusteroperator/etcd diff {"status":{"conditions":[{"lastTransitionTime":"2024-04-01T10:03:40Z","message":"The etcd backup controller is starting, and will decide if recent backups are available or if a backup is required","reason":"ControllerStarted","status":"Unknown","type":"RecentBackup"},{"lastTransitionTime":"2024-04-07T06:15:31Z","message":"NodeControllerDegraded: All master nodes are ready\nEtcdEndpointsDegraded: EtcdEndpointsController can't evaluate whether quorum is safe: CheckSafeToScaleCluster 3 nodes are required, but only 2 are available\nEtcdMembersDegraded: No unhealthy members found","reason":"AsExpected","status":"False","type":"Degraded"},{"lastTransitionTime":"2024-04-07T06:23:20Z","message":"NodeInstallerProgressing: 3 nodes are at revision 46\nEtcdMembersProgressing: No unstarted etcd members found","reason":"AsExpected","status":"False","type":"Progressing"},{"lastTransitionTime":"2024-04-01T10:06:28Z","message":"StaticPodsAvailable: 3 nodes are active; 3 nodes are at revision 46\nEtcdMembersAvailable: 3 members are available","reason":"AsExpected","status":"True","type":"Available"},{"lastTransitionTime":"2024-04-01T10:04:33Z","message":"All is well","reason":"AsExpected","status":"True","type":"Upgradeable"}]}}
    2024-04-07T04:06:26.794964201-04:00 stderr F I0407 08:06:26.794326       1 event.go:285] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-etcd-operator", Name:"etcd-operator", UID:"253380d4-7d65-496f-8214-ab89f7878550", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'OperatorStatusChanged' Status for clusteroperator/etcd changed: Degraded message changed from "NodeControllerDegraded: All master nodes are ready\nEtcdMembersDegraded: No unhealthy members found" to "NodeControllerDegraded: All master nodes are ready\nEtcdEndpointsDegraded: EtcdEndpointsController can't evaluate whether quorum is safe: CheckSafeToScaleCluster 3 nodes are required, but only 2 are available\nEtcdMembersDegraded: No unhealthy members found"
    2024-04-07T04:06:26.796446404-04:00 stderr F I0407 08:06:26.795796       1 quorumguardcleanupcontroller.go:133] 3/2 guard pods ready. Waiting until all new guard pods are ready
    2024-04-07T04:06:26.914137858-04:00 stderr F E0407 08:06:26.909780       1 base_controller.go:272] EtcdEndpointsController reconciliation failed: EtcdEndpointsController can't evaluate whether quorum is safe: CheckSafeToScaleCluster 3 nodes are required, but only 2 are available
    2024-04-07T04:06:27.021163775-04:00 stderr F E0407 08:06:27.005267       1 base_controller.go:272] EtcdEndpointsController reconciliation failed: EtcdEndpointsController can't evaluate whether quorum is safe: CheckSafeToScaleCluster 3 nodes are required, but only 2 are available
    2024-04-07T04:06:27.021163775-04:00 stderr F I0407 08:06:27.005308       1 event.go:285] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-etcd-operator", Name:"etcd-operator", UID:"253380d4-7d65-496f-8214-ab89f7878550", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'SecretUpdated' Updated Secret/etcd-all-certs -n openshift-etcd because it changed
    2024-04-07T04:06:27.021163775-04:00 stderr F I0407 08:06:27.019340       1 event.go:285] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-etcd-operator", Name:"etcd-operator", UID:"253380d4-7d65-496f-8214-ab89f7878550", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'RevisionTriggered' new revision 47 triggered by "secret/etcd-all-certs has changed"
    2024-04-07T04:06:27.088562786-04:00 stderr F I0407 08:06:27.081163       1 quorumguardcleanupcontroller.go:133] 3/2 guard pods ready. Waiting until all new guard pods are ready
    2024-04-07T04:06:27.157507402-04:00 stderr F E0407 08:06:27.149860       1 base_controller.go:272] EtcdCertSignerController reconciliation failed: EtcdCertSignerController can't evaluate whether quorum is safe: CheckSafeToScaleCluster 3 nodes are required, but only 2 are available
    2024-04-07T04:06:27.237250673-04:00 stderr F E0407 08:06:27.237102       1 base_controller.go:272] EtcdEndpointsController reconciliation failed: EtcdEndpointsController can't evaluate whether quorum is safe: CheckSafeToScaleCluster 3 nodes are required, but only 2 are available
    2024-04-07T04:06:27.241366194-04:00 stderr F E0407 08:06:27.240085       1 base_controller.go:272] EtcdCertSignerController reconciliation failed: EtcdCertSignerController can't evaluate whether quorum is safe: CheckSafeToScaleCluster 3 nodes are required, but only 2 are available
    2024-04-07T04:06:27.278421294-04:00 stderr F E0407 08:06:27.278373       1 base_controller.go:272] EtcdCertSignerController reconciliation failed: EtcdCertSignerController can't evaluate whether quorum is safe: CheckSafeToScaleCluster 3 nodes are required, but only 2 are available
  2. This happens because the EtcdCertSignerController uses the masterNodelister, which sees that the number of master nodes has changed from 3 to 2: https://github.com/openshift/cluster-etcd-operator/blob/3d5483e1871ba147a692736c71645175a85769c4/pkg/operator/etcdcertsigner/etcdcertsignercontroller.go#L294-L297

  3. However, IsSafeToUpdateRevision uses operator.nodestatus to check whether the nodes have changed, and because of a race condition operator.nodestatus may not have been updated yet. In the meantime, the etcd cluster may still be healthy if the node is still running and has only lost its master label. https://github.com/openshift/cluster-etcd-operator/blob/3d5483e1871ba147a692736c71645175a85769c4/pkg/operator/ceohelpers/bootstrap.go#L144-L148

  4. We should move the IsSafeToUpdateRevision check to after the certificate generation. When the EtcdCertSignerController has regenerated certificates, something must have changed in the set of master nodes, so at that point we can run IsSafeToUpdateRevision against the same node listing rather than against operator.nodestatus.
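
The mismatch described in points 2 to 4 can be illustrated with a minimal, self-contained Go sketch. It does not use the operator's code: `requiredMasters`, `safeToUpdate`, and the two slices are hypothetical stand-ins, and the check is reduced to comparing a node count against the required number. It only shows how a gate fed by a stale status can pass while the same gate fed by a fresh node listing fails.

```go
package main

import "fmt"

// requiredMasters is an assumed value for a 3-master cluster (hypothetical).
const requiredMasters = 3

// safeToUpdate is a hypothetical stand-in for the quorum gate. It only
// compares the supplied node count against the required number of masters.
func safeToUpdate(nodeCount int) error {
	if nodeCount < requiredMasters {
		return fmt.Errorf("%d nodes are required, but only %d are available",
			requiredMasters, nodeCount)
	}
	return nil
}

func main() {
	// What a node lister would return right after node3 loses its master label.
	listedMasters := []string{"node1", "node2"}
	// What a status that has not been reconciled yet might still record.
	staleStatuses := []string{"node1", "node2", "node3"}

	// Gate fed by the stale status: passes even though a master was removed.
	fmt.Println("status-based check:", safeToUpdate(len(staleStatuses)))
	// Gate fed by the same listing the cert signer used: blocks the update.
	fmt.Println("lister-based check:", safeToUpdate(len(listedMasters)))
}
```

Feeding both the certificate generation and the safety check from the same listing removes the window in which the two sources disagree, which is the point of the proposal in item 4.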