Quorum check occasionally fails when removing one master

We found that etcd also occasionally rolls out a new revision when we remove one master from a 3-master cluster. The etcd-operator logs show the following:

2024-04-07T04:06:26.674157868-04:00 stderr F I0407 08:06:26.674067       1 event.go:285] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-etcd-operator", Name:"etcd-operator", UID:"253380d4-7d65-496f-8214-ab89f7878550", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Warning' reason: 'MasterNodeRemoved' Observed removal of master node node3
2024-04-07T04:06:26.748289563-04:00 stderr F I0407 08:06:26.748226       1 quorumguardcleanupcontroller.go:133] 3/2 guard pods ready. Waiting until all new guard pods are ready
2024-04-07T04:06:26.762266273-04:00 stderr F I0407 08:06:26.761362       1 quorumguardcleanupcontroller.go:133] 3/2 guard pods ready. Waiting until all new guard pods are ready
2024-04-07T04:06:26.762266273-04:00 stderr F E0407 08:06:26.761749       1 base_controller.go:272] EtcdEndpointsController reconciliation failed: EtcdEndpointsController can't evaluate whether quorum is safe: CheckSafeToScaleCluster 3 nodes are required, but only 2 are available
2024-04-07T04:06:26.769422768-04:00 stderr F I0407 08:06:26.768748       1 status_controller.go:211] clusteroperator/etcd diff {"status":{"conditions":[{"lastTransitionTime":"2024-04-01T10:03:40Z","message":"The etcd backup controller is starting, and will decide if recent backups are available or if a backup is required","reason":"ControllerStarted","status":"Unknown","type":"RecentBackup"},{"lastTransitionTime":"2024-04-07T06:15:31Z","message":"NodeControllerDegraded: All master nodes are ready\nEtcdEndpointsDegraded: EtcdEndpointsController can't evaluate whether quorum is safe: CheckSafeToScaleCluster 3 nodes are required, but only 2 are available\nEtcdMembersDegraded: No unhealthy members found","reason":"AsExpected","status":"False","type":"Degraded"},{"lastTransitionTime":"2024-04-07T06:23:20Z","message":"NodeInstallerProgressing: 3 nodes are at revision 46\nEtcdMembersProgressing: No unstarted etcd members found","reason":"AsExpected","status":"False","type":"Progressing"},{"lastTransitionTime":"2024-04-01T10:06:28Z","message":"StaticPodsAvailable: 3 nodes are active; 3 nodes are at revision 46\nEtcdMembersAvailable: 3 members are available","reason":"AsExpected","status":"True","type":"Available"},{"lastTransitionTime":"2024-04-01T10:04:33Z","message":"All is well","reason":"AsExpected","status":"True","type":"Upgradeable"}]}}
2024-04-07T04:06:26.794964201-04:00 stderr F I0407 08:06:26.794326       1 event.go:285] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-etcd-operator", Name:"etcd-operator", UID:"253380d4-7d65-496f-8214-ab89f7878550", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'OperatorStatusChanged' Status for clusteroperator/etcd changed: Degraded message changed from "NodeControllerDegraded: All master nodes are ready\nEtcdMembersDegraded: No unhealthy members found" to "NodeControllerDegraded: All master nodes are ready\nEtcdEndpointsDegraded: EtcdEndpointsController can't evaluate whether quorum is safe: CheckSafeToScaleCluster 3 nodes are required, but only 2 are available\nEtcdMembersDegraded: No unhealthy members found"
2024-04-07T04:06:26.796446404-04:00 stderr F I0407 08:06:26.795796       1 quorumguardcleanupcontroller.go:133] 3/2 guard pods ready. Waiting until all new guard pods are ready
2024-04-07T04:06:26.914137858-04:00 stderr F E0407 08:06:26.909780       1 base_controller.go:272] EtcdEndpointsController reconciliation failed: EtcdEndpointsController can't evaluate whether quorum is safe: CheckSafeToScaleCluster 3 nodes are required, but only 2 are available
2024-04-07T04:06:27.021163775-04:00 stderr F E0407 08:06:27.005267       1 base_controller.go:272] EtcdEndpointsController reconciliation failed: EtcdEndpointsController can't evaluate whether quorum is safe: CheckSafeToScaleCluster 3 nodes are required, but only 2 are available
2024-04-07T04:06:27.021163775-04:00 stderr F I0407 08:06:27.005308       1 event.go:285] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-etcd-operator", Name:"etcd-operator", UID:"253380d4-7d65-496f-8214-ab89f7878550", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'SecretUpdated' Updated Secret/etcd-all-certs -n openshift-etcd because it changed
2024-04-07T04:06:27.021163775-04:00 stderr F I0407 08:06:27.019340       1 event.go:285] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-etcd-operator", Name:"etcd-operator", UID:"253380d4-7d65-496f-8214-ab89f7878550", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'RevisionTriggered' new revision 47 triggered by "secret/etcd-all-certs has changed"
2024-04-07T04:06:27.088562786-04:00 stderr F I0407 08:06:27.081163       1 quorumguardcleanupcontroller.go:133] 3/2 guard pods ready. Waiting until all new guard pods are ready
2024-04-07T04:06:27.157507402-04:00 stderr F E0407 08:06:27.149860       1 base_controller.go:272] EtcdCertSignerController reconciliation failed: EtcdCertSignerController can't evaluate whether quorum is safe: CheckSafeToScaleCluster 3 nodes are required, but only 2 are available
2024-04-07T04:06:27.237250673-04:00 stderr F E0407 08:06:27.237102       1 base_controller.go:272] EtcdEndpointsController reconciliation failed: EtcdEndpointsController can't evaluate whether quorum is safe: CheckSafeToScaleCluster 3 nodes are required, but only 2 are available
2024-04-07T04:06:27.241366194-04:00 stderr F E0407 08:06:27.240085       1 base_controller.go:272] EtcdCertSignerController reconciliation failed: EtcdCertSignerController can't evaluate whether quorum is safe: CheckSafeToScaleCluster 3 nodes are required, but only 2 are available
2024-04-07T04:06:27.278421294-04:00 stderr F E0407 08:06:27.278373       1 base_controller.go:272] EtcdCertSignerController reconciliation failed: EtcdCertSignerController can't evaluate whether quorum is safe: CheckSafeToScaleCluster 3 nodes are required, but only 2 are available

This happens because EtcdCertSignerController uses the masterNodelister, which sees that the number of master nodes has changed from 3 to 2: https://github.com/openshift/cluster-etcd-operator/blob/3d5483e1871ba147a692736c71645175a85769c4/pkg/operator/etcdcertsigner/etcdcertsignercontroller.go#L294-L297
IsSafeToUpdateRevision, however, uses operator.nodestatus to decide whether the set of nodes has changed, and because of a race condition operator.nodestatus may not have been updated yet. In the meantime the etcd cluster itself may still be healthy, for example if the node is still running and has only lost its master label. https://github.com/openshift/cluster-etcd-operator/blob/3d5483e1871ba147a692736c71645175a85769c4/pkg/operator/ceohelpers/bootstrap.go#L144-L148
We should move the IsSafeToUpdateRevision check so that it runs after cert generation. If EtcdCertSignerController has regenerated certs, something must have changed in the set of master nodes, and at that point IsSafeToUpdateRevision can evaluate the nodes the same way, by listing them, instead of relying on operator.nodestatus.
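
To make the mismatch concrete, here is a toy model in Go. It is not the operator's code: the view type and the checkQuorum, checkFromStatus and checkFromLister functions are all hypothetical names, and the node and member counts are assumptions chosen to mirror the error message in the logs above. It only illustrates how a check that takes its expected node count from a stale cached status can disagree with one that takes it from a fresh node listing.

```go
package main

import "fmt"

// view is a hypothetical snapshot of the two sources discussed above.
type view struct {
	listedMasters  []string // fresh result of listing master nodes
	statusNodes    []string // node names cached in the operator status (may lag behind)
	healthyMembers int      // healthy etcd members observed (assumed value)
}

// checkQuorum fails when fewer members are healthy than nodes are expected.
func checkQuorum(required, healthy int) error {
	if healthy < required {
		return fmt.Errorf("%d nodes are required, but only %d are available", required, healthy)
	}
	return nil
}

// checkFromStatus takes the expected node count from the cached status.
func checkFromStatus(v view) error {
	return checkQuorum(len(v.statusNodes), v.healthyMembers)
}

// checkFromLister takes the expected node count from the fresh listing.
func checkFromLister(v view) error {
	return checkQuorum(len(v.listedMasters), v.healthyMembers)
}

func main() {
	// node3 was just removed: the lister already sees 2 masters,
	// but the cached status still holds 3 entries.
	v := view{
		listedMasters:  []string{"node1", "node2"},
		statusNodes:    []string{"node1", "node2", "node3"},
		healthyMembers: 2,
	}
	fmt.Println("status-based check:", checkFromStatus(v))
	fmt.Println("lister-based check:", checkFromLister(v))
}
```

In this sketch the status-based check returns an error while the lister-based check passes, which is the kind of disagreement that using a single source for both cert generation and the safety check is meant to avoid.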

openshift / cluster-etcd-operator

Quorum check occasionally fails when removing one master #1240