quorum check when update secret

lance5890 commented 7 months ago

We found that etcd also occasionally rolls out a new revision when we remove one master from a 3-master cluster. The logs are as follows:

2024-04-07T04:06:26.674157868-04:00 stderr F I0407 08:06:26.674067       1 event.go:285] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-etcd-operator", Name:"etcd-operator", UID:"253380d4-7d65-496f-8214-ab89f7878550", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Warning' reason: 'MasterNodeRemoved' Observed removal of master node node3
2024-04-07T04:06:26.748289563-04:00 stderr F I0407 08:06:26.748226       1 quorumguardcleanupcontroller.go:133] 3/2 guard pods ready. Waiting until all new guard pods are ready
2024-04-07T04:06:26.762266273-04:00 stderr F I0407 08:06:26.761362       1 quorumguardcleanupcontroller.go:133] 3/2 guard pods ready. Waiting until all new guard pods are ready
2024-04-07T04:06:26.762266273-04:00 stderr F E0407 08:06:26.761749       1 base_controller.go:272] EtcdEndpointsController reconciliation failed: EtcdEndpointsController can't evaluate whether quorum is safe: CheckSafeToScaleCluster 3 nodes are required, but only 2 are available
2024-04-07T04:06:26.769422768-04:00 stderr F I0407 08:06:26.768748       1 status_controller.go:211] clusteroperator/etcd diff {"status":{"conditions":[{"lastTransitionTime":"2024-04-01T10:03:40Z","message":"The etcd backup controller is starting, and will decide if recent backups are available or if a backup is required","reason":"ControllerStarted","status":"Unknown","type":"RecentBackup"},{"lastTransitionTime":"2024-04-07T06:15:31Z","message":"NodeControllerDegraded: All master nodes are ready\nEtcdEndpointsDegraded: EtcdEndpointsController can't evaluate whether quorum is safe: CheckSafeToScaleCluster 3 nodes are required, but only 2 are available\nEtcdMembersDegraded: No unhealthy members found","reason":"AsExpected","status":"False","type":"Degraded"},{"lastTransitionTime":"2024-04-07T06:23:20Z","message":"NodeInstallerProgressing: 3 nodes are at revision 46\nEtcdMembersProgressing: No unstarted etcd members found","reason":"AsExpected","status":"False","type":"Progressing"},{"lastTransitionTime":"2024-04-01T10:06:28Z","message":"StaticPodsAvailable: 3 nodes are active; 3 nodes are at revision 46\nEtcdMembersAvailable: 3 members are available","reason":"AsExpected","status":"True","type":"Available"},{"lastTransitionTime":"2024-04-01T10:04:33Z","message":"All is well","reason":"AsExpected","status":"True","type":"Upgradeable"}]}}
2024-04-07T04:06:26.794964201-04:00 stderr F I0407 08:06:26.794326       1 event.go:285] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-etcd-operator", Name:"etcd-operator", UID:"253380d4-7d65-496f-8214-ab89f7878550", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'OperatorStatusChanged' Status for clusteroperator/etcd changed: Degraded message changed from "NodeControllerDegraded: All master nodes are ready\nEtcdMembersDegraded: No unhealthy members found" to "NodeControllerDegraded: All master nodes are ready\nEtcdEndpointsDegraded: EtcdEndpointsController can't evaluate whether quorum is safe: CheckSafeToScaleCluster 3 nodes are required, but only 2 are available\nEtcdMembersDegraded: No unhealthy members found"
2024-04-07T04:06:26.796446404-04:00 stderr F I0407 08:06:26.795796       1 quorumguardcleanupcontroller.go:133] 3/2 guard pods ready. Waiting until all new guard pods are ready
2024-04-07T04:06:26.914137858-04:00 stderr F E0407 08:06:26.909780       1 base_controller.go:272] EtcdEndpointsController reconciliation failed: EtcdEndpointsController can't evaluate whether quorum is safe: CheckSafeToScaleCluster 3 nodes are required, but only 2 are available
2024-04-07T04:06:27.021163775-04:00 stderr F E0407 08:06:27.005267       1 base_controller.go:272] EtcdEndpointsController reconciliation failed: EtcdEndpointsController can't evaluate whether quorum is safe: CheckSafeToScaleCluster 3 nodes are required, but only 2 are available
2024-04-07T04:06:27.021163775-04:00 stderr F I0407 08:06:27.005308       1 event.go:285] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-etcd-operator", Name:"etcd-operator", UID:"253380d4-7d65-496f-8214-ab89f7878550", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'SecretUpdated' Updated Secret/etcd-all-certs -n openshift-etcd because it changed
2024-04-07T04:06:27.021163775-04:00 stderr F I0407 08:06:27.019340       1 event.go:285] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-etcd-operator", Name:"etcd-operator", UID:"253380d4-7d65-496f-8214-ab89f7878550", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'RevisionTriggered' new revision 47 triggered by "secret/etcd-all-certs has changed"
2024-04-07T04:06:27.088562786-04:00 stderr F I0407 08:06:27.081163       1 quorumguardcleanupcontroller.go:133] 3/2 guard pods ready. Waiting until all new guard pods are ready
2024-04-07T04:06:27.157507402-04:00 stderr F E0407 08:06:27.149860       1 base_controller.go:272] EtcdCertSignerController reconciliation failed: EtcdCertSignerController can't evaluate whether quorum is safe: CheckSafeToScaleCluster 3 nodes are required, but only 2 are available
2024-04-07T04:06:27.237250673-04:00 stderr F E0407 08:06:27.237102       1 base_controller.go:272] EtcdEndpointsController reconciliation failed: EtcdEndpointsController can't evaluate whether quorum is safe: CheckSafeToScaleCluster 3 nodes are required, but only 2 are available
2024-04-07T04:06:27.241366194-04:00 stderr F E0407 08:06:27.240085       1 base_controller.go:272] EtcdCertSignerController reconciliation failed: EtcdCertSignerController can't evaluate whether quorum is safe: CheckSafeToScaleCluster 3 nodes are required, but only 2 are available
2024-04-07T04:06:27.278421294-04:00 stderr F E0407 08:06:27.278373       1 base_controller.go:272] EtcdCertSignerController reconciliation failed: EtcdCertSignerController can't evaluate whether quorum is safe: CheckSafeToScaleCluster 3 nodes are required, but only 2 are available

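The EtcdCertSignerController checks the quorum only at the beginning of its sync, so a quorum change can be missed by the time it applies the secret. In the logs above, Secret/etcd-all-certs is updated and revision 47 is triggered even though the quorum check is already failing.
This PR makes it check the quorum right before the update, the same way the EtcdEndpointsController checks the quorum before updating the config map: https://github.com/openshift/cluster-etcd-operator/blob/3d5483e1871ba147a692736c71645175a85769c4/pkg/operator/etcdendpointscontroller/etcdendpointscontroller.go#L146-L153
I found this issue in OCP 4.12, but I think all related versions are affected.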

tjungblu commented 7 months ago

I think the race condition is absolutely valid @lance5890 - I'd put it into our bug backlog. The PR wasn't entirely wrong either, I would need to check whether we mutate some state in between that we won't get back after and update accordingly.

I think the next three weeks might be busier for us because of the upcoming 4.16 feature freeze, but after that we will definitely take a look at it.

tjungblu commented 7 months ago

Here's the bug ticket for further reference :) https://issues.redhat.com/browse/OCPBUGS-31849

lance5890 commented 7 months ago

> I think the race condition is absolutely valid @lance5890 - I'd put it into our bug backlog. The PR wasn't entirely wrong either, I would need to check whether we mutate some state in between that we won't get back after and update accordingly.
>
> I think the next three weeks might be busier for us because of the upcoming 4.16 feature freeze, but after that we will definitely take a look at it.

After thinking about it for a while, I don't think this PR addresses the issue completely. I have prepared another PR that I think will fix it completely, and it adds and modifies the unit tests at the same time. I will push it maybe tomorrow.

tjungblu commented 5 months ago

@lance5890 I think I found a more reliable way to tackle this, directly at the root of the static pod installation controller: https://issues.redhat.com/browse/ETCD-612

So that way we don't have to continue to spread this check everywhere.

lance5890 commented 5 months ago

> @lance5890 I think I found a more reliable way to tackle this, directly at the root of the static pod installation controller: https://issues.redhat.com/browse/ETCD-612
>
> So that way we don't have to continue to spread this check everywhere.

Yep, it is better to stop a new installer pod from starting when the quorum check fails.

openshift / cluster-etcd-operator

quorum check when update secret #1237