openshift / cluster-etcd-operator

Operator to manage the lifecycle of the etcd members of an OpenShift cluster
Apache License 2.0

quorum check when update secret #1237

Closed lance5890 closed 7 months ago

lance5890 commented 7 months ago
  1. We found that etcd occasionally rolls out a new revision when we remove one master from a 3-master cluster. The logs show the following:

    2024-04-07T04:06:26.674157868-04:00 stderr F I0407 08:06:26.674067       1 event.go:285] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-etcd-operator", Name:"etcd-operator", UID:"253380d4-7d65-496f-8214-ab89f7878550", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Warning' reason: 'MasterNodeRemoved' Observed removal of master node node3
    2024-04-07T04:06:26.748289563-04:00 stderr F I0407 08:06:26.748226       1 quorumguardcleanupcontroller.go:133] 3/2 guard pods ready. Waiting until all new guard pods are ready
    2024-04-07T04:06:26.762266273-04:00 stderr F I0407 08:06:26.761362       1 quorumguardcleanupcontroller.go:133] 3/2 guard pods ready. Waiting until all new guard pods are ready
    2024-04-07T04:06:26.762266273-04:00 stderr F E0407 08:06:26.761749       1 base_controller.go:272] EtcdEndpointsController reconciliation failed: EtcdEndpointsController can't evaluate whether quorum is safe: CheckSafeToScaleCluster 3 nodes are required, but only 2 are available
    2024-04-07T04:06:26.769422768-04:00 stderr F I0407 08:06:26.768748       1 status_controller.go:211] clusteroperator/etcd diff {"status":{"conditions":[{"lastTransitionTime":"2024-04-01T10:03:40Z","message":"The etcd backup controller is starting, and will decide if recent backups are available or if a backup is required","reason":"ControllerStarted","status":"Unknown","type":"RecentBackup"},{"lastTransitionTime":"2024-04-07T06:15:31Z","message":"NodeControllerDegraded: All master nodes are ready\nEtcdEndpointsDegraded: EtcdEndpointsController can't evaluate whether quorum is safe: CheckSafeToScaleCluster 3 nodes are required, but only 2 are available\nEtcdMembersDegraded: No unhealthy members found","reason":"AsExpected","status":"False","type":"Degraded"},{"lastTransitionTime":"2024-04-07T06:23:20Z","message":"NodeInstallerProgressing: 3 nodes are at revision 46\nEtcdMembersProgressing: No unstarted etcd members found","reason":"AsExpected","status":"False","type":"Progressing"},{"lastTransitionTime":"2024-04-01T10:06:28Z","message":"StaticPodsAvailable: 3 nodes are active; 3 nodes are at revision 46\nEtcdMembersAvailable: 3 members are available","reason":"AsExpected","status":"True","type":"Available"},{"lastTransitionTime":"2024-04-01T10:04:33Z","message":"All is well","reason":"AsExpected","status":"True","type":"Upgradeable"}]}}
    2024-04-07T04:06:26.794964201-04:00 stderr F I0407 08:06:26.794326       1 event.go:285] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-etcd-operator", Name:"etcd-operator", UID:"253380d4-7d65-496f-8214-ab89f7878550", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'OperatorStatusChanged' Status for clusteroperator/etcd changed: Degraded message changed from "NodeControllerDegraded: All master nodes are ready\nEtcdMembersDegraded: No unhealthy members found" to "NodeControllerDegraded: All master nodes are ready\nEtcdEndpointsDegraded: EtcdEndpointsController can't evaluate whether quorum is safe: CheckSafeToScaleCluster 3 nodes are required, but only 2 are available\nEtcdMembersDegraded: No unhealthy members found"
    2024-04-07T04:06:26.796446404-04:00 stderr F I0407 08:06:26.795796       1 quorumguardcleanupcontroller.go:133] 3/2 guard pods ready. Waiting until all new guard pods are ready
    2024-04-07T04:06:26.914137858-04:00 stderr F E0407 08:06:26.909780       1 base_controller.go:272] EtcdEndpointsController reconciliation failed: EtcdEndpointsController can't evaluate whether quorum is safe: CheckSafeToScaleCluster 3 nodes are required, but only 2 are available
    2024-04-07T04:06:27.021163775-04:00 stderr F E0407 08:06:27.005267       1 base_controller.go:272] EtcdEndpointsController reconciliation failed: EtcdEndpointsController can't evaluate whether quorum is safe: CheckSafeToScaleCluster 3 nodes are required, but only 2 are available
    2024-04-07T04:06:27.021163775-04:00 stderr F I0407 08:06:27.005308       1 event.go:285] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-etcd-operator", Name:"etcd-operator", UID:"253380d4-7d65-496f-8214-ab89f7878550", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'SecretUpdated' Updated Secret/etcd-all-certs -n openshift-etcd because it changed
    2024-04-07T04:06:27.021163775-04:00 stderr F I0407 08:06:27.019340       1 event.go:285] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-etcd-operator", Name:"etcd-operator", UID:"253380d4-7d65-496f-8214-ab89f7878550", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'RevisionTriggered' new revision 47 triggered by "secret/etcd-all-certs has changed"
    2024-04-07T04:06:27.088562786-04:00 stderr F I0407 08:06:27.081163       1 quorumguardcleanupcontroller.go:133] 3/2 guard pods ready. Waiting until all new guard pods are ready
    2024-04-07T04:06:27.157507402-04:00 stderr F E0407 08:06:27.149860       1 base_controller.go:272] EtcdCertSignerController reconciliation failed: EtcdCertSignerController can't evaluate whether quorum is safe: CheckSafeToScaleCluster 3 nodes are required, but only 2 are available
    2024-04-07T04:06:27.237250673-04:00 stderr F E0407 08:06:27.237102       1 base_controller.go:272] EtcdEndpointsController reconciliation failed: EtcdEndpointsController can't evaluate whether quorum is safe: CheckSafeToScaleCluster 3 nodes are required, but only 2 are available
    2024-04-07T04:06:27.241366194-04:00 stderr F E0407 08:06:27.240085       1 base_controller.go:272] EtcdCertSignerController reconciliation failed: EtcdCertSignerController can't evaluate whether quorum is safe: CheckSafeToScaleCluster 3 nodes are required, but only 2 are available
    2024-04-07T04:06:27.278421294-04:00 stderr F E0407 08:06:27.278373       1 base_controller.go:272] EtcdCertSignerController reconciliation failed: EtcdCertSignerController can't evaluate whether quorum is safe: CheckSafeToScaleCluster 3 nodes are required, but only 2 are available
  2. The EtcdCertSignerController checks quorum only at the beginning of its sync, so a quorum change can be missed by the time it applies the secret.

  3. The etcdendpointscontroller, on the other hand, checks quorum right before it updates the ConfigMap: https://github.com/openshift/cluster-etcd-operator/blob/3d5483e1871ba147a692736c71645175a85769c4/pkg/operator/etcdendpointscontroller/etcdendpointscontroller.go#L146-L153

  4. I found this issue in OCP 4.12, but I think all related versions are affected.
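
To make the ordering problem in points 2 and 3 concrete, here is a minimal, self-contained sketch. It is not the operator's code: `quorumCheck`, `syncCheckFirst`, `syncCheckBeforeWrite`, and `simulate` are hypothetical names invented for illustration, and the "3 nodes required" rule is hard-coded to mirror the log message above. It simulates a master being removed while a controller is still preparing its secret update.

```go
package main

import (
	"fmt"
)

// quorumCheck returns an error when it is not safe to change the cluster.
// Hypothetical stand-in for the operator's real quorum checker.
type quorumCheck func() error

// syncCheckFirst checks quorum once at the start, then does its preparation
// work before writing. A membership change during prepare goes unnoticed.
func syncCheckFirst(check quorumCheck, prepare func(), write func()) error {
	if err := check(); err != nil {
		return err
	}
	prepare()
	write()
	return nil
}

// syncCheckBeforeWrite evaluates quorum immediately before the write, so a
// membership change during prepare blocks the update.
func syncCheckBeforeWrite(check quorumCheck, prepare func(), write func()) error {
	prepare()
	if err := check(); err != nil {
		return fmt.Errorf("skipping secret update: %w", err)
	}
	write()
	return nil
}

// simulate runs one sync during which a master node disappears while the
// controller is preparing. It reports whether the secret was written.
func simulate(sync func(quorumCheck, func(), func()) error) (bool, error) {
	available := 3
	written := false
	check := func() error {
		if available < 3 {
			return fmt.Errorf("3 nodes are required, but only %d are available", available)
		}
		return nil
	}
	prepare := func() { available = 2 } // node removed mid-sync
	write := func() { written = true }  // stands in for updating the secret
	err := sync(check, prepare, write)
	return written, err
}

func main() {
	written, err := simulate(syncCheckFirst)
	fmt.Println("check first:        written =", written, "err =", err)

	written, err = simulate(syncCheckBeforeWrite)
	fmt.Println("check before write: written =", written, "err =", err)
}
```

Note that moving the check next to the write only narrows the window. The check and the write are still two separate steps, so a node could in principle disappear between them.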

tjungblu commented 7 months ago

I think the race condition is absolutely valid @lance5890 - I'd put it into our bug backlog. The PR wasn't entirely wrong either, I would need to check whether we mutate some state in between that we won't get back after and update accordingly.

I think the next three weeks might be busier for us because of the upcoming 4.16 feature freeze, but after that we will definitely take a look at it.

tjungblu commented 7 months ago

Here's the bug ticket for further reference :) https://issues.redhat.com/browse/OCPBUGS-31849

lance5890 commented 7 months ago

After thinking about it for a while, I don't think this PR addresses the issue completely. I have prepared another PR that I think will fix it completely, and it adds and updates the unit tests at the same time. I will probably submit it tomorrow.

tjungblu commented 5 months ago

@lance5890 I think I found a more reliable way to tackle this, directly at the root of the static pod installation controller: https://issues.redhat.com/browse/ETCD-612

So that way we don't have to continue to spread this check everywhere.

lance5890 commented 5 months ago

Yep, it is better to stop a new installer pod when the quorum check fails.