pravega / zookeeper-operator

Kubernetes Operator for Zookeeper

Zookeeper pod keeps crashing when scaling down and up #513

Open · hoyhbx opened this issue 2 years ago

hoyhbx commented 2 years ago

Description

We found that the ZooKeeper pod keeps crashing after a scale-down workload followed by a scale-up workload. The crash happens because the new pods reuse PVCs that were not yet deleted during the scale-down. The problem persists even when we specify reclaimPolicy: Delete.

The root cause is that there is no guard preventing the scale-up from proceeding before the PVC deletion finishes. zookeeper-operator checks whether the upgrade has failed before updating the StatefulSet, but this check does not account for undeleted orphan PVCs. When a new pod reuses an old PVC, the startup script mistakenly treats the membership list on the old PVC as up to date and skips updating it, so ZooKeeper crashes on startup because of the inconsistent membership list.

The orphaned PVCs never get deleted, because the operator waits for the number of ready replicas to equal the desired replica count before cleaning up, and that never happens while one pod keeps crashing.

Importance

Importance: must-have

Location

The PVC cleanup code is here: https://github.com/pravega/zookeeper-operator/blob/c29d475794226ce38de021d72430ccde75a38d2a/controllers/zookeepercluster_controller.go#L707

The logic that updates the StatefulSet is here: https://github.com/pravega/zookeeper-operator/blob/c29d475794226ce38de021d72430ccde75a38d2a/controllers/zookeepercluster_controller.go#L264

Suggestions for an improvement

We think the operator should verify that all orphaned PVCs have been deleted before proceeding with the next upgrade. Concretely, we suggest adding a check that no orphaned PVCs remain before updating the StatefulSet.
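To make the condition concrete, here is a rough shell sketch of the check we have in mind, run from outside the operator. The app=<cluster> label selector and the trailing-ordinal PVC naming are assumptions for illustration, not the operator's actual conventions: any PVC whose ordinal suffix is at or above the desired replica count is an orphan left over from the scale-down.

```bash
#!/usr/bin/env bash
# Hypothetical illustration: list the cluster's PVCs and flag any whose
# ordinal suffix is >= the desired replica count. The label selector and
# the <prefix>-<ordinal> naming convention are assumptions, not the
# operator's actual code.
CLUSTER=zookeeper
DESIRED=3

kubectl get pvc -l app="$CLUSTER" -o name | while read -r pvc; do
  ordinal=${pvc##*-}   # trailing ordinal from the PVC name
  if [ "$ordinal" -ge "$DESIRED" ] 2>/dev/null; then
    echo "orphaned PVC still present: $pvc"
  fi
done
```

Inside the operator, the reconcile loop would requeue while any such PVC exists, so the StatefulSet update (and therefore the scale-up) could not race with the PVC cleanup.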


hoyhbx commented 1 year ago

This PR (https://github.com/pravega/zookeeper-operator/pull/526) can fix the issue. An alternative (and simpler) fix is to change the startup script in the zookeeper container so that the membership list is updated on start. The root cause of the crashing ZooKeeper pods when a PVC is reused is that the membership list file on the old PVC does not contain the new node.

hoyhbx commented 1 year ago

@anishakj Would you mind taking a look at this? The root cause of this bug is that the restarted ZooKeeper node has an inconsistent membership list. In the zookeeper teardown script, the node's ID is removed from the membership list in the dynamic configuration file, which exists on all PVCs (https://github.com/pravega/zookeeper-operator/blob/72bea545a73130c82b530e508a2a14daaa824435/docker/bin/zookeeperTeardown.sh#L45). But the zookeeper startup script does not update the dynamic configuration file (https://github.com/pravega/zookeeper-operator/blob/72bea545a73130c82b530e508a2a14daaa824435/docker/bin/zookeeperStart.sh#L131-L137) if the myid file and the dynamic config file already exist, which is the case when a PVC is reused (https://github.com/pravega/zookeeper-operator/blob/72bea545a73130c82b530e508a2a14daaa824435/docker/bin/zookeeperStart.sh#L110-L114).

This causes a problem when a new ZooKeeper node reuses an old PVC. During bootup, if the myid file and the dynamic config file exist, the startup script skips updating the membership list, assuming it is correct. In reality, this node was removed from the membership list in the dynamic config file during teardown. As a result, the newly started node keeps crashing with the following exception:

```
2023-04-10 19:34:44,914 [myid:2] - INFO  [main:AbstractConnector@381] - Stopped ServerConnector@5c30a9b0{HTTP/1.1, (http/1.1)}{0.0.0.0:7000}
2023-04-10 19:34:44,915 [myid:2] - INFO  [main:ContextHandler@1154] - Stopped o.e.j.s.ServletContextHandler@9d5509a{/,null,STOPPED}
2023-04-10 19:34:44,915 [myid:2] - ERROR [main:QuorumPeerMain@114] - Unexpected exception, exiting abnormally
java.lang.RuntimeException: My id 2 not in the peer list
    at org.apache.zookeeper.server.quorum.QuorumPeer.start(QuorumPeer.java:1128)
    at org.apache.zookeeper.server.quorum.QuorumPeerMain.runFromConfig(QuorumPeerMain.java:229)
    at org.apache.zookeeper.server.quorum.QuorumPeerMain.initializeAndRun(QuorumPeerMain.java:137)
    at org.apache.zookeeper.server.quorum.QuorumPeerMain.main(QuorumPeerMain.java:91)
2023-04-10 19:34:44,916 [myid:2] - INFO  [main:ZKAuditProvider@42] - ZooKeeper audit is disabled.
2023-04-10 19:34:44,917 [myid:2] - ERROR [main:ServiceUtils@42] - Exiting JVM with code 1
```

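For reference, a minimal sketch of the startup-side fix described above, assuming the rough shape of zookeeperStart.sh; DATA_DIR, DYNCONFIG, and HOST are placeholder names for illustration, not the script's actual variables. The idea is to re-add this server to the dynamic configuration when a reused PVC already carries a myid and a dynamic config file, instead of trusting the stale membership list:

```bash
# Sketch only: re-register this server on startup when the PVC is reused.
# DATA_DIR, DYNCONFIG, and HOST are hypothetical placeholders.
if [ -f "$DATA_DIR/myid" ] && [ -f "$DYNCONFIG" ]; then
  MYID=$(cat "$DATA_DIR/myid")
  # Teardown may have removed this server from the dynamic config, so
  # restore the entry rather than assuming the membership list is current.
  if ! grep -q "^server\.${MYID}=" "$DYNCONFIG"; then
    echo "server.${MYID}=${HOST}:2888:3888:participant;0.0.0.0:2181" >> "$DYNCONFIG"
  fi
fi
```

A more robust variant would use ZooKeeper's dynamic reconfig API against the running quorum rather than editing the file directly, since the rest of the ensemble also has to learn about the re-added member.
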
iampranabroy commented 1 year ago

Hey @hoyhbx @anishakj - Any plans to merge the changes for future releases?

anishakj commented 11 months ago

@hoyhbx Could you please increase the test coverage?