pravega / zookeeper-operator

Kubernetes Operator for Zookeeper
Apache License 2.0

Bug: Stateful set never gets updated because zk node is missing #569

Open AnSmith22 opened 1 year ago

AnSmith22 commented 1 year ago

Description

The operator fails to update the StatefulSet, returning “Error doing exists check for znode /zookeeper-operator/zookeeper: Znode exists check failed for path /zookeeper-operator/zookeeper...”, and it never recovers from this error, even after a restart.

The cause is that the operator previously crashed right before creating the zk node (which stores the cluster size, in reconcileClusterStatus), and spec.replicas was changed in the meantime. After the operator restarts, it takes the branch that reconciles the StatefulSet (in reconcileStatefulSet), which checks that the zk node exists before updating the StatefulSet. Because the zk node does not exist, the check fails, the function returns the error "Error doing exists check for znode /zookeeper-operator/zookeeper", and the reconciliation ends before it reaches reconcileClusterStatus again.

As a result, the zk node never gets created, and the update to the stateful set is blocked forever.

Importance

blocker

Location

The bug involves two functions, reconcileClusterStatus and reconcileStatefulSet, as described above.

Suggestions for an improvement

A potential solution is to create the zk node in reconcileStatefulSet when the existence check fails. I can send a PR to fix the issue.
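A minimal Go sketch of this suggestion follows. It is not a patch against the operator: the `znodeStore` interface, the in-memory `memStore` fake, the `ensureZnode` helper, and the payload strings are all hypothetical. It also assumes the client can distinguish "node is missing" from "the check itself failed", which the real client may not do today.

```go
package main

import (
	"errors"
	"fmt"
)

// znodeStore is a hypothetical, minimal view of a ZooKeeper client.
type znodeStore interface {
	Exists(path string) (bool, error)
	Create(path, data string) error
}

// memStore is an in-memory fake so the sketch runs without ZooKeeper.
// Unlike real ZooKeeper, it does not require parent nodes to exist.
type memStore struct{ nodes map[string]string }

func (m *memStore) Exists(path string) (bool, error) {
	_, ok := m.nodes[path]
	return ok, nil
}

func (m *memStore) Create(path, data string) error {
	if _, ok := m.nodes[path]; ok {
		return errors.New("znode already exists: " + path)
	}
	m.nodes[path] = data
	return nil
}

// ensureZnode creates the node when it is missing instead of failing,
// and leaves an existing node untouched.
func ensureZnode(s znodeStore, path, data string) error {
	ok, err := s.Exists(path)
	if err != nil {
		return fmt.Errorf("exists check for znode %s: %w", path, err)
	}
	if ok {
		return nil
	}
	if err := s.Create(path, data); err != nil {
		return fmt.Errorf("create znode %s: %w", path, err)
	}
	return nil
}

func main() {
	s := &memStore{nodes: map[string]string{}}
	const path = "/zookeeper-operator/zookeeper"

	// First call creates the missing node; the payload is a placeholder.
	fmt.Println(ensureZnode(s, path, "size=3"), s.nodes[path])
	// Second call is a no-op because the node already exists.
	fmt.Println(ensureZnode(s, path, "size=5"), s.nodes[path])
}
```

Calling a helper like this at the start of the StatefulSet update path would let reconciliation proceed after a crash, while a genuine connection error would still be returned to the caller.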