Description
We find that the operator cannot update the StatefulSet. Reconciliation returns “Error doing exists check for znode /zookeeper-operator/zookeeper: Znode exists check failed for path /zookeeper-operator/zookeeper...” and never recovers from this error, even after a restart.
The cause is that the operator previously crashed right before creating the zk node (used for storing the cluster size, in reconcileClusterStatus), and spec.replicas was changed in the meantime. After the operator restarts, it takes the branch that reconciles the StatefulSet (in reconcileStatefulSet), which checks that the zk node exists before updating the StatefulSet. Because the zk node does not exist, the check fails and the update never proceeds: the function returns the error "Error doing exists check for znode /zookeeper-operator/zookeeper" and the reconciliation ends before reaching reconcileClusterStatus again.
As a result, the zk node is never created, and the update to the StatefulSet is blocked forever.
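The stuck control flow described above can be modeled with a small, self-contained Go sketch. It is not the operator's code: fakeZK, nodeExists, reconcile, and the function signatures are hypothetical stand-ins, and only the two function names, the znode path, and the error text come from this report. It assumes the exists check runs only when the replica count has to change.

```go
package main

import (
	"errors"
	"fmt"
)

// metaPath is the znode path quoted in the error message of this report.
const metaPath = "/zookeeper-operator/zookeeper"

// fakeZK is a hypothetical in-memory stand-in for the ZooKeeper client.
type fakeZK struct{ nodes map[string]bool }

// nodeExists reports a missing znode as an error, as described in the report.
func (z *fakeZK) nodeExists(path string) error {
	if !z.nodes[path] {
		return errors.New("Znode exists check failed for path " + path)
	}
	return nil
}

// reconcileStatefulSet refuses to update the replica count when the znode is missing.
func reconcileStatefulSet(z *fakeZK, specReplicas int, stsReplicas *int) error {
	if specReplicas == *stsReplicas {
		return nil // nothing to update, so no exists check is needed
	}
	if err := z.nodeExists(metaPath); err != nil {
		// Error text copied from the report, hence the capital letter.
		return fmt.Errorf("Error doing exists check for znode %s: %v", metaPath, err)
	}
	*stsReplicas = specReplicas
	return nil
}

// reconcileClusterStatus is the only place in this model that creates the znode.
func reconcileClusterStatus(z *fakeZK) error {
	z.nodes[metaPath] = true
	return nil
}

// reconcile runs the two steps in order and stops at the first error.
func reconcile(z *fakeZK, specReplicas int, stsReplicas *int) error {
	if err := reconcileStatefulSet(z, specReplicas, stsReplicas); err != nil {
		return err // reconcileClusterStatus is never reached
	}
	return reconcileClusterStatus(z)
}

func main() {
	// State after the crash: no znode yet, and spec.replicas changed from 3 to 5.
	z := &fakeZK{nodes: map[string]bool{}}
	stsReplicas := 3
	for attempt := 1; attempt <= 3; attempt++ {
		err := reconcile(z, 5, &stsReplicas)
		fmt.Printf("attempt %d: replicas=%d err=%v\n", attempt, stsReplicas, err)
	}
}
```

In this model every attempt fails in the same way, because the only step that could create the znode sits behind the check that requires it.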
Importance
blocker
Location
The bug involves two functions, reconcileClusterStatus and reconcileStatefulSet, as described above.
Suggestions for an improvement
A potential solution is to create the zk node in reconcileStatefulSet if the existence check fails. I can send a PR to help fix the issue.
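A minimal sketch of that suggestion is shown here, assuming the existence check can tell "node is missing" apart from other failures such as a lost connection. The znodeStore interface, the errNodeMissing sentinel, ensureZnode, memStore, and the "size=3" payload are all hypothetical names invented for illustration. They are not the operator's actual client API.

```go
package main

import (
	"errors"
	"fmt"
)

// errNodeMissing is a hypothetical sentinel for "the znode does not exist".
var errNodeMissing = errors.New("znode does not exist")

// znodeStore is a hypothetical subset of a ZooKeeper client.
type znodeStore interface {
	NodeExists(path string) error
	CreateNode(path string, data []byte) error
}

// ensureZnode creates the znode when the exists check reports it missing,
// and passes every other error through unchanged.
func ensureZnode(s znodeStore, path string, data []byte) error {
	err := s.NodeExists(path)
	if err == nil {
		return nil // already present, nothing to do
	}
	if !errors.Is(err, errNodeMissing) {
		return fmt.Errorf("exists check for znode %s: %w", path, err)
	}
	if err := s.CreateNode(path, data); err != nil {
		return fmt.Errorf("creating znode %s: %w", path, err)
	}
	return nil
}

// memStore is an in-memory znodeStore used only to run the sketch.
type memStore struct{ nodes map[string][]byte }

func (m *memStore) NodeExists(path string) error {
	if _, ok := m.nodes[path]; !ok {
		return errNodeMissing
	}
	return nil
}

func (m *memStore) CreateNode(path string, data []byte) error {
	m.nodes[path] = data
	return nil
}

func main() {
	store := &memStore{nodes: map[string][]byte{}}
	path := "/zookeeper-operator/zookeeper"
	// "size=3" is a placeholder payload, not the operator's real format.
	if err := ensureZnode(store, path, []byte("size=3")); err != nil {
		fmt.Println("reconcile blocked:", err)
		return
	}
	fmt.Println("znode present, safe to update the StatefulSet")
}
```

Creating the node only on the "missing" case keeps real connectivity errors visible instead of masking them with a create attempt. An existing node is left untouched, so the helper is safe to call on every reconcile.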