pravega / zookeeper-operator

Kubernetes Operator for Zookeeper
Apache License 2.0

Operator stuck in PodsReady: False, with all pods running because of wrong Status.MetaRootCreated #506

Open lunarfs opened 2 years ago

lunarfs commented 2 years ago

Description

The operator is stuck in a reconcile loop, repeatedly logging:

{"level":"error","ts":1666616251.8807623,"msg":"Reconciler error","controller":"zookeepercluster","controllerGroup":"zookeeper.pravega.io","controllerKind":"ZookeeperCluster","zookeeperCluster":{"name":"zookeeper","namespace":"zookeeper"},"namespace":"zookeeper","name":"zookeeper","reconcileID":"225c23ce-c2ee-49e0-9c34-99f47424e374","error":"Error creating cluster metadata path /zookeeper-operator/zookeeper, Error creating parent zkNode: /zookeeper-operator: zk: node already exists","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.12.1/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.12.1/pkg/internal/controller/controller.go:227"}

The cluster status is:

Status:
  Conditions:
    Status:                  False
    Type:                    PodsReady
    Status:                  False
    Type:                    Upgrading
    Status:                  False
    Type:                    Error
  External Client Endpoint:  N/A
  Internal Client Endpoint:  XXX.XXX.XXX.XXX:2181
  Members:
    Ready:
      zookeeper-2
      zookeeper-0
    Unready:
      zookeeper-1
  Ready Replicas:  3
  Replicas:        3

Not sure how we ended up here, but the issue is that Status.MetaRootCreated does not reflect the real state of ZooKeeper: the metadata znode already exists, yet the flag is still false, so every reconcile retries the creation and fails with "zk: node already exists". Doing an actual existence check before creating the node prevents this failure. The patch below recovered my cluster. I can create a pull request with the changes if you think this makes sense.

Importance

should-have

Location

controllers/zookeepercluster_controller.go

Suggestions for an improvement

diff --git a/controllers/zookeepercluster_controller.go b/controllers/zookeepercluster_controller.go
index 934f405..a9cd966 100644
--- a/controllers/zookeepercluster_controller.go
+++ b/controllers/zookeepercluster_controller.go
@@ -565,11 +565,16 @@ func (r *ZookeeperClusterReconciler) reconcileClusterStatus(instance *zookeeperv
                        return fmt.Errorf("Error creating cluster metaroot. Connect to zk failed %v", err)
                }
                defer r.ZkClient.Close()
-               metaPath := utils.GetMetaPath(instance)
-               r.Log.Info("Connected to zookeeper:", "ZKUri", zkUri, "Creating Path", metaPath)
-               if err := r.ZkClient.CreateNode(instance, metaPath); err != nil {
-                       return fmt.Errorf("Error creating cluster metadata path %s, %v", metaPath, err)
-               }
+               metaPath := utils.GetMetaPath(instance)
+               version, err := r.ZkClient.NodeExists(metaPath)
+               if err != nil {
+                       r.Log.Info("Connected to zookeeper:", "ZKUri", zkUri, "Creating Path", metaPath)
+                       if err := r.ZkClient.CreateNode(instance, metaPath); err != nil {
+                               return fmt.Errorf("Error creating cluster metadata path %s, %v", metaPath, err)
+                       }
+               } else {
+                       r.Log.Info("Path already exists", "Path", metaPath, "Version", version)
+               }
                r.Log.Info("Metadata znode created.")
                instance.Status.MetaRootCreated = true
        }
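
For clusters already wedged in this state, a complementary option is to make the creation itself idempotent by treating "node already exists" as success. Below is a minimal standalone sketch using the github.com/go-zookeeper/zk client directly rather than the operator's ZkClient wrapper; the endpoint is illustrative, and the paths are taken from the error message above:

package main

import (
	"errors"
	"fmt"
	"time"

	"github.com/go-zookeeper/zk"
)

// ensurePath creates a znode if it is missing and treats
// zk.ErrNodeExists as success, so retries are idempotent.
func ensurePath(conn *zk.Conn, path string) error {
	_, err := conn.Create(path, nil, 0, zk.WorldACL(zk.PermAll))
	if errors.Is(err, zk.ErrNodeExists) {
		return nil // the node is already there; nothing to do
	}
	return err
}

func main() {
	// The endpoint is illustrative; the operator derives the zkUri
	// from the cluster's internal client endpoint (see status above).
	conn, _, err := zk.Connect([]string{"127.0.0.1:2181"}, 10*time.Second)
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	// Paths from the error message: parent first, then the cluster node.
	for _, p := range []string{"/zookeeper-operator", "/zookeeper-operator/zookeeper"} {
		if err := ensurePath(conn, p); err != nil {
			panic(fmt.Errorf("ensuring %s: %w", p, err))
		}
	}
}

With creation made idempotent, a stale Status.MetaRootCreated flag can no longer block the reconcile loop, because re-creating an existing path becomes a no-op.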
28ori commented 1 year ago

Are you using OpenShift? I also had this issue: the operator tries to connect directly to the pod to create the znode. Make sure you have a network policy that allows the operator pod to reach the ZooKeeper pods; a sketch of such a policy follows.
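
For reference, a minimal NetworkPolicy sketch that would allow this traffic. The namespace, pod label, and selector are assumptions and need adapting to your cluster:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-operator-to-zookeeper
  namespace: zookeeper              # namespace of the ZooKeeper pods
spec:
  podSelector:
    matchLabels:
      app: zookeeper                # assumed pod label; match your ZooKeeper pods
  policyTypes:
    - Ingress
  ingress:
    - from:
        - namespaceSelector: {}     # tighten to the operator's namespace/labels
      ports:
        - protocol: TCP
          port: 2181                # ZooKeeper client port, per the status above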

adrian-salazar-klarrio commented 1 year ago

I am also facing this issue after restoring the PVCs from a Velero backup. The patch works for me, and it makes total sense to check for node existence if the creation would otherwise crash because the node exists.