pravega / zookeeper-operator

Kubernetes Operator for Zookeeper

Unable to deploy Zookeeper Cluster #555

Open junglie85 opened 1 year ago

junglie85 commented 1 year ago

Description

I'm trying to deploy a Pravega cluster to EKS but cannot get a Zookeeper cluster running. I've deployed the zookeeper-operator and zookeeper charts, and the logs show no errors that I can see, but there are no Zookeeper pods:

kubectl get zookeepercluster -n pravega
NAME        REPLICAS   READY REPLICAS   VERSION   DESIRED VERSION   INTERNAL ENDPOINT   EXTERNAL ENDPOINT   AGE
zookeeper   3                                     0.2.15                                                    3m37s

kubectl describe zookeepercluster/zookeeper -n pravega
Name:         zookeeper
Namespace:    pravega
Labels:       app.kubernetes.io/managed-by=Helm
              app.kubernetes.io/name=zookeeper
              app.kubernetes.io/version=0.2.15
              helm.sh/chart=zookeeper-0.2.15
Annotations:  meta.helm.sh/release-name: zookeeper
              meta.helm.sh/release-namespace: pravega
API Version:  zookeeper.pravega.io/v1beta1
Kind:         ZookeeperCluster
Metadata:
  Creation Timestamp:  2023-05-10T15:12:04Z
  Generation:          1
  Resource Version:    5054825
Spec:
  Config:
    Pre Alloc Size:  16384
  Image:
    Repository:               pravega/zookeeper
    Tag:                      0.2.15
  Kubernetes Cluster Domain:  cluster.local
  Persistence:
    Reclaim Policy:  Delete
    Spec:
      Resources:
        Requests:
          Storage:         20Gi
      Storage Class Name:  gp3
  Pod:
    Service Account Name:  zookeeper
  Probes:
    Liveness Probe:
      Failure Threshold:      3
      Initial Delay Seconds:  10
      Period Seconds:         10
      Timeout Seconds:        10
    Readiness Probe:
      Failure Threshold:      3
      Initial Delay Seconds:  10
      Period Seconds:         10
      Success Threshold:      1
      Timeout Seconds:        10
  Replicas:                   3
  Storage Type:               persistence
Events:                       <none>

kubectl get job -n pravega
NAME                                       COMPLETIONS   DURATION   AGE
job.batch/zookeeper-post-install-upgrade   0/1           2m34s      2m34s

kubectl get pod -n pravega
NAME                                       READY   STATUS    RESTARTS   AGE
pod/nfs-server-provisioner-0               1/1     Running   0          6h16m
pod/pravega-operator-69f9b6fd48-86942      1/1     Running   0          6h15m
pod/zookeeper-operator-66f95cb4b9-xhzfr    1/1     Running   0          6h28m
pod/zookeeper-post-install-upgrade-4cbtf   0/1     Error     0          2m24s
pod/zookeeper-post-install-upgrade-gpd9b   0/1     Error     0          71s
pod/zookeeper-post-install-upgrade-lv8vp   0/1     Error     0          4m22s
pod/zookeeper-post-install-upgrade-pvbtc   0/1     Error     0          3m27s

kubectl get zookeepercluster -n pravega
NAME        REPLICAS   READY REPLICAS   VERSION   DESIRED VERSION   INTERNAL ENDPOINT   EXTERNAL ENDPOINT   AGE
zookeeper   3                                     0.2.15                                                    7m35s

kubectl logs replicaset.apps/zookeeper-operator-66f95cb4b9 -n pravega
{"level":"info","ts":1683708510.5929866,"logger":"cmd","msg":"zookeeper-operator Version: 0.2.14-16"}
{"level":"info","ts":1683708510.593027,"logger":"cmd","msg":"Git SHA: 28d1f69"}
{"level":"info","ts":1683708510.5930321,"logger":"cmd","msg":"Go Version: go1.19.7"}
{"level":"info","ts":1683708510.5930438,"logger":"cmd","msg":"Go OS/Arch: linux/amd64"}
I0510 08:48:31.643734       1 request.go:601] Waited for 1.036627133s due to client-side throttling, not priority and fairness, request: GET:https://172.20.0.1:443/apis/secrets.crossplane.io/v1alpha1?timeout=32s
time="2023-05-10T08:48:38Z" level=info msg="Leader lock zookeeper-operator-lock not found in namespace pravega"
{"level":"info","ts":1683708518.8782103,"logger":"leader","msg":"Trying to become the leader."}
I0510 08:48:41.679506       1 request.go:601] Waited for 2.794054996s due to client-side throttling, not priority and fairness, request: GET:https://172.20.0.1:443/apis/mediapackage.aws.upbound.io/v1beta1?timeout=32s
{"level":"info","ts":1683708527.1470127,"logger":"leader","msg":"No pre-existing lock was found."}
{"level":"info","ts":1683708527.1528108,"logger":"leader","msg":"Became the leader."}
I0510 08:48:51.703887       1 request.go:601] Waited for 4.539519742s due to client-side throttling, not priority and fairness, request: GET:https://172.20.0.1:443/apis/imagebuilder.aws.upbound.io/v1beta1?timeout=32s
{"level":"info","ts":1683708535.425317,"logger":"controller-runtime.metrics","msg":"Metrics server is starting to listen","addr":"127.0.0.1:6000"}
{"level":"info","ts":1683708535.425616,"logger":"cmd","msg":"starting manager"}
{"level":"info","ts":1683708535.426141,"msg":"Starting server","path":"/metrics","kind":"metrics","addr":"127.0.0.1:6000"}
{"level":"info","ts":1683708535.4262583,"msg":"Starting EventSource","controller":"zookeepercluster","controllerGroup":"zookeeper.pravega.io","controllerKind":"ZookeeperCluster","source":"kind source: *v1beta1.ZookeeperCluster"}
{"level":"info","ts":1683708535.4264324,"msg":"Starting EventSource","controller":"zookeepercluster","controllerGroup":"zookeeper.pravega.io","controllerKind":"ZookeeperCluster","source":"kind source: *v1.StatefulSet"}
{"level":"info","ts":1683708535.426441,"msg":"Starting EventSource","controller":"zookeepercluster","controllerGroup":"zookeeper.pravega.io","controllerKind":"ZookeeperCluster","source":"kind source: *v1.Service"}
{"level":"info","ts":1683708535.4264479,"msg":"Starting EventSource","controller":"zookeepercluster","controllerGroup":"zookeeper.pravega.io","controllerKind":"ZookeeperCluster","source":"kind source: *v1.Pod"}
{"level":"info","ts":1683708535.4264512,"msg":"Starting Controller","controller":"zookeepercluster","controllerGroup":"zookeeper.pravega.io","controllerKind":"ZookeeperCluster"}
{"level":"info","ts":1683708535.5296795,"msg":"Starting workers","controller":"zookeepercluster","controllerGroup":"zookeeper.pravega.io","controllerKind":"ZookeeperCluster","worker count":1}

I wondered if storage was an issue and have tried both the persistence and ephemeral options, with no success. I have 3 nodes in my Kubernetes node group (t3.mediums), which I assume is sufficient. The times shown above are short, but I've waited over an hour and still nothing.

How can I debug this?
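
For example, is there a recommended way to see where reconciliation stops? A few generic commands I'd expect to help, using the resource names from the output above (and assuming the operator Deployment is named zookeeper-operator):

# did the operator ever create the StatefulSet, Services, or PVCs?
kubectl get statefulset,svc,pvc -n pravega
# any scheduling or provisioning events?
kubectl get events -n pravega --sort-by=.lastTimestamp
# recent operator activity
kubectl logs deploy/zookeeper-operator -n pravega --since=10m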

Importance

Blocker.

subhranil05 commented 1 year ago

Hi @junglie85, did you check the error logs of the post-install pods? Can you post them here?

junglie85 commented 1 year ago

Hey @subhranil05, the logs aren't very helpful...

kubectl -n pravega logs pod/zookeeper-post-install-upgrade-tg6q7
Checking for ready ZK replicas
ZK replicas not ready
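
If I had to guess, the hook is just polling the cluster status until the ready count reaches the desired replicas, and that count never moves. Something like the following shows the field backing the READY REPLICAS column above (readyReplicas is my guess at what the hook polls):

kubectl get zookeepercluster zookeeper -n pravega -o jsonpath='{.status.readyReplicas}'
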
junglie85 commented 1 year ago

I think I've found the problem:

watchNamespace:
  - pravega

It should be:

watchNamespace: pravega

Is there any reason why the chart doesn't accept the list of namespaces to watch as a YAML list and convert it to a string if needed? For example:

{{ join "," .Values.watchNamespace }}
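Something like the following in the operator Deployment template would accept either form. This is a sketch only; the WATCH_NAMESPACE env var name and the Sprig kindIs check are assumptions about how the chart wires this up:

- name: WATCH_NAMESPACE
  {{- if kindIs "slice" .Values.watchNamespace }}
  value: {{ join "," .Values.watchNamespace | quote }}
  {{- else }}
  value: {{ .Values.watchNamespace | quote }}
  {{- end }}
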
davizucon commented 1 year ago

Hello @junglie85, did you find a solution? I got the same error even when setting the namespace or an empty string... the zookeeper-0 pod never becomes healthy.

parekhcoder commented 8 months ago

I already have one Zookeeper cluster. I tried to install another one in a different namespace with watchNamespace set, but the operator doesn't seem to honor it.

When I tried to uninstall the operator, it would not uninstall and gave an error saying that Zookeeper instances (from the previous installation) are still running.
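
For reference, this is roughly the shape of the second install (a sketch; the release and namespace names are placeholders, and I'm assuming the pravega Helm repo alias for the chart):

helm install zookeeper-operator-b pravega/zookeeper-operator \
  --namespace team-b \
  --set watchNamespace=team-b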

RaulGracia commented 8 months ago

We have contributed new guidelines for deploying Pravega on EKS: https://github.com/pravega/pravega/tree/master/deployment/aws-eks. Once you deploy the cluster with a volume provisioner and the right permissions, I had no problem deploying Zookeeper. Hope it helps.
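
On EKS specifically, note that the ZookeeperCluster above requests the gp3 storage class, so a matching StorageClass has to exist before the PVCs can bind. A minimal sketch, assuming the EBS CSI driver is installed (the parameters are illustrative, not taken from the guide):

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp3
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
volumeBindingMode: WaitForFirstConsumer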

parekhcoder commented 8 months ago

I am trying this on our own Kubernetes cluster.