pravega / zookeeper-operator

Kubernetes Operator for Zookeeper
Apache License 2.0

Inconsistent member.ready & readyReplicas #614

Open OneCricketeer opened 2 months ago

OneCricketeer commented 2 months ago

Description

The post-install hook fails its Ready check. The cluster status shows the output below: one pod is listed under members.unready even though readyReplicas equals replicas. The "unready" pod's log output looks very similar to that of the "ready" pods, and restarting the unready pod has not helped.

  members:
    ready:
      - app-zookeeper-3
      - app-zookeeper-2
      - app-zookeeper-0
      - app-zookeeper-1
    unready:
      - app-zookeeper-4
  readyReplicas: 5
  replicas: 5

Importance

must-have

Location

(Where is the piece of code, package, or document affected by this issue?)

Suggestions for an improvement

readyReplicas should match len(members.ready), so the two fields can never disagree.
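
A minimal sketch of that suggestion, assuming the count is derived from the member list rather than tracked separately. The `MembersStatus` and `ClusterStatus` types and the helper names are hypothetical stand-ins for illustration, not the operator's actual API; the sample values are the ones from the status output above.

```go
package main

import "fmt"

// MembersStatus mirrors the members block from the reported status.
// Hypothetical type: the operator's real field names may differ.
type MembersStatus struct {
	Ready   []string
	Unready []string
}

// ClusterStatus is a trimmed-down, hypothetical view of the status.
type ClusterStatus struct {
	Members       MembersStatus
	Replicas      int32
	ReadyReplicas int32
}

// ReadyFromMembers derives the ready count from the member list,
// so the count cannot drift away from members.ready.
func ReadyFromMembers(m MembersStatus) int32 {
	return int32(len(m.Ready))
}

// Consistent reports whether readyReplicas agrees with len(members.ready).
func Consistent(s ClusterStatus) bool {
	return s.ReadyReplicas == ReadyFromMembers(s.Members)
}

func main() {
	// Values copied from the status in this issue.
	s := ClusterStatus{
		Members: MembersStatus{
			Ready:   []string{"app-zookeeper-3", "app-zookeeper-2", "app-zookeeper-0", "app-zookeeper-1"},
			Unready: []string{"app-zookeeper-4"},
		},
		Replicas:      5,
		ReadyReplicas: 5,
	}
	fmt.Println(Consistent(s), ReadyFromMembers(s.Members)) // prints "false 4"
}
```

With the values reported here the check fails: readyReplicas says 5 while only 4 members are listed as ready.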

OneCricketeer commented 2 months ago

Unready pod logs

2024-08-24 02:40:09,063 [myid:5] - INFO  [NIOWorkerThread-5:NIOServerCnxn@518] - Processing ruok command from /127.0.0.1:38160
2024-08-24 02:40:09,069 [myid:5] - INFO  [NIOWorkerThread-4:NIOServerCnxn@518] - Processing ruok command from /127.0.0.1:38162
2024-08-24 02:40:09,568 [myid:5] - WARN  [NIOWorkerThread-3:NIOServerCnxn@366] - Unable to read additional data from client sessionid 0x0, likely client has closed socket
2024-08-24 02:40:10,378 [myid:5] - WARN  [NIOWorkerThread-6:NIOServerCnxn@366] - Unable to read additional data from client sessionid 0x0, likely client has closed socket
2024-08-24 02:40:11,256 [myid:5] - WARN  [NIOWorkerThread-9:NIOServerCnxn@366] - Unable to read additional data from client sessionid 0x0, likely client has closed socket
2024-08-24 02:40:12,902 [myid:5] - WARN  [NIOWorkerThread-7:NIOServerCnxn@366] - Unable to read additional data from client sessionid 0x0, likely client has closed socket
2024-08-24 02:40:13,711 [myid:5] - WARN  [NIOWorkerThread-10:NIOServerCnxn@366] - Unable to read additional data from client sessionid 0x0, likely client has closed socket
2024-08-24 02:40:14,589 [myid:5] - WARN  [NIOWorkerThread-8:NIOServerCnxn@366] - Unable to read additional data from client sessionid 0x0, likely client has closed socket
2024-08-24 02:40:14,592 [myid:5] - WARN  [NIOWorkerThread-11:NIOServerCnxn@366] - Unable to read additional data from client sessionid 0x0, likely client has closed socket
2024-08-24 02:40:15,640 [myid:5] - WARN  [NIOWorkerThread-12:NIOServerCnxn@366] - Unable to read additional data from client sessionid 0x0, likely client has closed socket
2024-08-24 02:40:16,235 [myid:5] - WARN  [NIOWorkerThread-13:NIOServerCnxn@366] - Unable to read additional data from client sessionid 0x0, likely client has closed socket
2024-08-24 02:40:17,046 [myid:5] - WARN  [NIOWorkerThread-16:NIOServerCnxn@366] - Unable to read additional data from client sessionid 0x0, likely client has closed socket
2024-08-24 02:40:17,923 [myid:5] - WARN  [NIOWorkerThread-15:NIOServerCnxn@366] - Unable to read additional data from client sessionid 0x0, likely client has closed socket
2024-08-24 02:40:18,636 [myid:5] - WARN  [NIOWorkerThread-14:NIOServerCnxn@366] - Unable to read additional data from client sessionid 0x0, likely client has closed socket
2024-08-24 02:40:19,063 [myid:5] - INFO  [NIOWorkerThread-17:NIOServerCnxn@518] - Processing ruok command from /127.0.0.1:59816
2024-08-24 02:40:19,065 [myid:5] - INFO  [NIOWorkerThread-18:NIOServerCnxn@518] - Processing ruok command from /127.0.0.1:59822
2024-08-24 02:40:19,568 [myid:5] - WARN  [NIOWorkerThread-19:NIOServerCnxn@366] - Unable to read additional data from client sessionid 0x0, likely client has closed socket
2024-08-24 02:40:20,378 [myid:5] - WARN  [NIOWorkerThread-21:NIOServerCnxn@366] - Unable to read additional data from client sessionid 0x0, likely client has closed socket
2024-08-24 02:40:21,256 [myid:5] - WARN  [NIOWorkerThread-20:NIOServerCnxn@366] - Unable to read additional data from client sessionid 0x0, likely client has closed socket
2024-08-24 02:40:22,902 [myid:5] - WARN  [NIOWorkerThread-22:NIOServerCnxn@366] - Unable to read additional data from client sessionid 0x0, likely client has closed socket
2024-08-24 02:40:23,712 [myid:5] - WARN  [NIOWorkerThread-25:NIOServerCnxn@366] - Unable to read additional data from client sessionid 0x0, likely client has closed socket
2024-08-24 02:40:24,590 [myid:5] - WARN  [NIOWorkerThread-24:NIOServerCnxn@366] - Unable to read additional data from client sessionid 0x0, likely client has closed socket
2024-08-24 02:40:26,235 [myid:5] - WARN  [NIOWorkerThread-27:NIOServerCnxn@366] - Unable to read additional data from client sessionid 0x0, likely client has closed socket
2024-08-24 02:40:27,046 [myid:5] - WARN  [NIOWorkerThread-23:NIOServerCnxn@366] - Unable to read additional data from client sessionid 0x0, likely client has closed socket
2024-08-24 02:40:27,923 [myid:5] - WARN  [NIOWorkerThread-26:NIOServerCnxn@366] - Unable to read additional data from client sessionid 0x0, likely client has closed socket
2024-08-24 02:40:28,973 [myid:5] - WARN  [NIOWorkerThread-28:NIOServerCnxn@366] - Unable to read additional data from client sessionid 0x0, likely client has closed socket
2024-08-24 02:40:29,063 [myid:5] - INFO  [NIOWorkerThread-30:NIOServerCnxn@518] - Processing ruok command from /127.0.0.1:36344
2024-08-24 02:40:29,066 [myid:5] - INFO  [NIOWorkerThread-29:NIOServerCnxn@518] - Processing ruok command from /127.0.0.1:36356

Error seen in the operator log; apart from this, the operator reports that it is connected

"error":"Error creating cluster metadata path /zookeeper-operator/app-zookeeper, Error creating parent zkNode: /zookeeper-operator: zk: node already exists","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.12.1/pkg/internal/controller/controller.go:273\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.12.1/pkg/internal/controller/controller.go:234"}
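
For context on the "node already exists" message: a create call that runs on every reconcile is usually made idempotent by treating that error as success. The sketch below shows the pattern only and is not the operator's code. `NodeCreator`, `EnsureNode`, `fakeZK`, and the local `ErrNodeExists` sentinel are all hypothetical names standing in for a real ZooKeeper client.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrNodeExists stands in for the client's "node already exists" error.
// Hypothetical sentinel: a real client library defines its own.
var ErrNodeExists = errors.New("zk: node already exists")

// NodeCreator is a hypothetical minimal interface over a ZooKeeper client.
type NodeCreator interface {
	Create(path string) error
}

// EnsureNode creates path and treats "already exists" as success,
// so repeated calls for the same path do not surface an error.
func EnsureNode(c NodeCreator, path string) error {
	err := c.Create(path)
	if err != nil && !errors.Is(err, ErrNodeExists) {
		return fmt.Errorf("creating zkNode %s: %w", path, err)
	}
	return nil
}

// fakeZK is an in-memory stand-in used only to exercise EnsureNode.
type fakeZK struct{ nodes map[string]bool }

func (f *fakeZK) Create(path string) error {
	if f.nodes[path] {
		return ErrNodeExists
	}
	f.nodes[path] = true
	return nil
}

func main() {
	zk := &fakeZK{nodes: map[string]bool{"/zookeeper-operator": true}}
	// The parent already exists; EnsureNode still returns nil.
	fmt.Println(EnsureNode(zk, "/zookeeper-operator"))
	// The child does not exist yet and is created.
	fmt.Println(EnsureNode(zk, "/zookeeper-operator/app-zookeeper"))
}
```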

Error from post-install hook

Checking for ready ZK replicas
I0824 02:57:12.548963 [request.go:665] Waited for 1.163382298s due to client-side throttling, not priority and fairness, request: GET:https://192.168.192.1:443/apis/rbac.authorization.k8s.io/v1?timeout=32s
ZK replicas not ready

OneCricketeer commented 2 months ago

Container statuses for app-zookeeper-4: both containers report ready: true

status:
  conditions:
    - lastProbeTime: null
      lastTransitionTime: '2024-08-23T15:40:33Z'
      status: 'True'
      type: Initialized
    - lastProbeTime: null
      lastTransitionTime: '2024-08-23T15:41:09Z'
      status: 'True'
      type: Ready
    - lastProbeTime: null
      lastTransitionTime: '2024-08-23T15:41:09Z'
      status: 'True'
      type: ContainersReady
    - lastProbeTime: null
      lastTransitionTime: '2024-08-23T15:40:33Z'
      status: 'True'
      type: PodScheduled
  containerStatuses:
    - containerID: 'cri-o://0cb757458dbe524f5c75d7c63f37c2ba9baacf6503e579016ee1e7f37e419aa5'
      image: >-
        redacted
      imageID: >-
        redacted
      lastState: {}
      name: fluent-bit
      ready: true
      restartCount: 0
      started: true
      state:
        running:
          startedAt: '2024-08-23T15:40:39Z'
    - containerID: 'cri-o://1de7506155cfb27de7cad0244c4ead699fdff2f1d6ef1c258da747a35e6835c2'
      image: 'docker.io/redacted/zookeeper:3.5.7'
      imageID: >-
        docker.io/redacted/zookeeper@sha256:f032bd83682738f32757bf1f365ed9de8ee7aa41015a010083b5c3074e5f2659
      lastState: {}
      name: zookeeper
      ready: true
      restartCount: 0
      started: true
      state:
        running:
          startedAt: '2024-08-23T15:40:39Z'
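
A rough sketch of how the pod status above reads from Kubernetes' point of view, assuming readiness is taken from the pod's `Ready` condition and the per-container `ready` flags. The types below are simplified local stand-ins for the real Kubernetes API types, and the sample values are copied from the status above.

```go
package main

import "fmt"

// Simplified, hypothetical stand-ins for the Kubernetes pod status types.
type Condition struct {
	Type   string
	Status string
}

type ContainerStatus struct {
	Name  string
	Ready bool
}

type PodStatus struct {
	Conditions        []Condition
	ContainerStatuses []ContainerStatus
}

// PodReady reports whether the pod has a Ready condition with status "True".
func PodReady(s PodStatus) bool {
	for _, c := range s.Conditions {
		if c.Type == "Ready" {
			return c.Status == "True"
		}
	}
	return false
}

// AllContainersReady reports whether every listed container is ready.
// An empty list is treated as not ready.
func AllContainersReady(s PodStatus) bool {
	for _, c := range s.ContainerStatuses {
		if !c.Ready {
			return false
		}
	}
	return len(s.ContainerStatuses) > 0
}

func main() {
	// Values copied from the app-zookeeper-4 status in this issue.
	s := PodStatus{
		Conditions: []Condition{
			{Type: "Initialized", Status: "True"},
			{Type: "Ready", Status: "True"},
			{Type: "ContainersReady", Status: "True"},
			{Type: "PodScheduled", Status: "True"},
		},
		ContainerStatuses: []ContainerStatus{
			{Name: "fluent-bit", Ready: true},
			{Name: "zookeeper", Ready: true},
		},
	}
	fmt.Println(PodReady(s), AllContainersReady(s)) // prints "true true"
}
```

By both measures the pod counts as ready, which is what makes its placement under members.unready surprising.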