pravega / zookeeper-operator

Kubernetes Operator for Zookeeper
Apache License 2.0

Second ZK pod doesn't start due to 'java.lang.RuntimeException: My id 2 not in the peer list' exception #315

Closed BassinD closed 3 years ago

BassinD commented 3 years ago

Description

We are using zookeeper v0.2.9. Sometimes (not in all environments) the zookeeper-1 pod is unable to start due to a RuntimeException. The pod's log:

kubectl logs deployment-zookeeper-1
+ source /conf/env.sh
++ DOMAIN=deployment-zookeeper-headless.objectstore-large.svc.cluster.local
++ QUORUM_PORT=2888
++ LEADER_PORT=3888
++ CLIENT_HOST=deployment-zookeeper-client
++ CLIENT_PORT=9277
++ CLUSTER_NAME=deployment-zookeeper
++ CLUSTER_SIZE=3
+ source /usr/local/bin/zookeeperFunctions.sh
++ set -ex
++ hostname -s
+ HOST=deployment-zookeeper-1
+ DATA_DIR=/data
+ MYID_FILE=/data/myid
+ LOG4J_CONF=/conf/log4j-quiet.properties
+ DYNCONFIG=/data/zoo.cfg.dynamic
+ STATIC_CONFIG=/data/conf/zoo.cfg
+ [[ deployment-zookeeper-1 =~ (.*)-([0-9]+)$ ]]
+ NAME=deployment-zookeeper
+ ORD=1
+ MYID=2
+ WRITE_CONFIGURATION=true
+ REGISTER_NODE=true
+ ONDISK_MYID_CONFIG=false
+ ONDISK_DYN_CONFIG=false
+ '[' -f /data/myid ']'
++ cat /data/myid
+ EXISTING_ID=2
+ [[ 2 == \2 ]]
+ [[ -f /data/conf/zoo.cfg ]]
+ ONDISK_MYID_CONFIG=true
+ '[' -f /data/zoo.cfg.dynamic ']'
+ ONDISK_DYN_CONFIG=true
+ set +e
+ [[ -n '' ]]
+ set -e
+ set +e
+ nslookup deployment-zookeeper-headless.objectstore-large.svc.cluster.local
Server:     10.96.0.10
Address:    10.96.0.10#53

Name:   deployment-zookeeper-headless.objectstore-large.svc.cluster.local
Address: 192.168.55.175

+ [[ 0 -eq 0 ]]
+ ACTIVE_ENSEMBLE=true
+ [[ true == true ]]
+ [[ true == true ]]
+ WRITE_CONFIGURATION=false
+ [[ true == false ]]
+ [[ true == false ]]
+ [[ true == false ]]
+ REGISTER_NODE=false
+ [[ false == true ]]
+ [[ false == true ]]
+ ZOOCFGDIR=/data/conf
+ export ZOOCFGDIR
+ echo Copying /conf contents to writable directory, to support Zookeeper dynamic reconfiguration
Copying /conf contents to writable directory, to support Zookeeper dynamic reconfiguration
+ [[ ! -d /data/conf ]]
+ echo Copying the /conf/zoo.cfg contents except the dynamic config file during restart
Copying the /conf/zoo.cfg contents except the dynamic config file during restart
++ head -n -1 /conf/zoo.cfg
++ tail -n 1 /data/conf/zoo.cfg
+ echo -e '4lw.commands.whitelist=cons, envi, conf, crst, srvr, stat, mntr, ruok
dataDir=/data
standaloneEnabled=false
reconfigEnabled=true
skipACL=yes
metricsProvider.className=org.apache.zookeeper.metrics.prometheus.PrometheusMetricsProvider
metricsProvider.httpPort=7000
metricsProvider.exportJvmInfo=true
initLimit=10
syncLimit=2
tickTime=2000
quorumListenOnAllIPs=false\ndynamicConfigFile=/data/zoo.cfg.dynamic'
+ cp -f /conf/log4j.properties /data/conf
+ cp -f /conf/log4j-quiet.properties /data/conf
+ cp -f /conf/env.sh /data/conf
+ '[' -f /data/zoo.cfg.dynamic ']'
Starting zookeeper service
+ echo Starting zookeeper service
+ zkServer.sh --config /data/conf start-foreground
ZooKeeper JMX enabled by default
Using config: /data/conf/zoo.cfg
2021-03-29 13:33:55,679 [myid:] - INFO  [main:QuorumPeerConfig@173] - Reading configuration from: /data/conf/zoo.cfg
2021-03-29 13:33:55,686 [myid:] - INFO  [main:QuorumPeerConfig@450] - clientPort is not set
2021-03-29 13:33:55,686 [myid:] - INFO  [main:QuorumPeerConfig@463] - secureClientPort is not set
2021-03-29 13:33:55,686 [myid:] - INFO  [main:QuorumPeerConfig@479] - observerMasterPort is not set
2021-03-29 13:33:55,744 [myid:] - INFO  [main:QuorumPeerConfig@496] - metricsProvider.className is org.apache.zookeeper.metrics.prometheus.PrometheusMetricsProvider
2021-03-29 13:33:55,753 [myid:] - WARN  [main:QuorumPeerConfig@727] - No server failure will be tolerated. You need at least 3 servers.
2021-03-29 13:33:55,758 [myid:2] - INFO  [main:DatadirCleanupManager@78] - autopurge.snapRetainCount set to 3
2021-03-29 13:33:55,758 [myid:2] - INFO  [main:DatadirCleanupManager@79] - autopurge.purgeInterval set to 0
2021-03-29 13:33:55,758 [myid:2] - INFO  [main:DatadirCleanupManager@101] - Purge task is not scheduled.
2021-03-29 13:33:55,762 [myid:2] - INFO  [main:ManagedUtil@44] - Log4j 1.2 jmx support found and enabled.
2021-03-29 13:33:55,771 [myid:2] - INFO  [main:QuorumPeerMain@151] - Starting quorum peer
2021-03-29 13:33:55,850 [myid:2] - INFO  [main:PrometheusMetricsProvider@74] - Initializing metrics, configuration: {exportJvmInfo=true, httpPort=7000}
2021-03-29 13:33:55,850 [myid:2] - INFO  [main:PrometheusMetricsProvider@82] - Starting /metrics HTTP endpoint at port 7000 exportJvmInfo: true
2021-03-29 13:33:55,895 [myid:2] - INFO  [main:Log@169] - Logging initialized @1300ms to org.eclipse.jetty.util.log.Slf4jLog
2021-03-29 13:33:56,180 [myid:2] - INFO  [main:Server@359] - jetty-9.4.24.v20191120; built: 2019-11-20T21:37:49.771Z; git: 363d5f2df3a8a28de40604320230664b9c793c16; jvm 11.0.8+10
2021-03-29 13:33:56,280 [myid:2] - INFO  [main:ContextHandler@825] - Started o.e.j.s.ServletContextHandler@69c81773{/,null,AVAILABLE}
2021-03-29 13:33:56,356 [myid:2] - INFO  [main:AbstractConnector@330] - Started ServerConnector@771a660{HTTP/1.1,[http/1.1]}{0.0.0.0:7000}
2021-03-29 13:33:56,356 [myid:2] - INFO  [main:Server@399] - Started @1767ms
2021-03-29 13:33:56,368 [myid:2] - INFO  [main:ServerMetrics@62] - ServerMetrics initialized with provider org.apache.zookeeper.metrics.prometheus.PrometheusMetricsProvider@7d7758be
2021-03-29 13:33:56,449 [myid:2] - INFO  [main:QuorumPeer@752] - zookeeper.quorumCnxnTimeoutMs=-1
2021-03-29 13:33:56,463 [myid:2] - WARN  [main:ContextHandler@1520] - o.e.j.s.ServletContextHandler@a82c5f1{/,null,UNAVAILABLE} contextPath ends with /*
2021-03-29 13:33:56,463 [myid:2] - WARN  [main:ContextHandler@1531] - Empty contextPath
2021-03-29 13:33:56,467 [myid:2] - INFO  [main:X509Util@77] - Setting -D jdk.tls.rejectClientInitiatedRenegotiation=true to disable client-initiated TLS renegotiation
2021-03-29 13:33:56,470 [myid:2] - INFO  [main:FileTxnSnapLog@124] - zookeeper.snapshot.trust.empty : false
2021-03-29 13:33:56,473 [myid:2] - INFO  [main:QuorumPeer@1680] - Local sessions disabled
2021-03-29 13:33:56,473 [myid:2] - INFO  [main:QuorumPeer@1691] - Local session upgrading disabled
2021-03-29 13:33:56,474 [myid:2] - INFO  [main:QuorumPeer@1658] - tickTime set to 2000
2021-03-29 13:33:56,474 [myid:2] - INFO  [main:QuorumPeer@1702] - minSessionTimeout set to 4000
2021-03-29 13:33:56,474 [myid:2] - INFO  [main:QuorumPeer@1713] - maxSessionTimeout set to 40000
2021-03-29 13:33:56,497 [myid:2] - INFO  [main:QuorumPeer@1738] - initLimit set to 10
2021-03-29 13:33:56,497 [myid:2] - INFO  [main:QuorumPeer@1920] - syncLimit set to 2
2021-03-29 13:33:56,497 [myid:2] - INFO  [main:QuorumPeer@1935] - connectToLearnerMasterLimit set to 0
2021-03-29 13:33:56,553 [myid:2] - INFO  [main:ZookeeperBanner@42] - 
2021-03-29 13:33:56,553 [myid:2] - INFO  [main:ZookeeperBanner@42] -   ______                  _                                          
2021-03-29 13:33:56,553 [myid:2] - INFO  [main:ZookeeperBanner@42] -  |___  /                 | |                                         
2021-03-29 13:33:56,553 [myid:2] - INFO  [main:ZookeeperBanner@42] -     / /    ___     ___   | | __   ___    ___   _ __     ___   _ __   
2021-03-29 13:33:56,553 [myid:2] - INFO  [main:ZookeeperBanner@42] -    / /    / _ \   / _ \  | |/ /  / _ \  / _ \ | '_ \   / _ \ | '__|
2021-03-29 13:33:56,553 [myid:2] - INFO  [main:ZookeeperBanner@42] -   / /__  | (_) | | (_) | |   <  |  __/ |  __/ | |_) | |  __/ | |    
2021-03-29 13:33:56,553 [myid:2] - INFO  [main:ZookeeperBanner@42] -  /_____|  \___/   \___/  |_|\_\  \___|  \___| | .__/   \___| |_|
2021-03-29 13:33:56,553 [myid:2] - INFO  [main:ZookeeperBanner@42] -                                               | |                     
2021-03-29 13:33:56,553 [myid:2] - INFO  [main:ZookeeperBanner@42] -                                               |_|                     
2021-03-29 13:33:56,554 [myid:2] - INFO  [main:ZookeeperBanner@42] - 
2021-03-29 13:33:56,555 [myid:2] - INFO  [main:Environment@98] - Server environment:zookeeper.version=3.6.1--104dcb3e3fb464b30c5186d229e00af9f332524b, built on 04/21/2020 15:01 GMT
2021-03-29 13:33:56,555 [myid:2] - INFO  [main:Environment@98] - Server environment:host.name=deployment-zookeeper-1.deployment-zookeeper-headless.objectstore-large.svc.cluster.local
2021-03-29 13:33:56,555 [myid:2] - INFO  [main:Environment@98] - Server environment:java.version=11.0.8
2021-03-29 13:33:56,555 [myid:2] - INFO  [main:Environment@98] - Server environment:java.vendor=N/A
2021-03-29 13:33:56,556 [myid:2] - INFO  [main:Environment@98] - Server environment:java.home=/usr/local/openjdk-11
2021-03-29 13:33:56,556 [myid:2] - INFO  [main:Environment@98] - Server environment:java.class.path=/apache-zookeeper-3.6.1-bin/bin/../zookeeper-server/target/classes:/apache-zookeeper-3.6.1-bin/bin/../build/classes:/apache-zookeeper-3.6.1-bin/bin/../zookeeper-server/target/lib/*.jar:/apache-zookeeper-3.6.1-bin/bin/../build/lib/*.jar:/apache-zookeeper-3.6.1-bin/bin/../lib/zookeeper-prometheus-metrics-3.6.1.jar:/apache-zookeeper-3.6.1-bin/bin/../lib/zookeeper-jute-3.6.1.jar:/apache-zookeeper-3.6.1-bin/bin/../lib/zookeeper-3.6.1.jar:/apache-zookeeper-3.6.1-bin/bin/../lib/snappy-java-1.1.7.jar:/apache-zookeeper-3.6.1-bin/bin/../lib/slf4j-log4j12-1.7.25.jar:/apache-zookeeper-3.6.1-bin/bin/../lib/slf4j-api-1.7.25.jar:/apache-zookeeper-3.6.1-bin/bin/../lib/simpleclient_servlet-0.6.0.jar:/apache-zookeeper-3.6.1-bin/bin/../lib/simpleclient_hotspot-0.6.0.jar:/apache-zookeeper-3.6.1-bin/bin/../lib/simpleclient_common-0.6.0.jar:/apache-zookeeper-3.6.1-bin/bin/../lib/simpleclient-0.6.0.jar:/apache-zookeeper-3.6.1-bin/bin/../lib/netty-transport-native-unix-common-4.1.48.Final.jar:/apache-zookeeper-3.6.1-bin/bin/../lib/netty-transport-native-epoll-4.1.48.Final.jar:/apache-zookeeper-3.6.1-bin/bin/../lib/netty-transport-4.1.48.Final.jar:/apache-zookeeper-3.6.1-bin/bin/../lib/netty-resolver-4.1.48.Final.jar:/apache-zookeeper-3.6.1-bin/bin/../lib/netty-handler-4.1.48.Final.jar:/apache-zookeeper-3.6.1-bin/bin/../lib/netty-common-4.1.48.Final.jar:/apache-zookeeper-3.6.1-bin/bin/../lib/netty-codec-4.1.48.Final.jar:/apache-zookeeper-3.6.1-bin/bin/../lib/netty-buffer-4.1.48.Final.jar:/apache-zookeeper-3.6.1-bin/bin/../lib/metrics-core-3.2.5.jar:/apache-zookeeper-3.6.1-bin/bin/../lib/log4j-1.2.17.jar:/apache-zookeeper-3.6.1-bin/bin/../lib/json-simple-1.1.1.jar:/apache-zookeeper-3.6.1-bin/bin/../lib/jline-2.11.jar:/apache-zookeeper-3.6.1-bin/bin/../lib/jetty-util-9.4.24.v20191120.jar:/apache-zookeeper-3.6.1-bin/bin/../lib/jetty-servlet-9.4.24.v20191120.jar:/apache-zookeeper-3.6.1-bin/bin/../lib/jetty-server-9.4.24.v20191120.jar:/apache-zookeeper-3.6.1-bin/bin/../lib/jetty-security-9.4.24.v20191120.jar:/apache-zookeeper-3.6.1-bin/bin/../lib/jetty-io-9.4.24.v20191120.jar:/apache-zookeeper-3.6.1-bin/bin/../lib/jetty-http-9.4.24.v20191120.jar:/apache-zookeeper-3.6.1-bin/bin/../lib/javax.servlet-api-3.1.0.jar:/apache-zookeeper-3.6.1-bin/bin/../lib/jackson-databind-2.10.3.jar:/apache-zookeeper-3.6.1-bin/bin/../lib/jackson-core-2.10.3.jar:/apache-zookeeper-3.6.1-bin/bin/../lib/jackson-annotations-2.10.3.jar:/apache-zookeeper-3.6.1-bin/bin/../lib/commons-lang-2.6.jar:/apache-zookeeper-3.6.1-bin/bin/../lib/commons-cli-1.2.jar:/apache-zookeeper-3.6.1-bin/bin/../lib/audience-annotations-0.5.0.jar:/apache-zookeeper-3.6.1-bin/bin/../zookeeper-*.jar:/apache-zookeeper-3.6.1-bin/bin/../zookeeper-server/src/main/resources/lib/*.jar:/data/conf:
2021-03-29 13:33:56,556 [myid:2] - INFO  [main:Environment@98] - Server environment:java.library.path=/usr/java/packages/lib:/usr/lib64:/lib64:/lib:/usr/lib
2021-03-29 13:33:56,556 [myid:2] - INFO  [main:Environment@98] - Server environment:java.io.tmpdir=/tmp
2021-03-29 13:33:56,556 [myid:2] - INFO  [main:Environment@98] - Server environment:java.compiler=<NA>
2021-03-29 13:33:56,556 [myid:2] - INFO  [main:Environment@98] - Server environment:os.name=Linux
2021-03-29 13:33:56,556 [myid:2] - INFO  [main:Environment@98] - Server environment:os.arch=amd64
2021-03-29 13:33:56,556 [myid:2] - INFO  [main:Environment@98] - Server environment:os.version=4.15.0-76-generic
2021-03-29 13:33:56,556 [myid:2] - INFO  [main:Environment@98] - Server environment:user.name=root
2021-03-29 13:33:56,556 [myid:2] - INFO  [main:Environment@98] - Server environment:user.home=/root
2021-03-29 13:33:56,557 [myid:2] - INFO  [main:Environment@98] - Server environment:user.dir=/apache-zookeeper-3.6.1-bin
2021-03-29 13:33:56,557 [myid:2] - INFO  [main:Environment@98] - Server environment:os.memory.free=7MB
2021-03-29 13:33:56,557 [myid:2] - INFO  [main:Environment@98] - Server environment:os.memory.max=966MB
2021-03-29 13:33:56,557 [myid:2] - INFO  [main:Environment@98] - Server environment:os.memory.total=14MB
2021-03-29 13:33:56,557 [myid:2] - INFO  [main:ZooKeeperServer@128] - zookeeper.enableEagerACLCheck = false
2021-03-29 13:33:56,557 [myid:2] - INFO  [main:ZooKeeperServer@132] - zookeeper.skipACL=="yes", ACL checks will be skipped
2021-03-29 13:33:56,557 [myid:2] - INFO  [main:ZooKeeperServer@136] - zookeeper.digest.enabled = true
2021-03-29 13:33:56,557 [myid:2] - INFO  [main:ZooKeeperServer@140] - zookeeper.closeSessionTxn.enabled = true
2021-03-29 13:33:56,557 [myid:2] - INFO  [main:ZooKeeperServer@1434] - zookeeper.flushDelay=0
2021-03-29 13:33:56,558 [myid:2] - INFO  [main:ZooKeeperServer@1443] - zookeeper.maxWriteQueuePollTime=0
2021-03-29 13:33:56,558 [myid:2] - INFO  [main:ZooKeeperServer@1452] - zookeeper.maxBatchSize=1000
2021-03-29 13:33:56,558 [myid:2] - INFO  [main:ZooKeeperServer@241] - zookeeper.intBufferStartingSizeBytes = 1024
2021-03-29 13:33:56,563 [myid:2] - INFO  [main:WatchManagerFactory@42] - Using org.apache.zookeeper.server.watch.WatchManager as watch manager
2021-03-29 13:33:56,563 [myid:2] - INFO  [main:WatchManagerFactory@42] - Using org.apache.zookeeper.server.watch.WatchManager as watch manager
2021-03-29 13:33:56,565 [myid:2] - INFO  [main:ZKDatabase@132] - zookeeper.snapshotSizeFactor = 0.33
2021-03-29 13:33:56,565 [myid:2] - INFO  [main:ZKDatabase@152] - zookeeper.commitLogCount=500
2021-03-29 13:33:56,578 [myid:2] - INFO  [main:QuorumPeer@2001] - Using insecure (non-TLS) quorum communication
2021-03-29 13:33:56,578 [myid:2] - INFO  [main:QuorumPeer@2007] - Port unification disabled
2021-03-29 13:33:56,578 [myid:2] - INFO  [main:QuorumPeer@174] - multiAddress.enabled set to false
2021-03-29 13:33:56,578 [myid:2] - INFO  [main:QuorumPeer@199] - multiAddress.reachabilityCheckEnabled set to true
2021-03-29 13:33:56,578 [myid:2] - INFO  [main:QuorumPeer@186] - multiAddress.reachabilityCheckTimeoutMs set to 1000
2021-03-29 13:33:56,578 [myid:2] - INFO  [main:QuorumPeer@2461] - QuorumPeer communication is not secured! (SASL auth disabled)
2021-03-29 13:33:56,579 [myid:2] - INFO  [main:QuorumPeer@2486] - quorum.cnxn.threads.size set to 20
2021-03-29 13:33:56,585 [myid:2] - INFO  [main:AbstractConnector@380] - Stopped ServerConnector@771a660{HTTP/1.1,[http/1.1]}{0.0.0.0:7000}
2021-03-29 13:33:56,587 [myid:2] - INFO  [main:ContextHandler@1016] - Stopped o.e.j.s.ServletContextHandler@69c81773{/,null,UNAVAILABLE}
2021-03-29 13:33:56,645 [myid:2] - ERROR [main:QuorumPeerMain@113] - Unexpected exception, exiting abnormally
java.lang.RuntimeException: My id 2 not in the peer list
    at org.apache.zookeeper.server.quorum.QuorumPeer.start(QuorumPeer.java:1073)
    at org.apache.zookeeper.server.quorum.QuorumPeerMain.runFromConfig(QuorumPeerMain.java:227)
    at org.apache.zookeeper.server.quorum.QuorumPeerMain.initializeAndRun(QuorumPeerMain.java:136)
    at org.apache.zookeeper.server.quorum.QuorumPeerMain.main(QuorumPeerMain.java:90)
2021-03-29 13:33:56,647 [myid:2] - INFO  [main:ZKAuditProvider@42] - ZooKeeper audit is disabled.
2021-03-29 13:33:56,649 [myid:2] - ERROR [main:ServiceUtils@42] - Exiting JVM with code 1
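
For context, the startup script derives the ZooKeeper server id from the pod ordinal, as the trace above shows. A simplified sketch of that logic (illustrative, not the verbatim zookeeperStart.sh):

HOST=$(hostname -s)                  # e.g. deployment-zookeeper-1
if [[ $HOST =~ (.*)-([0-9]+)$ ]]; then
    NAME=${BASH_REMATCH[1]}          # deployment-zookeeper
    ORD=${BASH_REMATCH[2]}           # 1
fi
MYID=$((ORD + 1))                    # ordinal 1 -> myid 2
echo "$MYID" > /data/myid

So the exception means the dynamic configuration this peer loaded contains no server.2 entry, even though its myid is 2.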

Zookeeper cluster CRD description

kubectl describe zookeepercluster deployment-zookeeper
Name:         deployment-zookeeper
Namespace:    objectstore-large
Labels:       app=zookeeper
              component=zk
              release=deployment
Annotations:  <none>
API Version:  zookeeper.pravega.io/v1beta1
Kind:         ZookeeperCluster
Metadata:
  Creation Timestamp:  2021-03-29T13:31:16Z
  Finalizers:
    cleanUpZookeeperPVC
  Generation:  3
    Manager:      objectscale-operator
    Operation:    Update
    Time:         2021-03-29T13:31:17Z
    API Version:  zookeeper.pravega.io/v1beta1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:finalizers:
      f:spec:
        f:ports:
      f:status:
        .:
        f:conditions:
        f:externalClientEndpoint:
        f:internalClientEndpoint:
        f:members:
          .:
          f:ready:
          f:unready:
        f:readyReplicas:
        f:replicas:
    Manager:    zookeeper-operator
    Operation:  Update
    Time:       2021-03-29T14:50:03Z
  Resource Version:        10195029
  Self Link:               /apis/zookeeper.pravega.io/v1beta1/namespaces/objectstore-large/zookeeperclusters/deployment-zookeeper
  UID:                     903303a2-80e1-47bf-94a4-a4cf8cae3f3f
Spec:
  Admin Server Service:
  Client Service:
  Config:
    Auto Purge Purge Interval:     1
    Auto Purge Snap Retain Count:  3
    Commit Log Count:              500
    Global Outstanding Limit:      1000
    Init Limit:                    10
    Max Client Cnxns:              60
    Max Session Timeout:           40000
    Min Session Timeout:           4000
    Pre Alloc Size:                65536
    Snap Count:                    10000
    Snap Size Limit In Kb:         4194304
    Sync Limit:                    2
    Tick Time:                     2000
  Headless Service:
  Image:
    Pull Policy:  IfNotPresent
    Repository:   pravega/zookeeper
    Tag:          0.2.9
  Labels:
    App:        zookeeper
    Component:  zk
    Release:    deployment
  Persistence:
    Reclaim Policy:  Delete
    Spec:
      Access Modes:
        ReadWriteOnce
      Resources:
        Requests:
          Storage:         20Gi
      Storage Class Name:  csi-baremetal-sc-hddlvg
  Pod:
    Affinity:
      Pod Anti Affinity:
        Preferred During Scheduling Ignored During Execution:
          Pod Affinity Term:
            Label Selector:
              Match Expressions:
                Key:       app
                Operator:  In
                Values:
                  deployment-zookeeper
            Topology Key:  kubernetes.io/hostname
          Weight:          20
    Labels:
      App:      deployment-zookeeper
      Release:  deployment-zookeeper
    Resources:
      Limits:
        Cpu:     1
        Memory:  256M
      Requests:
        Cpu:                           500m
        Memory:                        128M
    Service Account Name:              default
    Termination Grace Period Seconds:  30
  Ports:
    Container Port:  9277
    Name:            client
    Container Port:  2888
    Name:            quorum
    Container Port:  3888
    Name:            leader-election
    Container Port:  7000
    Name:            metrics
    Container Port:  8080
    Name:            admin-server
  Probes:            <nil>
  Replicas:          3
  Storage Type:      persistence
Status:
  Conditions:
    Status:                  False
    Type:                    Upgrading
    Status:                  False
    Type:                    PodsReady
    Status:                  False
    Type:                    Error
  External Client Endpoint:  N/A
  Internal Client Endpoint:  10.102.68.149:9277
  Members:
    Ready:
      deployment-zookeeper-0
    Unready:
      deployment-zookeeper-1
  Ready Replicas:  1
  Replicas:        2
Events:            <none>

Previously we used ZK v0.2.7 and did not see this issue. I also tried the fix described in issue #259, but it didn't help.

Importance

Blocker issue. We need some fixes related to the 0.2.9 version (https://github.com/pravega/zookeeper-operator/issues/257), so the upgrade is required.

Location

(Where is the piece of code, package, or document affected by this issue?)

Suggestions for an improvement

(How do you suggest to fix or proceed with this issue?)

priyavj08 commented 3 years ago

@BassinD I have been hit by this issue for quite some time, with no resolution so far. This is a critical issue for me as well. It happens intermittently, even in our production environment.

From what I understand, this failure can happen when the promotion of a node from observer to participant fails in zookeeperReady.sh, which is run as part of the readiness probe. Sometimes I get the error below:

++ echo ruok
++ nc 127.0.0.1 2181
(UNKNOWN) [127.0.0.1] 2181 (?) : Connection refused

Could this mean the ZK server is not ready to accept connections? I was wondering whether adding an initial delay to the readiness probe would help, so that "ruok" requests are sent only after the ZK server has (hopefully) started and is running.
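
For reference, the readiness check essentially boils down to ZooKeeper's ruok four-letter-word command; a simplified sketch of what zookeeperReady.sh does (the real script also handles the promotion step mentioned above):

# A healthy server answers "imok"; anything else (or a refused
# connection, as in the output above) means the pod is not ready yet.
if [[ "$(echo ruok | nc 127.0.0.1 2181)" == "imok" ]]; then
    exit 0
else
    exit 1
fi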

I had requested a fix to make these probes configurable in #275. Any idea when the next release of the ZK operator will be, so we can get these fixes?

priyavj08 commented 3 years ago

hi, when will this fix be available?

thanks

anishakj commented 3 years ago

@priyavj08 The fix is available in master and will be part of the next release. If you want to use the build now, you can git clone the code from master, build with make build-zk-image, and use that image for your current testing.
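
Roughly (the retag/push steps are illustrative and depend on your registry; they are not taken from the Makefile):

git clone https://github.com/pravega/zookeeper-operator.git
cd zookeeper-operator
make build-zk-image
# Retag and push the resulting image to your registry, then point
# spec.image.repository / spec.image.tag of the ZookeeperCluster at it:
docker tag pravega/zookeeper:latest <your-registry>/zookeeper:dev
docker push <your-registry>/zookeeper:dev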

priyavj08 commented 3 years ago

thanks @anishakj

any ETA on next release so I can plan?

thanks

anishakj commented 3 years ago

@priyavj08 We are planning to do a release sometime this week.

priyavj08 commented 3 years ago

thanks for your continued support @anishakj @amuraru

I pulled the latest code from master, built the ZK image, and ran continued install/uninstall tests; about 20-odd iterations were fine, then it failed at the 25th. This issue still exists, and unfortunately it happens most of the time in production (Murphy's law coming into play).

When I was able to exec into the failing pod's container I found:

cat /data/myid
2

cat /data/conf/zoo.cfg
4lw.commands.whitelist=cons, envi, conf, crst, srvr, stat, mntr, ruok
dataDir=/data
standaloneEnabled=false
reconfigEnabled=true
skipACL=yes
metricsProvider.className=org.apache.zookeeper.metrics.prometheus.PrometheusMetricsProvider
metricsProvider.httpPort=7000
metricsProvider.exportJvmInfo=true
initLimit=10
syncLimit=2
tickTime=2000
globalOutstandingLimit=1000
preAllocSize=65536
snapCount=10000
commitLogCount=500
snapSizeLimitInKb=4194304
maxCnxns=0
maxClientCnxns=60
minSessionTimeout=4000
maxSessionTimeout=40000
autopurge.snapRetainCount=3
autopurge.purgeInterval=1
quorumListenOnAllIPs=false
admin.serverPort=8080
dynamicConfigFile=/data/conf/zoo.cfg.dynamic.1000000b4

cat /data/conf/zoo.cfg.dynamic.1000000b4
server.1=fed-kafka-affirmedzk-0.fed-kafka-affirmedzk-headless.fed-kafka.svc.cluster.local:2888:3888:participant;0.0.0.0:21

The dynamic config only lists server.1 while this pod's myid is 2, so the server really is missing from its own peer list. It seems to be a timing issue; please make this part of the code in zookeeperStart.sh more reliable.
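
For illustration, one way to make that step more tolerant would be to retry the ensemble lookup instead of deciding after a single attempt — a sketch only, not the actual fix that landed:

# Retry the headless-service lookup a few times before concluding
# there is no active ensemble ($DOMAIN as in the traces above).
ACTIVE_ENSEMBLE=false
for attempt in 1 2 3 4 5; do
    if nslookup "$DOMAIN" >/dev/null 2>&1; then
        ACTIVE_ENSEMBLE=true
        break
    fi
    sleep 2
done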

thanks

anishakj commented 3 years ago

@priyavj08 please provide the logs from zookeeper-1 before it performs a restart

pbelgundi commented 3 years ago

Reopening issue since the problem is still not fixed.

priyavj08 commented 3 years ago

I am looking to reproduce this to get the first log from the zk-1 pod when it crashes, but in the meantime, here is the log from the ZK-1 pod: ZK1-log.txt

Also, here is the describe output from pod ZK-0; note the connection refused from the probes:

kc describe pod -n fed-kafka fed-kafka-affirmedzk-0
Name:         fed-kafka-affirmedzk-0
Namespace:    fed-kafka
Priority:     0
Node:         priya-vijaya-tx-k8-node-2-cxkrlz-0zdmawqhtomva4wa/10.163.66.203
Start Time:   Tue, 20 Apr 2021 06:16:05 +0000
Labels:       app=fed-kafka-affirmedzk
              controller-revision-hash=fed-kafka-affirmedzk-545b7fbd4
              kind=ZookeeperMember
              release=fed-kafka-affirmedzk
              statefulset.kubernetes.io/pod-name=fed-kafka-affirmedzk-0
Annotations:  cni.projectcalico.org/podIP: 192.168.173.46/32
              cni.projectcalico.org/podIPs: 192.168.173.46/32,fde6:7f0d:5c6c:ad36:56d7:55a8:2f31:c652/128
              kubernetes.io/psp: permissive-network
Status:       Running
IP:           192.168.173.46
IPs:
  IP:  192.168.173.46
Controlled By:  StatefulSet/fed-kafka-affirmedzk
Containers:
  zookeeper:
    Container ID:  docker://a9b7dc72376f2849148f313ad4d34d55bc51c9dce6ec9d66e6c8ec62268f53b2
    Image:         cnreg-dev:5000/priya_vijaya/affirmed/zookeeper:0.2.12
    Image ID:      docker-pullable://cnreg-dev:5000/priya_vijaya/affirmed/zookeeper@sha256:2207ced4485ad175e6dc1ece88d44a43238db77b20e6fee543decd4d29f500e6
    Ports:         2181/TCP, 2888/TCP, 3888/TCP, 7000/TCP, 8080/TCP
    Host Ports:    0/TCP, 0/TCP, 0/TCP, 0/TCP, 0/TCP
    Command:
      /usr/local/bin/zookeeperStart.sh
    State:          Running
      Started:      Tue, 20 Apr 2021 06:16:06 +0000
    Ready:          True
    Restart Count:  0
    Limits:
      cpu:     100m
      memory:  1536Mi
    Requests:
      cpu:     10m
      memory:  1Gi
    Liveness:   exec [zookeeperLive.sh] delay=10s timeout=10s period=10s #success=1 #failure=3
    Readiness:  exec [zookeeperReady.sh] delay=10s timeout=10s period=10s #success=1 #failure=3
    Environment:
      ENVOY_SIDECAR_STATUS:  (v1:metadata.annotations['sidecar.istio.io/status'])
    Mounts:
      /conf from conf (rw)
      /data from data (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from zookeeper-sa-token-lfl4l (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             True
  ContainersReady   True
  PodScheduled      True
Volumes:
  data:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  data-fed-kafka-affirmedzk-0
    ReadOnly:   false
  conf:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      fed-kafka-affirmedzk-configmap
    Optional:  false
  zookeeper-sa-token-lfl4l:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  zookeeper-sa-token-lfl4l
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason     Age    From                                                        Message
  ----     ------     ----   ----                                                        -------
  Normal   Scheduled  5m44s  default-scheduler                                           Successfully assigned fed-kafka/fed-kafka-affirmedzk-0 to priya-vijaya-tx-k8-node-2-cxkrlz-0zdmawqhtomva4wa
  Normal   Pulled     5m43s  kubelet, priya-vijaya-tx-k8-node-2-cxkrlz-0zdmawqhtomva4wa  Container image "cnreg-dev:5000/priya_vijaya/affirmed/zookeeper:0.2.12" already present on machine
  Normal   Created    5m43s  kubelet, priya-vijaya-tx-k8-node-2-cxkrlz-0zdmawqhtomva4wa  Created container zookeeper
  Normal   Started    5m43s  kubelet, priya-vijaya-tx-k8-node-2-cxkrlz-0zdmawqhtomva4wa  Started container zookeeper
  Warning  Unhealthy  5m31s  kubelet, priya-vijaya-tx-k8-node-2-cxkrlz-0zdmawqhtomva4wa  Liveness probe failed: + source /conf/env.sh ++ DOMAIN=fed-kafka-affirmedzk-headless.fed-kafka.svc.cluster.local ++ QUORUM_PORT=2888 ++ LEADER_PORT=3888 ++ CLIENT_HOST=fed-kafka-affirmedzk-client ++ CLIENT_PORT=2181 ++ ADMIN_SERVER_HOST=fed-kafka-affirmedzk-admin-server ++ ADMIN_SERVER_PORT=8080 ++ CLUSTER_NAME=fed-kafka-affirmedzk ++ CLUSTER_SIZE=3 ++ echo ruok ++ nc 127.0.0.1 2181 (UNKNOWN) [127.0.0.1] 2181 (?) : Connection refused

priyavj08 commented 3 years ago

In my recent tests of repeated install/uninstall, it got into a bad state after the 18th iteration.

For some reason ZK-1 failed to join the ensemble, and ZK-2 is in a CrashLoopBackOff state (but without the 'My id 2 not in the peer list' error).

attaching all the logs

failure-zk0.log failure-zk1.log failure-zk2.log

attaching the pod describe output

desc-output.log

anishakj commented 3 years ago

@priyavj08 Could you please confirm that the base zookeeper image version used is 3.6.1?

anishakj commented 3 years ago

@priyavj08 From the failure-zk1.log, I can see a connectivity issue between the second node and the first one.

2021-04-20 10:30:23,206 [myid:2] - WARN  [RecvWorker:1:QuorumCnxManager$RecvWorker@1395] - Connection broken for id 1, my id = 2
java.io.EOFException
    at java.base/java.io.DataInputStream.readInt(Unknown Source)
    at org.apache.zookeeper.server.quorum.QuorumCnxManager$RecvWorker.run(QuorumCnxManager.java:1383)
2021-04-20 10:30:23,208 [myid:2] - WARN  [RecvWorker:1:QuorumCnxManager$RecvWorker@1401] - Interrupting SendWorker thread from RecvWorker. sid: 1. myId: 2
2021-04-20 10:30:23,209 [myid:2] - WARN  [SendWorker:1:QuorumCnxManager$SendWorker@1281] - Interrupted while waiting for message on queue
java.lang.InterruptedException
    at java.base/java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(Unknown Source)
    at java.base/java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(Unknown Source)
    at org.apache.zookeeper.util.CircularBlockingQueue.poll(CircularBlockingQueue.java:105)
    at org.apache.zookeeper.server.quorum.QuorumCnxManager.pollSendQueue(QuorumCnxManager.java:1446)
    at org.apache.zookeeper.server.quorum.QuorumCnxManager.access$900(QuorumCnxManager.java:98)
    at org.apache.zookeeper.server.quorum.QuorumCnxManager$SendWorker.run(QuorumCnxManager.java:1270)
2021-04-20 10:30:23,210 [myid:2] - WARN  [SendWorker:1:QuorumCnxManager$SendWorker@1293] - Send worker leaving thread id 1 my id = 2
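
A quick way to check raw reachability of the quorum and leader-election ports between the peers (a sketch; the hostname follows the headless-service pattern in these logs):

# Run from inside the zk-1 pod: probe zk-0's quorum (2888) and
# leader-election (3888) ports.
PEER=fed-kafka-affirmedzk-0.fed-kafka-affirmedzk-headless.fed-kafka.svc.cluster.local
for PORT in 2888 3888; do
    nc -z -w 2 "$PEER" "$PORT" && echo "port $PORT reachable" || echo "port $PORT unreachable"
done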

priyavj08 commented 3 years ago

I ran this inside the ZK pod; I am using build 3.6.1:

Zookeeper version: 3.6.1--104dcb3e3fb464b30c5186d229e00af9f332524b, built on 04/21/2020 15:01 GMT

priyavj08 commented 3 years ago

With the latest ZK image and initialDelaySeconds set to "60" for the readiness probe, I was able to run install/uninstall tests continuously, in 2 sets of 30 iterations. I haven't seen pod 1/pod 2 crashing with the error "My id 2 not in the peer list".

Question: will setting 60 seconds as the initial delay for the readiness probe have any impact on the functionality of the ZK ensemble?

Another concern: during the first set of tests I saw the other issue, where the ZK server in pod 0 wasn't running and this caused an issue in ZK pod 1:

2021-04-20 11:34:50,819 [myid:1] - WARN [NIOWorkerThread-2:NIOServerCnxn@373] - Close of session 0x0
java.io.IOException: ZooKeeperServer not running
    at org.apache.zookeeper.server.NIOServerCnxn.readLength(NIOServerCnxn.java:544)
    at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:332)
    at org.apache.zookeeper.server.NIOServerCnxnFactory$IOWorkRequest.doWork(NIOServerCnxnFactory.java:522)
    at org.apache.zookeeper.server.WorkerService$ScheduledWorkRequest.run(WorkerService.java:154)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.base/java.lang.Thread.run(Unknown Source)

anishakj commented 3 years ago

In a zookeeper cluster deployment, the second pod starts only after the first pod is up and running. Is that not happening in your case?

Also, initialDelaySeconds only ensures that the probes start after that time; other than that it won't have any effect.
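
With the configurable-probes change from #275, the delay can also be set on the CR itself. A hedged sketch (the field names are assumed from that change; verify them against your operator version):

kubectl patch zookeepercluster deployment-zookeeper -n objectstore-large --type merge \
  -p '{"spec":{"probes":{"readinessProbe":{"initialDelaySeconds":30}}}}'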

priyavj08 commented 3 years ago

It looks like the ZK-0 pod was in Running state even though the ZK server wasn't running, and the logs show that error; similarly, the ZK-1 pod was in Running state with errors. You can see the pods' status in the desc-output I attached.

anishakj commented 3 years ago

Are you seeing these errors with the increased initial delay?

priyavj08 commented 3 years ago

@anishakj Yes, it happened in the set of tests I did after adding the initial delay, but it is also not seen every time.

anishakj commented 3 years ago

By default initialDelaySeconds is 10. Could you please set it to 30 and give it a try?

priyavj08 commented 3 years ago

The ZK-1 pod got into CrashLoopBackOff a couple of times but eventually worked; here is the log. This is the new fix, right?

+ source /conf/env.sh
++ DOMAIN=fed-kafka-affirmedzk-headless.fed-kafka.svc.cluster.local
++ QUORUM_PORT=2888
++ LEADER_PORT=3888
++ CLIENT_HOST=fed-kafka-affirmedzk-client
++ CLIENT_PORT=2181
++ ADMIN_SERVER_HOST=fed-kafka-affirmedzk-admin-server
++ ADMIN_SERVER_PORT=8080
++ CLUSTER_NAME=fed-kafka-affirmedzk
++ CLUSTER_SIZE=3
+ source /usr/local/bin/zookeeperFunctions.sh
++ set -ex
++ hostname -s
+ HOST=fed-kafka-affirmedzk-1
+ DATA_DIR=/data
+ MYID_FILE=/data/myid
+ LOG4J_CONF=/conf/log4j-quiet.properties
+ DYNCONFIG=/data/zoo.cfg.dynamic
+ STATIC_CONFIG=/data/conf/zoo.cfg
+ [[ fed-kafka-affirmedzk-1 =~ (.*)-([0-9]+)$ ]]
+ NAME=fed-kafka-affirmedzk
+ ORD=1
+ MYID=2
+ WRITE_CONFIGURATION=true
+ REGISTER_NODE=true
+ ONDISK_MYID_CONFIG=false
+ ONDISK_DYN_CONFIG=false
+ '[' -f /data/myid ']'
++ cat /data/myid
+ EXISTING_ID=2
+ [[ 2 == \2 ]]
+ [[ -f /data/conf/zoo.cfg ]]
+ ONDISK_MYID_CONFIG=true
+ '[' -f /data/zoo.cfg.dynamic ']'
+ set +e
+ [[ -n '' ]]
+ set -e
+ set +e
+ nslookup fed-kafka-affirmedzk-headless.fed-kafka.svc.cluster.local
+ [[ 1 -eq 0 ]]
+ nslookup fed-kafka-affirmedzk-headless.fed-kafka.svc.cluster.local
+ grep -q 'server can'''t find fed-kafka-affirmedzk-headless.fed-kafka.svc.cluster.local'
Server:     10.96.0.10
Address:    10.96.0.10#53

** server can't find fed-kafka-affirmedzk-headless.fed-kafka.svc.cluster.local: NXDOMAIN

+ echo 'there is no active ensemble'
+ ACTIVE_ENSEMBLE=false
+ [[ true == true ]]
+ [[ false == true ]]
+ WRITE_CONFIGURATION=true
+ [[ false == false ]]
+ REGISTER_NODE=false
+ [[ true == true ]]
+ echo 'Writing myid: 2 to: /data/myid.'
+ echo 2
+ [[ 2 -eq 1 ]]
+ [[ false == true ]]
+ ZOOCFGDIR=/data/conf
+ export ZOOCFGDIR
+ echo Copying /conf contents to writable directory, to support Zookeeper dynamic reconfiguration
+ [[ ! -d /data/conf ]]
+ echo Copying the /conf/zoo.cfg contents except the dynamic config file during restart
++ head -n -1 /conf/zoo.cfg
there is no active ensemble
Writing myid: 2 to: /data/myid.
Copying /conf contents to writable directory, to support Zookeeper dynamic reconfiguration
Copying the /conf/zoo.cfg contents except the dynamic config file during restart
++ tail -n 1 /data/conf/zoo.cfg
+ echo -e '4lw.commands.whitelist=cons, envi, conf, crst, srvr, stat, mntr, ruok
dataDir=/data
standaloneEnabled=false
reconfigEnabled=true
skipACL=yes
metricsProvider.className=org.apache.zookeeper.metrics.prometheus.PrometheusMetricsProvider
metricsProvider.httpPort=7000
metricsProvider.exportJvmInfo=true
initLimit=10
syncLimit=2
tickTime=2000
globalOutstandingLimit=1000
preAllocSize=65536
snapCount=10000
commitLogCount=500
snapSizeLimitInKb=4194304
maxCnxns=0
maxClientCnxns=60
minSessionTimeout=4000
maxSessionTimeout=40000
autopurge.snapRetainCount=3
autopurge.purgeInterval=1
quorumListenOnAllIPs=false
admin.serverPort=8080\ndynamicConfigFile=/data/zoo.cfg.dynamic'
+ cp -f /conf/log4j.properties /data/conf
+ cp -f /conf/log4j-quiet.properties /data/conf
+ cp -f /conf/env.sh /data/conf
+ '[' -f /data/zoo.cfg.dynamic ']'
+ echo 'Node failed to register!'
Node failed to register!
+ exit 1

fed-kafka-affirmedzk-0                  1/1     Running   0     122m
fed-kafka-affirmedzk-1                  1/1     Running   2     121m
fed-kafka-affirmedzk-2                  1/1     Running   0     117m
fed-kafka-affirmedzk-5bd966c46d-5j6hh   1/1     Running   0     122m

anishakj commented 3 years ago

Yes, with this fix the pod will become ready on restart. If it is working for you after a couple of pod restarts, could you please close the issue?

priyavj08 commented 3 years ago

So far it is working fine. Please close this bug.

anishakj commented 3 years ago

@priyavj08 Thanks for the confirmation. Please feel free to reopen if you see the same issue again.

iampranabroy commented 2 years ago

Did anyone try setting initialDelaySeconds when using the zookeeper helm chart? I don't see that option in the pravega/zookeeper-operator chart v0.2.12 or later. @priyavj08 @anishakj

alexangas commented 2 years ago

@iampranabroy Does this work for you: https://github.com/pravega/zookeeper-operator/blob/2403ac54739d55a9e97333a837a45fea9dc0a96c/charts/zookeeper/values.yaml#L25
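
If the chart exposes it there, a minimal override might look like this (the repo alias and key names are assumptions; verify them against your chart version):

helm upgrade --install zookeeper pravega/zookeeper \
  --set probes.readiness.initialDelaySeconds=60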

mmoscher commented 2 years ago

I'm facing the same issue. @anishakj, even with a long initialDelaySeconds (20, 30, 60) for the liveness and readiness probes, I'm unable to successfully deploy a zookeeper cluster with replicas > 1. Any further hint on what could be going wrong?

The issue stays the same:

2022-09-26 15:14:06,975 [myid:2] - ERROR [main:QuorumPeerMain@114] - Unexpected exception, exiting abnormally
java.lang.RuntimeException: My id 2 not in the peer list
    at org.apache.zookeeper.server.quorum.QuorumPeer.start(QuorumPeer.java:1128)
    at org.apache.zookeeper.server.quorum.QuorumPeerMain.runFromConfig(QuorumPeerMain.java:229)
    at org.apache.zookeeper.server.quorum.QuorumPeerMain.initializeAndRun(QuorumPeerMain.java:137)
    at org.apache.zookeeper.server.quorum.QuorumPeerMain.main(QuorumPeerMain.java:91)

The second replica tries to start and ends up in a CrashLoopBackOff.

The Zookeeper-Operator is deployed indirectly via the Solr-Operator, version 0.2.14.

iampranabroy commented 2 years ago

Hey @mmoscher - can you please try setting only the readiness probe's initial delay to 60 (probes.readiness.initialDelaySeconds), as suggested above, and see if it works?

@priyavj08 @anishakj - Did you try any additional changes?

mmoscher commented 2 years ago

Issue solved - at least for me, I think.

There were two distinct issues (on my side) that kept me running into the error mentioned above. Reading this thread led me in a (somewhat) wrong direction, probing/testing the wrong solution.

Issue 1 - wrong network policies
My deployment blocks all egress/ingress traffic with network policies unless it is explicitly allowed. For zookeeper internal traffic, i.e. zookeeper <-> zookeeper, I had one policy in place allowing port 2181 (client connection) only. However, for leader election this is not enough. After opening ports 2888 (quorum), 3888 (leader election), 7000 (metrics) and 8080 (admin), a fresh (new, unique) zookeeper cluster was able to bootstrap in the same namespace :facepalm: However, my solr-cloud-related zookeeper cluster kept failing, even after deleting the zookeeper cluster resource and automatic recreation by the solr-operator.

Issue 2 - old state/wrong data on PVCs
My zookeeper and solr keep their PVCs after deletion and reuse them if redeployed (due to dataStorage.persistent.reclaimPolicy: "Retain"). After dropping all related PVCs and bootstrapping a fresh solrcloud cluster, the zookeeper cluster bootstraps successfully :thinking: Furthermore, this explains why I was able to successfully deploy a zookeeper cluster in my local dev environment, as we do not use persistent storage during development. It seems as if zookeeper stores some data about its id/state on disk, which can lead to later failures if it is corrupt and not cleaned up correctly. Unfortunately, I did not inspect the volumes and their data. Maybe @priyavj08 and/or @anishakj have a clue/guess on this.

TL;DR: after setting correct network policies and cleaning up old (corrupt) data, i.e. all zookeeper-related PVCs, I was able to successfully bootstrap a zookeeper cluster (4 times in a row, same namespace and same k8s cluster). No need to adjust the liveness or readiness probes. A cleanup sketch follows below.
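
A hedged sketch of that cleanup (namespace, pod, and cluster names are placeholders; the PVC names follow the data-<cluster>-<ordinal> pattern seen earlier in this thread):

# Inspect the persisted id/config inside a member pod first
kubectl exec -n <namespace> <cluster>-1 -- cat /data/myid
kubectl exec -n <namespace> <cluster>-1 -- cat /data/conf/zoo.cfg

# Then delete the retained PVCs so the next deployment starts clean
kubectl delete pvc -n <namespace> data-<cluster>-0 data-<cluster>-1 data-<cluster>-2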

@iampranabroy hopefully these tips help you!

iampranabroy commented 2 years ago

Thanks much, @mmoscher. Is it possible for you to share the updated network policy that you applied for zookeeper <-> zookeeper internal traffic?

mmoscher commented 2 years ago

@iampranabroy sure, here we go:

We have two policies in place related to solr/zookeeper: one (a) to allow traffic between the zookeeper members themselves (z<->z), and another (b) to allow traffic from solr to the zookeeper pods (s->z). Note: we block all egress traffic from all pods by default, following the "default-deny-all-egress-traffic" principle. If you're doing it the other way around, e.g. blocking ingress traffic, you need to change the policies accordingly. Furthermore, the solr instances accessing your zookeeper pods need to have the custom label allow-zookeeper-access: "true" set.

a)

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-zookeeper-access-zookeeper
spec:
  egress:
  - ports:
    - port: 2181
      protocol: TCP
    - port: 2888
      protocol: TCP
    - port: 3888
      protocol: TCP
    - port: 7000
      protocol: TCP
    - port: 8080
      protocol: TCP
    to:
    - podSelector:
        matchLabels:
          kind: ZookeeperMember
          technology: zookeeper
  podSelector:
    matchLabels:
      kind: ZookeeperMember
      technology: zookeeper
  policyTypes:
  - Egress

b)

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-solr-access-zookeeper
spec:
  egress:
  - ports:
    - port: 2181
      protocol: TCP
    - port: 7000
      protocol: TCP
    - port: 8080
      protocol: TCP
    to:
    - podSelector:
        matchLabels:
          kind: ZookeeperMember
          technology: zookeeper
  podSelector:
    matchLabels:
      allow-zookeeper-access: "true"
  policyTypes:
  - Egress

Good luck :crossed_fingers:

iampranabroy commented 2 years ago

Thanks, @mmoscher for sharing the details. A good point to keep in mind about NetworkPolicy for zookeeper and solr cluster's internal communication.

In my case, the zookeeper and solr clusters are in the same namespace and they come up sometimes fine, but notice this error a few times. The error itself is not consistent, making it hard to debug.