pravega / zookeeper-operator

Kubernetes Operator for Zookeeper
Apache License 2.0

Zookeeper member pod goes into "CrashLoopBackOff" state when new cluster with n (>2) is created #331

Closed aparajita89 closed 3 years ago

aparajita89 commented 3 years ago

Description

When a ZookeeperCluster resource is created with spec.replicas set to n, where n > 1, pods with ordinal > 0 go into the "CrashLoopBackOff" state with the following error in the ZooKeeper logs:

2021-05-20 09:53:25,101 [myid:2] - ERROR [main:QuorumPeerMain@113] - Unexpected exception, exiting abnormally
java.lang.RuntimeException: My id 2 not in the peer list
    at org.apache.zookeeper.server.quorum.QuorumPeer.start(QuorumPeer.java:1073)
    at org.apache.zookeeper.server.quorum.QuorumPeerMain.runFromConfig(QuorumPeerMain.java:227)
    at org.apache.zookeeper.server.quorum.QuorumPeerMain.initializeAndRun(QuorumPeerMain.java:136)
    at org.apache.zookeeper.server.quorum.QuorumPeerMain.main(QuorumPeerMain.java:90)
2021-05-20 09:53:25,103 [myid:2] - INFO  [main:ZKAuditProvider@42] - ZooKeeper audit is disabled.

Importance

Bringing up a cluster on n nodes seems like a basic feature for an operator. Perhaps I am missing something in the configs?

Location

deploy/cr/pravega/zookeeper_v1beta1_zookeepercluster_cr.yaml

Suggestions for an improvement

  1. zoo.cfg should be populated with the list of servers expected in the cluster being brought up.
  2. When more than one node is to be added, add them sequentially rather than all at once.
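
To make the first suggestion concrete, here is a minimal sketch of what a fully populated server list could look like in ZooKeeper's dynamic configuration format (`server.<id>=<host>:<quorum port>:<election port>:<role>;<client port>`). The helper names `myid_for_host` and `server_lines` are hypothetical and not part of the operator. The id rule (pod ordinal + 1) and the port values are taken from the startup trace later in this thread. Whether the operator's script writes lines in exactly this shape is an assumption.

```python
import re


def myid_for_host(host: str) -> int:
    """Derive the server id from a pod name such as 'zookeeperpoc-1'.

    Mirrors the trace in this thread, where ORD=1 yields MYID=2.
    """
    match = re.fullmatch(r"(.*)-([0-9]+)", host)
    if match is None:
        raise ValueError(f"cannot parse an ordinal from host name {host!r}")
    return int(match.group(2)) + 1


def server_lines(cluster_name: str, domain: str, replicas: int,
                 quorum_port: int = 2888, leader_port: int = 3888,
                 client_port: int = 2181) -> list:
    """Build one dynamic-config line per expected ensemble member."""
    lines = []
    for ordinal in range(replicas):
        host = f"{cluster_name}-{ordinal}"
        lines.append(
            f"server.{myid_for_host(host)}={host}.{domain}:"
            f"{quorum_port}:{leader_port}:participant;{client_port}"
        )
    return lines


if __name__ == "__main__":
    for line in server_lines(
        "zookeeperpoc", "zookeeperpoc-headless.zk.svc.cluster.local", 4
    ):
        print(line)
```

A node whose own id is missing from such a list fails at startup with the "not in the peer list" error shown above.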

anishakj commented 3 years ago

@aparajita89 Could you please let us know the zookeeper-operator and cluster version used here?

aparajita89 commented 3 years ago

pravega zookeeper operator: 0.2.9
zookeeper version: 3.6.2

anishakj commented 3 years ago

Please use version 0.2.10 of both the operator and the cluster image. We fixed this in 0.2.10.

aparajita89 commented 3 years ago

ok, will try that. could you also share, what was the fix? just curious

anishakj commented 3 years ago

PR https://github.com/pravega/zookeeper-operator/pull/135 contains the fix.

aparajita89 commented 3 years ago

@anishakj i tried upgrading to 0.2.10 and creating a cluster of 4 nodes. it is still failing with the same error as mentioned in the issue description.

amuraru commented 3 years ago

@aparajita89 #135 fixed an issue in the zk docker image itself, not in the operator. You need to make sure you upgrade to a zk image that includes that fix.

anishakj commented 3 years ago

@aparajita89 Please let us know if the issue got resolved for you.

aparajita89 commented 3 years ago

i tried upgrading to:
pravega/zookeeper-operator: 0.2.10
pravega/zookeeper: 0.2.10

i'm still seeing the same error.

i tried to debug this as well. i think this is related to docker/bin/zookeeperStart.sh script. imo, REGISTER_NODE and WRITE_CONFIGURATION must always be true (consequently, the true/false checks on these can be removed entirely). also, node registration should be called before the config file is written so that the config file will contain the latest information about the cluster. this way, when a new pod is coming up, it always gets the latest configs from the existing zookeeper cluster. but perhaps i am missing something, should these checks be retained?

anishakj commented 3 years ago

@aparajita89 could you please share the logs from zookeeper-1 from before the restart happened?

aparajita89 commented 3 years ago

these are the last few logs which came from the previous "CrashLoopBackOff" error:

2021-05-24 08:23:13,358 [myid:2] - INFO  [main:AbstractConnector@380] - Stopped ServerConnector@70e9c95d{HTTP/1.1,[http/1.1]}{0.0.0.0:7000}
2021-05-24 08:23:13,359 [myid:2] - INFO  [main:ContextHandler@1016] - Stopped o.e.j.s.ServletContextHandler@4b520ea8{/,null,UNAVAILABLE}
2021-05-24 08:23:13,361 [myid:2] - ERROR [main:QuorumPeerMain@113] - Unexpected exception, exiting abnormally
java.lang.RuntimeException: My id 2 not in the peer list
    at org.apache.zookeeper.server.quorum.QuorumPeer.start(QuorumPeer.java:1073)
    at org.apache.zookeeper.server.quorum.QuorumPeerMain.runFromConfig(QuorumPeerMain.java:227)
    at org.apache.zookeeper.server.quorum.QuorumPeerMain.initializeAndRun(QuorumPeerMain.java:136)
    at org.apache.zookeeper.server.quorum.QuorumPeerMain.main(QuorumPeerMain.java:90)
2021-05-24 08:23:13,362 [myid:2] - INFO  [main:ZKAuditProvider@42] - ZooKeeper audit is disabled.
2021-05-24 08:23:13,364 [myid:2] - ERROR [main:ServiceUtils@42] - Exiting JVM with code 1

I've recreated the CRD now to recreate the cluster. After that, these are the last few lines of the logs:

2021-05-24 08:49:23,200 [myid:2] - INFO  [main:AbstractConnector@380] - Stopped ServerConnector@70e9c95d{HTTP/1.1,[http/1.1]}{0.0.0.0:7000}
2021-05-24 08:49:23,202 [myid:2] - INFO  [main:ContextHandler@1016] - Stopped o.e.j.s.ServletContextHandler@4b520ea8{/,null,UNAVAILABLE}
2021-05-24 08:49:23,203 [myid:2] - ERROR [main:QuorumPeerMain@113] - Unexpected exception, exiting abnormally
java.lang.RuntimeException: My id 2 not in the peer list
    at org.apache.zookeeper.server.quorum.QuorumPeer.start(QuorumPeer.java:1073)
    at org.apache.zookeeper.server.quorum.QuorumPeerMain.runFromConfig(QuorumPeerMain.java:227)
    at org.apache.zookeeper.server.quorum.QuorumPeerMain.initializeAndRun(QuorumPeerMain.java:136)
    at org.apache.zookeeper.server.quorum.QuorumPeerMain.main(QuorumPeerMain.java:90)
2021-05-24 08:49:23,205 [myid:2] - INFO  [main:ZKAuditProvider@42] - ZooKeeper audit is disabled.
2021-05-24 08:49:23,207 [myid:2] - ERROR [main:ServiceUtils@42] - Exiting JVM with code 1

anishakj commented 3 years ago

@aparajita89 these are not the complete logs. Also, can you tell us which environment you are using?

aparajita89 commented 3 years ago

this is running on a privately managed k8s cluster.

this is the complete log:

$ kubectl logs zookeeperpoc-1
+ source /conf/env.sh
++ DOMAIN=zookeeperpoc-headless.zk.svc.cluster.local
++ QUORUM_PORT=2888
++ LEADER_PORT=3888
++ CLIENT_HOST=zookeeperpoc-client
++ CLIENT_PORT=2181
++ ADMIN_SERVER_HOST=zookeeperpoc-admin-server
++ ADMIN_SERVER_PORT=8080
++ CLUSTER_NAME=zookeeperpoc
++ CLUSTER_SIZE=4
+ source /usr/local/bin/zookeeperFunctions.sh
++ set -ex
++ hostname -s
+ HOST=zookeeperpoc-1
+ DATA_DIR=/data
+ MYID_FILE=/data/myid
+ LOG4J_CONF=/conf/log4j-quiet.properties
+ DYNCONFIG=/data/zoo.cfg.dynamic
+ STATIC_CONFIG=/data/conf/zoo.cfg
+ [[ zookeeperpoc-1 =~ (.*)-([0-9]+)$ ]]
+ NAME=zookeeperpoc
+ ORD=1
+ MYID=2
+ WRITE_CONFIGURATION=true
+ REGISTER_NODE=true
+ ONDISK_MYID_CONFIG=false
+ ONDISK_DYN_CONFIG=false
+ '[' -f /data/myid ']'
++ cat /data/myid
+ EXISTING_ID=2
+ [[ 2 == \2 ]]
+ [[ -f /data/conf/zoo.cfg ]]
+ ONDISK_MYID_CONFIG=true
+ '[' -f /data/zoo.cfg.dynamic ']'
+ ONDISK_DYN_CONFIG=true
+ set +e
+ [[ -n '' ]]
+ set -e
+ set +e
+ nslookup zookeeperpoc-headless.zk.svc.cluster.local
Server:     10.96.0.10
Address:    10.96.0.10#53

** server can't find zookeeperpoc-headless.zk.svc.cluster.local: NXDOMAIN

+ [[ 1 -eq 0 ]]
+ grep -q 'server can'\''t find zookeeperpoc-headless.zk.svc.cluster.local'
+ nslookup zookeeperpoc-headless.zk.svc.cluster.local
+ echo 'there is no active ensemble'
+ ACTIVE_ENSEMBLE=false
+ [[ true == true ]]
+ [[ true == true ]]
there is no active ensemble
Copying /conf contents to writable directory, to support Zookeeper dynamic reconfiguration
+ WRITE_CONFIGURATION=false
+ [[ false == false ]]
+ REGISTER_NODE=false
+ [[ false == true ]]
+ [[ false == true ]]
+ ZOOCFGDIR=/data/conf
+ export ZOOCFGDIR
+ echo Copying /conf contents to writable directory, to support Zookeeper dynamic reconfiguration
+ [[ ! -d /data/conf ]]
+ echo Copying the /conf/zoo.cfg contents except the dynamic config file during restart
Copying the /conf/zoo.cfg contents except the dynamic config file during restart
++ head -n -1 /conf/zoo.cfg
++ tail -n 1 /data/conf/zoo.cfg
+ echo -e '4lw.commands.whitelist=cons, envi, conf, crst, srvr, stat, mntr, ruok
dataDir=/data
standaloneEnabled=false
reconfigEnabled=true
skipACL=yes
metricsProvider.className=org.apache.zookeeper.metrics.prometheus.PrometheusMetricsProvider
metricsProvider.httpPort=7000
metricsProvider.exportJvmInfo=true
initLimit=15
syncLimit=3
tickTime=1500
globalOutstandingLimit=1000
preAllocSize=65536
snapCount=10000
commitLogCount=500
snapSizeLimitInKb=4194304
maxCnxns=0
maxClientCnxns=60
minSessionTimeout=3000
maxSessionTimeout=30000
autopurge.snapRetainCount=3
autopurge.purgeInterval=1
quorumListenOnAllIPs=false
admin.serverPort=8080\ndynamicConfigFile=/data/zoo.cfg.dynamic'
+ cp -f /conf/log4j.properties /data/conf
+ cp -f /conf/log4j-quiet.properties /data/conf
+ cp -f /conf/env.sh /data/conf
Starting zookeeper service
+ '[' -f /data/zoo.cfg.dynamic ']'
+ echo Starting zookeeper service
+ zkServer.sh --config /data/conf start-foreground
ZooKeeper JMX enabled by default
Using config: /data/conf/zoo.cfg
2021-05-24 08:49:22,690 [myid:] - INFO  [main:QuorumPeerConfig@173] - Reading configuration from: /data/conf/zoo.cfg
2021-05-24 08:49:22,698 [myid:] - INFO  [main:QuorumPeerConfig@450] - clientPort is not set
2021-05-24 08:49:22,698 [myid:] - INFO  [main:QuorumPeerConfig@463] - secureClientPort is not set
2021-05-24 08:49:22,698 [myid:] - INFO  [main:QuorumPeerConfig@479] - observerMasterPort is not set
2021-05-24 08:49:22,702 [myid:] - INFO  [main:QuorumPeerConfig@496] - metricsProvider.className is org.apache.zookeeper.metrics.prometheus.PrometheusMetricsProvider
2021-05-24 08:49:22,718 [myid:] - WARN  [main:QuorumPeerConfig@727] - No server failure will be tolerated. You need at least 3 servers.
2021-05-24 08:49:22,722 [myid:2] - INFO  [main:DatadirCleanupManager@78] - autopurge.snapRetainCount set to 3
2021-05-24 08:49:22,722 [myid:2] - INFO  [main:DatadirCleanupManager@79] - autopurge.purgeInterval set to 1
2021-05-24 08:49:22,726 [myid:2] - INFO  [main:ManagedUtil@44] - Log4j 1.2 jmx support found and enabled.
2021-05-24 08:49:22,731 [myid:2] - INFO  [main:QuorumPeerMain@151] - Starting quorum peer
2021-05-24 08:49:22,759 [myid:2] - INFO  [PurgeTask:DatadirCleanupManager$PurgeTask@139] - Purge task started.
2021-05-24 08:49:22,772 [myid:2] - INFO  [PurgeTask:FileTxnSnapLog@124] - zookeeper.snapshot.trust.empty : false
2021-05-24 08:49:22,783 [myid:2] - INFO  [main:PrometheusMetricsProvider@74] - Initializing metrics, configuration: {exportJvmInfo=true, httpPort=7000}
2021-05-24 08:49:22,783 [myid:2] - INFO  [main:PrometheusMetricsProvider@82] - Starting /metrics HTTP endpoint at port 7000 exportJvmInfo: true
2021-05-24 08:49:22,797 [myid:2] - INFO  [PurgeTask:DatadirCleanupManager$PurgeTask@145] - Purge task completed.
2021-05-24 08:49:22,867 [myid:2] - INFO  [main:Log@169] - Logging initialized @889ms to org.eclipse.jetty.util.log.Slf4jLog
2021-05-24 08:49:23,020 [myid:2] - INFO  [main:Server@359] - jetty-9.4.24.v20191120; built: 2019-11-20T21:37:49.771Z; git: 363d5f2df3a8a28de40604320230664b9c793c16; jvm 11.0.8+10
2021-05-24 08:49:23,076 [myid:2] - INFO  [main:ContextHandler@825] - Started o.e.j.s.ServletContextHandler@4b520ea8{/,null,AVAILABLE}
2021-05-24 08:49:23,104 [myid:2] - INFO  [main:AbstractConnector@330] - Started ServerConnector@70e9c95d{HTTP/1.1,[http/1.1]}{0.0.0.0:7000}
2021-05-24 08:49:23,104 [myid:2] - INFO  [main:Server@399] - Started @1130ms
2021-05-24 08:49:23,120 [myid:2] - INFO  [main:ServerMetrics@62] - ServerMetrics initialized with provider org.apache.zookeeper.metrics.prometheus.PrometheusMetricsProvider@7ac296f6
2021-05-24 08:49:23,142 [myid:2] - INFO  [main:QuorumPeer@752] - zookeeper.quorumCnxnTimeoutMs=-1
2021-05-24 08:49:23,156 [myid:2] - WARN  [main:ContextHandler@1520] - o.e.j.s.ServletContextHandler@79c97cb{/,null,UNAVAILABLE} contextPath ends with /*
2021-05-24 08:49:23,156 [myid:2] - WARN  [main:ContextHandler@1531] - Empty contextPath
2021-05-24 08:49:23,158 [myid:2] - INFO  [main:X509Util@77] - Setting -D jdk.tls.rejectClientInitiatedRenegotiation=true to disable client-initiated TLS renegotiation
2021-05-24 08:49:23,159 [myid:2] - INFO  [main:FileTxnSnapLog@124] - zookeeper.snapshot.trust.empty : false
2021-05-24 08:49:23,159 [myid:2] - INFO  [main:QuorumPeer@1680] - Local sessions disabled
2021-05-24 08:49:23,159 [myid:2] - INFO  [main:QuorumPeer@1691] - Local session upgrading disabled
2021-05-24 08:49:23,159 [myid:2] - INFO  [main:QuorumPeer@1658] - tickTime set to 1500
2021-05-24 08:49:23,159 [myid:2] - INFO  [main:QuorumPeer@1702] - minSessionTimeout set to 3000
2021-05-24 08:49:23,160 [myid:2] - INFO  [main:QuorumPeer@1713] - maxSessionTimeout set to 30000
2021-05-24 08:49:23,160 [myid:2] - INFO  [main:QuorumPeer@1738] - initLimit set to 15
2021-05-24 08:49:23,160 [myid:2] - INFO  [main:QuorumPeer@1920] - syncLimit set to 3
2021-05-24 08:49:23,160 [myid:2] - INFO  [main:QuorumPeer@1935] - connectToLearnerMasterLimit set to 0
2021-05-24 08:49:23,169 [myid:2] - INFO  [main:ZookeeperBanner@42] - 
2021-05-24 08:49:23,169 [myid:2] - INFO  [main:ZookeeperBanner@42] -   ______                  _                                          
2021-05-24 08:49:23,169 [myid:2] - INFO  [main:ZookeeperBanner@42] -  |___  /                 | |                                         
2021-05-24 08:49:23,170 [myid:2] - INFO  [main:ZookeeperBanner@42] -     / /    ___     ___   | | __   ___    ___   _ __     ___   _ __   
2021-05-24 08:49:23,170 [myid:2] - INFO  [main:ZookeeperBanner@42] -    / /    / _ \   / _ \  | |/ /  / _ \  / _ \ | '_ \   / _ \ | '__|
2021-05-24 08:49:23,171 [myid:2] - INFO  [main:ZookeeperBanner@42] -   / /__  | (_) | | (_) | |   <  |  __/ |  __/ | |_) | |  __/ | |    
2021-05-24 08:49:23,171 [myid:2] - INFO  [main:ZookeeperBanner@42] -  /_____|  \___/   \___/  |_|\_\  \___|  \___| | .__/   \___| |_|
2021-05-24 08:49:23,171 [myid:2] - INFO  [main:ZookeeperBanner@42] -                                               | |                     
2021-05-24 08:49:23,171 [myid:2] - INFO  [main:ZookeeperBanner@42] -                                               |_|                     
2021-05-24 08:49:23,171 [myid:2] - INFO  [main:ZookeeperBanner@42] - 
2021-05-24 08:49:23,172 [myid:2] - INFO  [main:Environment@98] - Server environment:zookeeper.version=3.6.1--104dcb3e3fb464b30c5186d229e00af9f332524b, built on 04/21/2020 15:01 GMT
2021-05-24 08:49:23,172 [myid:2] - INFO  [main:Environment@98] - Server environment:host.name=zookeeperpoc-1.zookeeperpoc-headless.zk.svc.cluster.local
2021-05-24 08:49:23,172 [myid:2] - INFO  [main:Environment@98] - Server environment:java.version=11.0.8
2021-05-24 08:49:23,172 [myid:2] - INFO  [main:Environment@98] - Server environment:java.vendor=N/A
2021-05-24 08:49:23,172 [myid:2] - INFO  [main:Environment@98] - Server environment:java.home=/usr/local/openjdk-11
2021-05-24 08:49:23,172 [myid:2] - INFO  [main:Environment@98] - Server environment:java.class.path=/apache-zookeeper-3.6.1-bin/bin/../zookeeper-server/target/classes:/apache-zookeeper-3.6.1-bin/bin/../build/classes:/apache-zookeeper-3.6.1-bin/bin/../zookeeper-server/target/lib/*.jar:/apache-zookeeper-3.6.1-bin/bin/../build/lib/*.jar:/apache-zookeeper-3.6.1-bin/bin/../lib/zookeeper-prometheus-metrics-3.6.1.jar:/apache-zookeeper-3.6.1-bin/bin/../lib/zookeeper-jute-3.6.1.jar:/apache-zookeeper-3.6.1-bin/bin/../lib/zookeeper-3.6.1.jar:/apache-zookeeper-3.6.1-bin/bin/../lib/snappy-java-1.1.7.jar:/apache-zookeeper-3.6.1-bin/bin/../lib/slf4j-log4j12-1.7.25.jar:/apache-zookeeper-3.6.1-bin/bin/../lib/slf4j-api-1.7.25.jar:/apache-zookeeper-3.6.1-bin/bin/../lib/simpleclient_servlet-0.6.0.jar:/apache-zookeeper-3.6.1-bin/bin/../lib/simpleclient_hotspot-0.6.0.jar:/apache-zookeeper-3.6.1-bin/bin/../lib/simpleclient_common-0.6.0.jar:/apache-zookeeper-3.6.1-bin/bin/../lib/simpleclient-0.6.0.jar:/apache-zookeeper-3.6.1-bin/bin/../lib/netty-transport-native-unix-common-4.1.48.Final.jar:/apache-zookeeper-3.6.1-bin/bin/../lib/netty-transport-native-epoll-4.1.48.Final.jar:/apache-zookeeper-3.6.1-bin/bin/../lib/netty-transport-4.1.48.Final.jar:/apache-zookeeper-3.6.1-bin/bin/../lib/netty-resolver-4.1.48.Final.jar:/apache-zookeeper-3.6.1-bin/bin/../lib/netty-handler-4.1.48.Final.jar:/apache-zookeeper-3.6.1-bin/bin/../lib/netty-common-4.1.48.Final.jar:/apache-zookeeper-3.6.1-bin/bin/../lib/netty-codec-4.1.48.Final.jar:/apache-zookeeper-3.6.1-bin/bin/../lib/netty-buffer-4.1.48.Final.jar:/apache-zookeeper-3.6.1-bin/bin/../lib/metrics-core-3.2.5.jar:/apache-zookeeper-3.6.1-bin/bin/../lib/log4j-1.2.17.jar:/apache-zookeeper-3.6.1-bin/bin/../lib/json-simple-1.1.1.jar:/apache-zookeeper-3.6.1-bin/bin/../lib/jline-2.11.jar:/apache-zookeeper-3.6.1-bin/bin/../lib/jetty-util-9.4.24.v20191120.jar:/apache-zookeeper-3.6.1-bin/bin/../lib/jetty-servlet-9.4.24.v20191120.jar:/apache-zookeeper-3.6.1-bin/bin/../lib/jetty-server-9.4.24.v20191120.jar:/apache-zookeeper-3.6.1-bin/bin/../lib/jetty-security-9.4.24.v20191120.jar:/apache-zookeeper-3.6.1-bin/bin/../lib/jetty-io-9.4.24.v20191120.jar:/apache-zookeeper-3.6.1-bin/bin/../lib/jetty-http-9.4.24.v20191120.jar:/apache-zookeeper-3.6.1-bin/bin/../lib/javax.servlet-api-3.1.0.jar:/apache-zookeeper-3.6.1-bin/bin/../lib/jackson-databind-2.10.3.jar:/apache-zookeeper-3.6.1-bin/bin/../lib/jackson-core-2.10.3.jar:/apache-zookeeper-3.6.1-bin/bin/../lib/jackson-annotations-2.10.3.jar:/apache-zookeeper-3.6.1-bin/bin/../lib/commons-lang-2.6.jar:/apache-zookeeper-3.6.1-bin/bin/../lib/commons-cli-1.2.jar:/apache-zookeeper-3.6.1-bin/bin/../lib/audience-annotations-0.5.0.jar:/apache-zookeeper-3.6.1-bin/bin/../zookeeper-*.jar:/apache-zookeeper-3.6.1-bin/bin/../zookeeper-server/src/main/resources/lib/*.jar:/data/conf:
2021-05-24 08:49:23,173 [myid:2] - INFO  [main:Environment@98] - Server environment:java.library.path=/usr/java/packages/lib:/usr/lib64:/lib64:/lib:/usr/lib
2021-05-24 08:49:23,173 [myid:2] - INFO  [main:Environment@98] - Server environment:java.io.tmpdir=/tmp
2021-05-24 08:49:23,173 [myid:2] - INFO  [main:Environment@98] - Server environment:java.compiler=<NA>
2021-05-24 08:49:23,173 [myid:2] - INFO  [main:Environment@98] - Server environment:os.name=Linux
2021-05-24 08:49:23,173 [myid:2] - INFO  [main:Environment@98] - Server environment:os.arch=amd64
2021-05-24 08:49:23,173 [myid:2] - INFO  [main:Environment@98] - Server environment:os.version=5.10.0-0.bpo.3-cloud-amd64
2021-05-24 08:49:23,174 [myid:2] - INFO  [main:Environment@98] - Server environment:user.name=root
2021-05-24 08:49:23,174 [myid:2] - INFO  [main:Environment@98] - Server environment:user.home=/root
2021-05-24 08:49:23,174 [myid:2] - INFO  [main:Environment@98] - Server environment:user.dir=/apache-zookeeper-3.6.1-bin
2021-05-24 08:49:23,176 [myid:2] - INFO  [main:Environment@98] - Server environment:os.memory.free=881MB
2021-05-24 08:49:23,176 [myid:2] - INFO  [main:Environment@98] - Server environment:os.memory.max=966MB
2021-05-24 08:49:23,176 [myid:2] - INFO  [main:Environment@98] - Server environment:os.memory.total=966MB
2021-05-24 08:49:23,176 [myid:2] - INFO  [main:ZooKeeperServer@128] - zookeeper.enableEagerACLCheck = false
2021-05-24 08:49:23,176 [myid:2] - INFO  [main:ZooKeeperServer@132] - zookeeper.skipACL=="yes", ACL checks will be skipped
2021-05-24 08:49:23,177 [myid:2] - INFO  [main:ZooKeeperServer@136] - zookeeper.digest.enabled = true
2021-05-24 08:49:23,177 [myid:2] - INFO  [main:ZooKeeperServer@140] - zookeeper.closeSessionTxn.enabled = true
2021-05-24 08:49:23,177 [myid:2] - INFO  [main:ZooKeeperServer@1434] - zookeeper.flushDelay=0
2021-05-24 08:49:23,177 [myid:2] - INFO  [main:ZooKeeperServer@1443] - zookeeper.maxWriteQueuePollTime=0
2021-05-24 08:49:23,177 [myid:2] - INFO  [main:ZooKeeperServer@1452] - zookeeper.maxBatchSize=1000
2021-05-24 08:49:23,177 [myid:2] - INFO  [main:ZooKeeperServer@241] - zookeeper.intBufferStartingSizeBytes = 1024
2021-05-24 08:49:23,180 [myid:2] - INFO  [main:WatchManagerFactory@42] - Using org.apache.zookeeper.server.watch.WatchManager as watch manager
2021-05-24 08:49:23,180 [myid:2] - INFO  [main:WatchManagerFactory@42] - Using org.apache.zookeeper.server.watch.WatchManager as watch manager
2021-05-24 08:49:23,182 [myid:2] - INFO  [main:ZKDatabase@132] - zookeeper.snapshotSizeFactor = 0.33
2021-05-24 08:49:23,182 [myid:2] - INFO  [main:ZKDatabase@152] - zookeeper.commitLogCount=500
2021-05-24 08:49:23,196 [myid:2] - INFO  [main:QuorumPeer@2001] - Using insecure (non-TLS) quorum communication
2021-05-24 08:49:23,196 [myid:2] - INFO  [main:QuorumPeer@2007] - Port unification disabled
2021-05-24 08:49:23,196 [myid:2] - INFO  [main:QuorumPeer@174] - multiAddress.enabled set to false
2021-05-24 08:49:23,196 [myid:2] - INFO  [main:QuorumPeer@199] - multiAddress.reachabilityCheckEnabled set to true
2021-05-24 08:49:23,196 [myid:2] - INFO  [main:QuorumPeer@186] - multiAddress.reachabilityCheckTimeoutMs set to 1000
2021-05-24 08:49:23,196 [myid:2] - INFO  [main:QuorumPeer@2461] - QuorumPeer communication is not secured! (SASL auth disabled)
2021-05-24 08:49:23,196 [myid:2] - INFO  [main:QuorumPeer@2486] - quorum.cnxn.threads.size set to 20
2021-05-24 08:49:23,200 [myid:2] - INFO  [main:AbstractConnector@380] - Stopped ServerConnector@70e9c95d{HTTP/1.1,[http/1.1]}{0.0.0.0:7000}
2021-05-24 08:49:23,202 [myid:2] - INFO  [main:ContextHandler@1016] - Stopped o.e.j.s.ServletContextHandler@4b520ea8{/,null,UNAVAILABLE}
2021-05-24 08:49:23,203 [myid:2] - ERROR [main:QuorumPeerMain@113] - Unexpected exception, exiting abnormally
java.lang.RuntimeException: My id 2 not in the peer list
    at org.apache.zookeeper.server.quorum.QuorumPeer.start(QuorumPeer.java:1073)
    at org.apache.zookeeper.server.quorum.QuorumPeerMain.runFromConfig(QuorumPeerMain.java:227)
    at org.apache.zookeeper.server.quorum.QuorumPeerMain.initializeAndRun(QuorumPeerMain.java:136)
    at org.apache.zookeeper.server.quorum.QuorumPeerMain.main(QuorumPeerMain.java:90)
2021-05-24 08:49:23,205 [myid:2] - INFO  [main:ZKAuditProvider@42] - ZooKeeper audit is disabled.
2021-05-24 08:49:23,207 [myid:2] - ERROR [main:ServiceUtils@42] - Exiting JVM with code 1
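
The exception at the end of this log means the server's own id does not appear among the `server.<id>=...` entries it loaded. As a rough diagnostic, the same check can be run by hand against the contents of /data/zoo.cfg.dynamic before the pod restarts. The sketch below is only an illustration: `server_ids` and `myid_in_peer_list` are hypothetical helper names, and the sample file content is invented for the example, not copied from this cluster.

```python
import re
from typing import Set

# Matches the start of a dynamic-config entry such as "server.1=host:2888:3888:..."
SERVER_LINE = re.compile(r"^\s*server\.(\d+)\s*=")


def server_ids(dynamic_config_text: str) -> Set[int]:
    """Collect every server id declared in a dynamic config file."""
    ids = set()
    for line in dynamic_config_text.splitlines():
        match = SERVER_LINE.match(line)
        if match:
            ids.add(int(match.group(1)))
    return ids


def myid_in_peer_list(myid: int, dynamic_config_text: str) -> bool:
    """Return True when the node's own id is part of the declared ensemble."""
    return myid in server_ids(dynamic_config_text)


if __name__ == "__main__":
    # Hypothetical stale file that only knows about the first server.
    stale = (
        "server.1=zookeeperpoc-0.zookeeperpoc-headless.zk.svc.cluster.local"
        ":2888:3888:participant;2181\n"
    )
    print(myid_in_peer_list(2, stale))  # the situation that triggers the error
```

If this returns False for the id stored in /data/myid, the node would be expected to exit with the same RuntimeException.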

anishakj commented 3 years ago

@aparajita89 it looks like the /data/zoo.cfg.dynamic file is already present because you tried an upgrade. Could you please uninstall the zookeeper cluster and do a fresh installation? Also, let me know whether nslookup zookeeperpoc-headless.zk.svc.cluster.local resolves from the first zk pod.
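
The effect of the leftover files can be read off the startup trace above. Because /data/myid, /data/conf/zoo.cfg and /data/zoo.cfg.dynamic all exist, WRITE_CONFIGURATION is switched to false, and because the nslookup fails, REGISTER_NODE is switched to false, so the pod starts with whatever the old dynamic file contains. The sketch below models only that observed branch. It is a reading of the trace, not a copy of zookeeperStart.sh: `StartupState` and `decide` are hypothetical names, and the behaviour when an ensemble is reachable is an assumption, since the trace never exercises it.

```python
from dataclasses import dataclass


@dataclass
class StartupState:
    myid_matches: bool           # /data/myid exists and holds the expected id
    static_config_exists: bool   # /data/conf/zoo.cfg exists
    dynamic_config_exists: bool  # /data/zoo.cfg.dynamic exists
    active_ensemble: bool        # DNS lookup of the headless service succeeded


def decide(state: StartupState):
    """Return (write_configuration, register_node) for a starting pod."""
    ondisk_myid_config = state.myid_matches and state.static_config_exists
    ondisk_dyn_config = state.dynamic_config_exists

    # Observed in the trace: both on-disk flags true -> configuration is reused.
    write_configuration = not (ondisk_myid_config and ondisk_dyn_config)

    # Observed in the trace: no active ensemble -> registration is skipped.
    # ASSUMPTION: with an active ensemble the node registers; the real script
    # may apply further checks that the trace does not show.
    register_node = state.active_ensemble

    return write_configuration, register_node


if __name__ == "__main__":
    # State of zookeeperpoc-1 in the trace: stale files, DNS lookup failing.
    print(decide(StartupState(True, True, True, False)))
    # A fresh volume with no files yet and DNS still failing.
    print(decide(StartupState(False, False, False, False)))
```

Under this model, a volume that still carries files from an earlier installation keeps its old server list, which is consistent with the advice to remove the old cluster state before reinstalling.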

aparajita89 commented 3 years ago

nslookup zookeeperpoc-headless.zk.svc.cluster.local => this is resolving to zookeeperpoc-0 which is the first pod in the cluster

i deleted the CRD again. seems like this does not delete the PVC. i manually deleted the PVC as well and then re-created the CRD. this time the cluster got created after a couple of pod restarts on zookeeperpoc-1.

aparajita89 commented 3 years ago

i tried recreating the cluster again and this time the PVC got deleted and recreated as expected. we can close this issue now. thanks for your help @anishakj.

anishakj commented 3 years ago

Closing this issue, as it is resolved