Closed: BassinD closed this issue 3 years ago.
@BassinD I have been hit by this issue for quite some time, with no resolution so far. This is a critical issue for me as well. It happens intermittently even in our production environment.
From what I understand, this failure can happen when the promotion of a node from observer to participant fails in zookeeperReady.sh, which is run as part of the readiness probe. Sometimes I get the error below:
++ echo ruok
++ nc 127.0.0.1 2181
(UNKNOWN) [127.0.0.1] 2181 (?) : Connection refused
Could this mean the ZK server is not yet ready to accept connections? I was wondering whether adding an initial delay to the readinessProbe would help, so that "ruok" requests are only sent after a delay (hopefully the ZK server will be up and running by then).
I had requested a fix to make these probes configurable in #275. Any idea when the next release of the ZK operator will be, so we can pick up these fixes?
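For illustration, something like the following on the zookeeper container spec is what I have in mind (just a sketch using the standard Kubernetes probe fields; whether and how the operator lets you override these depends on #275 landing, and the values below simply mirror what kubectl describe shows further down in this thread):

# Sketch of a readiness probe with an added initial delay
readinessProbe:
  exec:
    command:
      - zookeeperReady.sh      # same script the operator-generated pod already uses
  initialDelaySeconds: 60      # give the ZK server time to start before the first "ruok" check
  periodSeconds: 10
  timeoutSeconds: 10
  failureThreshold: 3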
hi, when will this fix be available?
thanks
@priyavj08 The fix is available in master and will be part of the next release. If you want to use the build now, you can git clone the code from master, build it with make build-zk-image, and use that image for your testing in the meantime.
thanks @anishakj
any ETA on next release so I can plan?
thanks
@priyavj08 We are planning to do a release sometime this week.
thanks for your continued support @anishakj @amuraru
I pulled the latest code from master, built the ZK image, and ran continuous install/uninstall tests. It was fine for about 20-odd iterations, then failed on the 25th iteration. This issue still exists; unfortunately it happens most of the time in production (Murphy's law coming into play).
When I was able to exec into the failing pod's container I found:
cat /data/myid
2
cat /data/conf/zoo.cfg
4lw.commands.whitelist=cons, envi, conf, crst, srvr, stat, mntr, ruok
dataDir=/data
standaloneEnabled=false
reconfigEnabled=true
skipACL=yes
metricsProvider.className=org.apache.zookeeper.metrics.prometheus.PrometheusMetricsProvider
metricsProvider.httpPort=7000
metricsProvider.exportJvmInfo=true
initLimit=10
syncLimit=2
tickTime=2000
globalOutstandingLimit=1000
preAllocSize=65536
snapCount=10000
commitLogCount=500
snapSizeLimitInKb=4194304
maxCnxns=0
maxClientCnxns=60
minSessionTimeout=4000
maxSessionTimeout=40000
autopurge.snapRetainCount=3
autopurge.purgeInterval=1
quorumListenOnAllIPs=false
admin.serverPort=8080
dynamicConfigFile=/data/conf/zoo.cfg.dynamic.1000000b4
cat /data/conf/zoo.cfg.dynamic.1000000b4
server.1=fed-kafka-affirmedzk-0.fed-kafka-affirmedzk-headless.fed-kafka.svc.cluster.local:2888:3888:participant;0.0.0.0:21
This seems to be some timing issue; please make this part of the code in zookeeperStart.sh more reliable.
thanks
@priyavj08 please provide the logs from zookeeper-1 before it performs a restart.
Reopening the issue since the problem is still not fixed.
I am trying to reproduce this to capture the first log from the zk-1 pod when it crashes, but in the meantime, here is the log from the ZK-1 pod: ZK1-log.txt
Also, here is the describe output from pod ZK-0; note the connection refused from the probes:
kc describe pod -n fed-kafka fed-kafka-affirmedzk-0
Name: fed-kafka-affirmedzk-0
Namespace: fed-kafka
Priority: 0
Node: priya-vijaya-tx-k8-node-2-cxkrlz-0zdmawqhtomva4wa/10.163.66.203
Start Time: Tue, 20 Apr 2021 06:16:05 +0000
Labels: app=fed-kafka-affirmedzk
controller-revision-hash=fed-kafka-affirmedzk-545b7fbd4
kind=ZookeeperMember
release=fed-kafka-affirmedzk
statefulset.kubernetes.io/pod-name=fed-kafka-affirmedzk-0
Annotations: cni.projectcalico.org/podIP: 192.168.173.46/32
cni.projectcalico.org/podIPs: 192.168.173.46/32,fde6:7f0d:5c6c:ad36:56d7:55a8:2f31:c652/128
kubernetes.io/psp: permissive-network
Status: Running
IP: 192.168.173.46
IPs:
IP: 192.168.173.46
Controlled By: StatefulSet/fed-kafka-affirmedzk
Containers:
zookeeper:
Container ID: docker://a9b7dc72376f2849148f313ad4d34d55bc51c9dce6ec9d66e6c8ec62268f53b2
Image: cnreg-dev:5000/priya_vijaya/affirmed/zookeeper:0.2.12
Image ID: docker-pullable://cnreg-dev:5000/priya_vijaya/affirmed/zookeeper@sha256:2207ced4485ad175e6dc1ece88d44a43238db77b20e6fee543decd4d29f500e6
Ports: 2181/TCP, 2888/TCP, 3888/TCP, 7000/TCP, 8080/TCP
Host Ports: 0/TCP, 0/TCP, 0/TCP, 0/TCP, 0/TCP
Command:
/usr/local/bin/zookeeperStart.sh
State: Running
Started: Tue, 20 Apr 2021 06:16:06 +0000
Ready: True
Restart Count: 0
Limits:
cpu: 100m
memory: 1536Mi
Requests:
cpu: 10m
memory: 1Gi
Liveness: exec [zookeeperLive.sh] delay=10s timeout=10s period=10s #success=1 #failure=3
Readiness: exec [zookeeperReady.sh] delay=10s timeout=10s period=10s #success=1 #failure=3
Environment:
ENVOY_SIDECAR_STATUS: (v1:metadata.annotations['sidecar.istio.io/status'])
Mounts:
/conf from conf (rw)
/data from data (rw)
/var/run/secrets/kubernetes.io/serviceaccount from zookeeper-sa-token-lfl4l (ro)
Conditions:
Type Status
Initialized True
Ready True
ContainersReady True
PodScheduled True
Volumes:
data:
Type: PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
ClaimName: data-fed-kafka-affirmedzk-0
ReadOnly: false
conf:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: fed-kafka-affirmedzk-configmap
Optional: false
zookeeper-sa-token-lfl4l:
Type: Secret (a volume populated by a Secret)
SecretName: zookeeper-sa-token-lfl4l
Optional: false
QoS Class: Burstable
Node-Selectors:
Events:
  Normal   Scheduled  5m44s  default-scheduler  Successfully assigned fed-kafka/fed-kafka-affirmedzk-0 to priya-vijaya-tx-k8-node-2-cxkrlz-0zdmawqhtomva4wa
  Normal   Pulled     5m43s  kubelet, priya-vijaya-tx-k8-node-2-cxkrlz-0zdmawqhtomva4wa  Container image "cnreg-dev:5000/priya_vijaya/affirmed/zookeeper:0.2.12" already present on machine
  Normal   Created    5m43s  kubelet, priya-vijaya-tx-k8-node-2-cxkrlz-0zdmawqhtomva4wa  Created container zookeeper
  Normal   Started    5m43s  kubelet, priya-vijaya-tx-k8-node-2-cxkrlz-0zdmawqhtomva4wa  Started container zookeeper
  Warning  Unhealthy  5m31s  kubelet, priya-vijaya-tx-k8-node-2-cxkrlz-0zdmawqhtomva4wa  Liveness probe failed: + source /conf/env.sh ++ DOMAIN=fed-kafka-affirmedzk-headless.fed-kafka.svc.cluster.local ++ QUORUM_PORT=2888 ++ LEADER_PORT=3888 ++ CLIENT_HOST=fed-kafka-affirmedzk-client ++ CLIENT_PORT=2181 ++ ADMIN_SERVER_HOST=fed-kafka-affirmedzk-admin-server ++ ADMIN_SERVER_PORT=8080 ++ CLUSTER_NAME=fed-kafka-affirmedzk ++ CLUSTER_SIZE=3 ++ nc 127.0.0.1 2181 ++ echo ruok (UNKNOWN) [127.0.0.1] 2181 (?) : Connection refused
In my recent tests of repeated install/uninstall, it got into a bad state after the 18th iteration.
For some reason ZK-1 failed to join the ensemble and ZK-2 is in a CrashLoopBackOff state (but not with the "my id 2 is missing" error).
Attaching all the logs:
failure-zk0.log failure-zk1.log failure-zk2.log
Attaching the pod describe output.
@priyavj08, could you please confirm that the base zookeeper image version used is 3.6.1?
@priyavj08 From the failure-zk1.log, I can see a connectivity issue between the second node and the first one:
2021-04-20 10:30:23,206 [myid:2] - WARN [RecvWorker:1:QuorumCnxManager$RecvWorker@1395] - Connection broken for id 1, my id = 2
java.io.EOFException
at java.base/java.io.DataInputStream.readInt(Unknown Source)
at org.apache.zookeeper.server.quorum.QuorumCnxManager$RecvWorker.run(QuorumCnxManager.java:1383)
2021-04-20 10:30:23,208 [myid:2] - WARN [RecvWorker:1:QuorumCnxManager$RecvWorker@1401] - Interrupting SendWorker thread from RecvWorker. sid: 1. myId: 2
2021-04-20 10:30:23,209 [myid:2] - WARN [SendWorker:1:QuorumCnxManager$SendWorker@1281] - Interrupted while waiting for message on queue
java.lang.InterruptedException
at java.base/java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(Unknown Source)
at java.base/java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(Unknown Source)
at org.apache.zookeeper.util.CircularBlockingQueue.poll(CircularBlockingQueue.java:105)
at org.apache.zookeeper.server.quorum.QuorumCnxManager.pollSendQueue(QuorumCnxManager.java:1446)
at org.apache.zookeeper.server.quorum.QuorumCnxManager.access$900(QuorumCnxManager.java:98)
at org.apache.zookeeper.server.quorum.QuorumCnxManager$SendWorker.run(QuorumCnxManager.java:1270)
2021-04-20 10:30:23,210 [myid:2] - WARN [SendWorker:1:QuorumCnxManager$SendWorker@1293] - Send worker leaving thread id 1 my id = 2
I ran this inside the ZK pod; I am using build 3.6.1:
Zookeeper version: 3.6.1--104dcb3e3fb464b30c5186d229e00af9f332524b, built on 04/21/2020 15:01 GMT
With the latest ZK image and initialDelaySeconds set to "60" for the readiness probe, I was able to run install/uninstall tests continuously (2 sets of 30 iterations). I haven't seen pod1/pod2 crash with the error "My id 2 not in the peer list".
Question: will setting the readiness probe's initial delay to 60 seconds have any impact on the functionality of the ZK ensemble?
Another concern: during the first set of tests I saw the other issue, where the ZK server in pod0 wasn't running and this caused an issue in ZK pod 1:
2021-04-20 11:34:50,819 [myid:1] - WARN [NIOWorkerThread-2:NIOServerCnxn@373] - Close of session 0x0
java.io.IOException: ZooKeeperServer not running
at org.apache.zookeeper.server.NIOServerCnxn.readLength(NIOServerCnxn.java:544)
at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:332)
at org.apache.zookeeper.server.NIOServerCnxnFactory$IOWorkRequest.doWork(NIOServerCnxnFactory.java:522)
at org.apache.zookeeper.server.WorkerService$ScheduledWorkRequest.run(WorkerService.java:154)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.base/java.lang.Thread.run(Unknown Source)
In a zookeeper cluster deployment, the second pod starts only after the first pod is up and running. Is that not happening in your case?
Also, initialDelaySeconds only ensures that the probes start after that time. Other than that it won't have any effect.
It looks like the ZK-0 pod was in Running state even though the ZK server wasn't running and the logs show that error; similarly the ZK-1 pod was in Running state with errors. You can see the pod statuses in the desc-output I attached.
Are you seeing these errors with the increased initial delay?
@anishakj yes, it happened in the set of tests I did after adding the initial delay, but it is also not seen every time.
By default initialDelaySeconds is 10. Could you please set it to 30 and give it a try?
The ZK-1 pod got into CrashLoopBackOff a couple of times but eventually worked; here is the log. This is the new fix, right?
+ source /conf/env.sh
++ DOMAIN=fed-kafka-affirmedzk-headless.fed-kafka.svc.cluster.local
++ QUORUM_PORT=2888
++ LEADER_PORT=3888
++ CLIENT_HOST=fed-kafka-affirmedzk-client
++ CLIENT_PORT=2181
++ ADMIN_SERVER_HOST=fed-kafka-affirmedzk-admin-server
++ ADMIN_SERVER_PORT=8080
++ CLUSTER_NAME=fed-kafka-affirmedzk
++ CLUSTER_SIZE=3
+ source /usr/local/bin/zookeeperFunctions.sh
++ set -ex
++ hostname -s
+ HOST=fed-kafka-affirmedzk-1
+ DATA_DIR=/data
+ MYID_FILE=/data/myid
+ LOG4J_CONF=/conf/log4j-quiet.properties
+ DYNCONFIG=/data/zoo.cfg.dynamic
+ STATIC_CONFIG=/data/conf/zoo.cfg
+ [[ fed-kafka-affirmedzk-1 =~ (.*)-([0-9]+)$ ]]
+ NAME=fed-kafka-affirmedzk
+ ORD=1
+ MYID=2
+ WRITE_CONFIGURATION=true
+ REGISTER_NODE=true
+ ONDISK_MYID_CONFIG=false
+ ONDISK_DYN_CONFIG=false
+ '[' -f /data/myid ']'
++ cat /data/myid
+ EXISTING_ID=2
+ [[ 2 == \2 ]]
+ [[ -f /data/conf/zoo.cfg ]]
+ ONDISK_MYID_CONFIG=true
+ '[' -f /data/zoo.cfg.dynamic ']'
+ set +e
+ [[ -n '' ]]
+ set -e
+ set +e
+ nslookup fed-kafka-affirmedzk-headless.fed-kafka.svc.cluster.local
+ [[ 1 -eq 0 ]]
+ nslookup fed-kafka-affirmedzk-headless.fed-kafka.svc.cluster.local
+ grep -q 'server can'''t find fed-kafka-affirmedzk-headless.fed-kafka.svc.cluster.local'
Server:    10.96.0.10
Address:   10.96.0.10#53
** server can't find fed-kafka-affirmedzk-headless.fed-kafka.svc.cluster.local: NXDOMAIN
+ echo 'there is no active ensemble'
+ ACTIVE_ENSEMBLE=false
+ [[ true == true ]]
+ [[ false == true ]]
+ WRITE_CONFIGURATION=true
+ [[ false == false ]]
+ REGISTER_NODE=false
+ [[ true == true ]]
+ echo 'Writing myid: 2 to: /data/myid.'
+ echo 2
+ [[ 2 -eq 1 ]]
+ [[ false == true ]]
+ ZOOCFGDIR=/data/conf
+ export ZOOCFGDIR
+ echo Copying /conf contents to writable directory, to support Zookeeper dynamic reconfiguration
+ [[ ! -d /data/conf ]]
+ echo Copying the /conf/zoo.cfg contents except the dynamic config file during restart
++ head -n -1 /conf/zoo.cfg
there is no active ensemble
Writing myid: 2 to: /data/myid.
Copying /conf contents to writable directory, to support Zookeeper dynamic reconfiguration
Copying the /conf/zoo.cfg contents except the dynamic config file during restart
++ tail -n 1 /data/conf/zoo.cfg
+ echo -e '4lw.commands.whitelist=cons, envi, conf, crst, srvr, stat, mntr, ruok dataDir=/data standaloneEnabled=false reconfigEnabled=true skipACL=yes metricsProvider.className=org.apache.zookeeper.metrics.prometheus.PrometheusMetricsProvider metricsProvider.httpPort=7000 metricsProvider.exportJvmInfo=true initLimit=10 syncLimit=2 tickTime=2000 globalOutstandingLimit=1000 preAllocSize=65536 snapCount=10000 commitLogCount=500 snapSizeLimitInKb=4194304 maxCnxns=0 maxClientCnxns=60 minSessionTimeout=4000 maxSessionTimeout=40000 autopurge.snapRetainCount=3 autopurge.purgeInterval=1 quorumListenOnAllIPs=false admin.serverPort=8080\ndynamicConfigFile=/data/zoo.cfg.dynamic'
+ cp -f /conf/log4j.properties /data/conf
+ cp -f /conf/log4j-quiet.properties /data/conf
+ cp -f /conf/env.sh /data/conf
+ '[' -f /data/zoo.cfg.dynamic ']'
+ echo 'Node failed to register!'
Node failed to register!
+ exit 1
fed-kafka-affirmedzk-0                  1/1   Running   0   122m
fed-kafka-affirmedzk-1                  1/1   Running   2   121m
fed-kafka-affirmedzk-2                  1/1   Running   0   117m
fed-kafka-affirmedzk-5bd966c46d-5j6hh   1/1   Running   0   122m
Yes, with this fix the pod will become ready on restart. If it is working for you after a couple of pod restarts, could you please close the issue?
so far it is working fine. please close this bug
@priyavj08 Thanks for the confirmation. Please feel free to reopen if you see the same issue again.
Did anyone try setting initialDelaySeconds when using the zookeeper helm chart?
I don't see that option in pravega/zookeeper-operator chart v0.2.12 or later
@priyavj08 @anishakj
@iampranabroy Does this work for you: https://github.com/pravega/zookeeper-operator/blob/2403ac54739d55a9e97333a837a45fea9dc0a96c/charts/zookeeper/values.yaml#L25
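For reference, a minimal override along these lines should do it (a sketch only; the key path follows the probes.readiness.initialDelaySeconds mentioned earlier in this thread, so please double-check it against the values.yaml linked above for your chart version):

# values.yaml excerpt for the zookeeper chart (verify key names against the
# linked values.yaml for the chart version you deploy)
probes:
  readiness:
    initialDelaySeconds: 60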
I'm facing the same issue. @anishakj, even with a long initialDelaySeconds (20, 30, 60) for the liveness and readiness probes, I'm unable to successfully deploy a zookeeper cluster with replicas > 1. Any further hint on what could go wrong?
The issue stays the same:
2022-09-26 15:14:06,975 [myid:2] - ERROR [main:QuorumPeerMain@114] - Unexpected exception, exiting abnormally
java.lang.RuntimeException: My id 2 not in the peer list
at org.apache.zookeeper.server.quorum.QuorumPeer.start(QuorumPeer.java:1128)
at org.apache.zookeeper.server.quorum.QuorumPeerMain.runFromConfig(QuorumPeerMain.java:229)
at org.apache.zookeeper.server.quorum.QuorumPeerMain.initializeAndRun(QuorumPeerMain.java:137)
at org.apache.zookeeper.server.quorum.QuorumPeerMain.main(QuorumPeerMain.java:91)
The second replica tries to start and ends up in a CrashLoopBackOff.
Zookeeper-Operator is indirectly deployed using the Solr-Operator, version 0.2.14
Hey @mmoscher - Can you please try setting only initialDelaySeconds to 60 for the readiness probe (probes.readiness.initialDelaySeconds), as suggested above, and see if it works?
@priyavj08 @anishakj - Did you try any additional changes?
Issue solved - at least for me, I think.
There were two separate issues (on my side) that kept me running into the error mentioned above. Reading this thread led me in a (somewhat) wrong direction, probing/testing the wrong solution.
Issue 1 - wrong network policies
My deployment blocks any egress/ingress traffic that is not explicitly allowed, using network policies. For zookeeper-internal traffic, i.e. zookeeper <-> zookeeper, I had one policy in place allowing port 2181 (client connections) only. However, for leader election this is not enough. After opening ports 2888 (quorum), 3888 (leader election), 7000 (metrics) and 8080 (admin), a fresh (new, unique) zookeeper cluster was able to bootstrap in the same namespace :facepalm: However, my solr-cloud-related zookeeper cluster kept failing, even after deleting the zookeeper cluster resource and letting the solr-operator recreate it automatically.
Issue 2 - old state/wrong data on PVCs
My zookeeper and solr clusters keep their PVCs after deletion and reuse them when redeployed (due to dataStorage.persistent.reclaimPolicy: "Retain"). After dropping all related PVCs and bootstrapping a fresh solrcloud cluster, the zookeeper cluster bootstraps successfully :thinking: Furthermore, this explains why I was able to successfully deploy a zookeeper cluster in my local dev environment, as we do not use persistent storage during development.
It seems that zookeeper stores some data about its ID/state on disk, which can lead to later failures if it is corrupt and not cleaned up correctly. Unfortunately, I did not inspect the volumes and their data.
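If anyone wants to avoid the stale-PVC trap entirely, the ZookeeperCluster CR appears to expose a reclaim policy for its data volumes; a sketch (field names are my assumption based on the operator docs, so please verify against your operator version):

# Sketch of a ZookeeperCluster spec that does not retain old volumes
apiVersion: zookeeper.pravega.io/v1beta1
kind: ZookeeperCluster
metadata:
  name: zookeeper
spec:
  replicas: 3
  persistence:
    reclaimPolicy: Delete   # drop the PVCs when the cluster is deleted, so a
                            # redeploy cannot pick up stale myid/dynamic-config data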
Maybe @priyavj08 and/or @anishakj have a clue/guess on this.
TL;DR: after setting correct network policies and cleaning up old (corrupt) data, i.e. all zookeeper-related PVCs, I was able to successfully bootstrap a zookeeper cluster (4 times in a row, same namespace and same k8s cluster). No need to adjust the liveness or readiness probes.
@iampranabroy hopefully these tips help you!
Thanks much @mmoscher. Is it possible for you to share the updated network policy that you applied for zookeeper <-> zookeeper internal traffic?
@iampranabroy sure, here we go:
We have two policies in place related to solr/zookeeper: one (a) to allow traffic between the zookeeper members themselves (z<->z), and another (b) to allow traffic from solr to the zookeeper (s->z) pods.
Note: we block all egress traffic from all pods by default, following the "default-deny-all-egress-traffic" principle. If you're doing it the other way around, e.g. blocking ingress traffic, you need to change the policies accordingly. Furthermore, the solr instances accessing your zookeeper pods need to have the custom label allow-zookeeper-access: "true" set (see the pod-template excerpt after the policies below).
a)
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-zookeeper-access-zookeeper
spec:
  egress:
    - ports:
        - port: 2181
          protocol: TCP
        - port: 2888
          protocol: TCP
        - port: 3888
          protocol: TCP
        - port: 7000
          protocol: TCP
        - port: 8080
          protocol: TCP
      to:
        - podSelector:
            matchLabels:
              kind: ZookeeperMember
              technology: zookeeper
  podSelector:
    matchLabels:
      kind: ZookeeperMember
      technology: zookeeper
  policyTypes:
    - Egress
b)
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-solr-access-zookeeper
spec:
  egress:
    - ports:
        - port: 2181
          protocol: TCP
        - port: 7000
          protocol: TCP
        - port: 8080
          protocol: TCP
      to:
        - podSelector:
            matchLabels:
              kind: ZookeeperMember
              technology: zookeeper
  podSelector:
    matchLabels:
      allow-zookeeper-access: "true"
  policyTypes:
    - Egress
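And for completeness, the client pods (solr in our case) just need the label that policy (b) selects on, set in their pod template; roughly like this (where exactly you set pod labels depends on how solr is deployed):

# Excerpt of a client Deployment/StatefulSet pod template allowed by policy (b)
template:
  metadata:
    labels:
      allow-zookeeper-access: "true"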
Good luck :crossed_fingers:
Thanks, @mmoscher, for sharing the details. A good point to keep in mind about NetworkPolicy for the zookeeper and solr clusters' internal communication.
In my case, the zookeeper and solr clusters are in the same namespace; they sometimes come up fine, but I notice this error a few times. The error itself is not consistent, making it hard to debug.
Description
We are using zookeeper v0.2.9. Sometimes (not in all environments) the zookeeper-1 pod is unable to start due to a RuntimeException. Pod's log:
Zookeeper cluster CRD description
Previously we used ZK v0.2.7 and this issue did not occur. I also tried the fix described in issue #259, but it didn't help.
Importance
Blocker issue. We need some fixes related to the 0.2.9 version (https://github.com/pravega/zookeeper-operator/issues/257), so the upgrade is required.
Location
(Where is the piece of code, package, or document affected by this issue?)
Suggestions for an improvement
(How do you suggest to fix or proceed with this issue?)