This project contains tools to facilitate the deployment of Apache ZooKeeper on Kubernetes using StatefulSets. It requires Kubernetes 1.7 or greater.
The docker directory contains the Makefile for a Docker image that runs a ZooKeeper server using some custom scripts.
The manifests directory contains several Kubernetes manifests that can be used for demonstration purposes or production deployments. If you primarily deploy manifests directly, you can modify any of these to fit your use case.
The helm directory contains a Helm chart that deploys a ZooKeeper ensemble.
Regardless of whether you use manifests or helm to deploy your ZooKeeper ensemble, there are some common administration and configuration items that you should be aware of.
As noted in the limitations section, ZooKeeper membership can't be dynamically configured using the
latest stable version. You need to select an ensemble size that suits your use case. For demonstration purposes, or if
you are willing to tolerate at most one planned or unplanned failure, you should select an ensemble size of 3. This is
done by setting the spec.replicas
field of the StatefulSet to 3,
```yaml
apiVersion: apps/v1beta1
kind: StatefulSet
metadata:
  name: zk
spec:
  serviceName: zk-hs
  replicas: 3
  ...
```
and passing 3 as the --servers
parameter to the start-zookeeper
script.
```yaml
apiVersion: apps/v1beta1
kind: StatefulSet
metadata:
  name: zk
spec:
  ...
        command:
        - sh
        - -c
        - "start-zookeeper \
          --servers=3 \
          ...
```
For production use cases, 5 servers may be desirable. This allows you to tolerate one planned and one unplanned failure.
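For example, a 5 server ensemble sets both values to 5. The sketch below mirrors the 3 server manifests above and shows only the relevant fields:

```yaml
apiVersion: apps/v1beta1
kind: StatefulSet
metadata:
  name: zk
spec:
  serviceName: zk-hs
  replicas: 5
  ...
        command:
        - sh
        - -c
        - "start-zookeeper \
          --servers=5 \
          ...
```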
While ZooKeeper periodically snapshots all of its data to its data directory, the entire working data set must fit in
the heap. The --heap
parameter of the start-zookeeper
script controls the heap size of the ZooKeeper servers,
```yaml
apiVersion: apps/v1beta1
kind: StatefulSet
metadata:
  name: zk
  ...
        command:
        - sh
        - -c
        - "start-zookeeper \
          ...
          --heap=512M \
          ...
```
and the spec.template.spec.containers[0].resources.requests.memory
field controls the amount of memory requested for the container that runs the JVM process.
```yaml
apiVersion: apps/v1beta1
kind: StatefulSet
metadata:
  name: zk
spec:
  ...
      containers:
      - name: kubernetes-zookeeper
        imagePullPolicy: Always
        image: "gcr.io/google_containers/kubernetes-zookeeper:1.0-3.4.10"
        resources:
          requests:
            memory: "1Gi"
  ...
```
You should probably not use heap sizes larger than 8 GiB. For production deployments you should consider setting the requested memory to the maximum of 2 GiB and 1.5 times the size of the configured JVM heap.
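For example, under that guideline a 4 GiB heap pairs with a 6 GiB memory request. The sketch below is illustrative sizing only, not a recommendation for your workload:

```yaml
apiVersion: apps/v1beta1
kind: StatefulSet
metadata:
  name: zk
spec:
  ...
      containers:
      - name: kubernetes-zookeeper
        resources:
          requests:
            memory: "6Gi"    # 1.5 * the 4 GiB heap, and above the 2 GiB floor
        command:
        - sh
        - -c
        - "start-zookeeper \
          --heap=4G \
          ...
```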
ZooKeeper is not a CPU intensive application. For a production deployment you should start with 2 CPUs and adjust as
necessary. For a demonstration deployment, you can set the CPU request as low as 0.5. The amount of CPU is configured by
setting the StatefulSet's spec.template.spec.containers[0].resources.requests.cpu
field.
```yaml
apiVersion: apps/v1beta1
kind: StatefulSet
metadata:
  name: zk
spec:
  ...
      containers:
      - name: kubernetes-zookeeper
        imagePullPolicy: Always
        image: "gcr.io/google_containers/kubernetes-zookeeper:1.0-3.4.10"
        resources:
          requests:
            memory: "4Gi"
            cpu: "2"
  ...
```
The Headless Service that controls the domain of the ensemble must have two ports. The server
port is used for
inter-server communication, and the leader-election
port is used to perform leader election.
```yaml
apiVersion: v1
kind: Service
metadata:
  name: zk-hs
  labels:
    app: zk
spec:
  ports:
  - port: 2888
    name: server
  - port: 3888
    name: leader-election
  clusterIP: None
  selector:
    app: zk
```
These ports must correspond to the container ports in the StatefulSet's .spec.template
and the parameters passed to
the start-zookeeper
script.
```yaml
apiVersion: apps/v1beta1
kind: StatefulSet
metadata:
  name: zk
spec:
  serviceName: zk-hs
  replicas: 3
  podManagementPolicy: Parallel
  updateStrategy:
    type: RollingUpdate
  template:
    ...
        ports:
        - containerPort: 2181
          name: client
        - containerPort: 2888
          name: server
        - containerPort: 3888
          name: leader-election
        command:
        - sh
        - -c
        - "start-zookeeper \
          ...
          --election_port=3888 \
          --server_port=2888 \
          ...
```
The Service used to load balance client connections has one port.
```yaml
apiVersion: v1
kind: Service
metadata:
  name: zk-cs
  labels:
    app: zk
spec:
  ports:
  - port: 2181
    name: client
  selector:
    app: zk
```
The client
port must correspond to the container port specified in the StatefulSet's .spec.template
and the
parameter passed to the start-zookeeper
script.
```yaml
apiVersion: apps/v1beta1
kind: StatefulSet
metadata:
  name: zk
spec:
  serviceName: zk-hs
  replicas: 3
  podManagementPolicy: Parallel
  updateStrategy:
    type: RollingUpdate
  template:
    ...
        ports:
        - containerPort: 2181
          name: client
        - containerPort: 2888
          name: server
        - containerPort: 3888
          name: leader-election
        command:
        - sh
        - -c
        - "start-zookeeper \
          ...
          --client_port=2181 \
          ...
```
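Clients inside the cluster reach the ensemble through the zk-cs Service. As a sketch, assuming the ensemble runs in the default namespace, the cluster uses the standard DNS suffix, and the ZooKeeper CLI is available on the container's PATH, you could connect through the Service's DNS name:

```
# Illustrative only: namespace, DNS suffix, and zkCli.sh availability may differ.
bash$ kubectl exec -it zk-0 -- zkCli.sh -server zk-cs.default.svc.cluster.local:2181
```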
Currently, the use of Persistent Volumes to provide durable, network attached storage is mandatory. If you use the
provided image with emptyDir volumes, you will likely suffer data loss. The storage
field of the StatefulSet's
spec.volumeClaimTemplates
controls the amount of storage allocated.
```yaml
volumeClaimTemplates:
- metadata:
    name: datadir
  spec:
    accessModes: [ "ReadWriteOnce" ]
    resources:
      requests:
        storage: 20Gi
```
The volumeMounts
in the StatefulSet's spec.template
control the mount point of the PersistentVolumes requested by
the PersistentVolumeClaims,
```yaml
volumeMounts:
- name: datadir
  mountPath: /var/lib/zookeeper
```
and the parameters passed to the start-zookeeper
script instruct the ZooKeeper server to use the PersistentVolume
backed directory for its snapshots and write ahead log.
```
--data_dir          The directory where the ZooKeeper process will store its
                    snapshots. The default is /var/lib/zookeeper/data. This
                    directory must be backed by a persistent volume.

--data_log_dir      The directory where the ZooKeeper process will store its
                    write ahead log. The default is
                    /var/lib/zookeeper/data/log. This directory must be
                    backed by a persistent volume.
```
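If you keep the defaults, both directories live under the single datadir mount shown above. A sketch of passing the flags explicitly with their default values, inside the container command of the StatefulSet:

```yaml
        command:
        - sh
        - -c
        - "start-zookeeper \
          --data_dir=/var/lib/zookeeper/data \
          --data_log_dir=/var/lib/zookeeper/data/log \
          ...
```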
Note that, because we use network attached storage, there is no benefit to using multiple PersistentVolumes to segregate the snapshots and write ahead log onto separate storage media.
ZooKeeper does not use wall clock time. Rather, it uses internal ticks that are based on an elapsed number of
milliseconds. The various timeouts for the ZooKeeper ensemble can be controlled by the parameters passed to the
start-zookeeper
script.
```
--tick_time             The length of a ZooKeeper tick in ms. The default is
                        2000.

--init_limit            The number of ticks an ensemble member is allowed to
                        take to connect and sync to the leader. The default
                        is 10.

--sync_limit            The number of ticks a follower may lag behind the
                        leader before it is considered out of sync. The
                        default is 5.

--max_session_timeout   The maximum time in milliseconds for a client session
                        timeout. The default value is 20 * tick time.

--min_session_timeout   The minimum time in milliseconds for a client session
                        timeout. The default value is 2 * tick time.
```
If you notice an excessive number of client timeouts or leader elections, you can tune these knobs as appropriate to increase stability, at the cost of slower failure detection.
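For example, to double the tick length and scale the session timeout bounds with it, you might pass values like the following (illustrative numbers that keep the default 2x and 20x multiples of the tick time):

```yaml
        command:
        - sh
        - -c
        - "start-zookeeper \
          --tick_time=4000 \
          --min_session_timeout=8000 \
          --max_session_timeout=80000 \
          ...
```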
By default, ZooKeeper will never delete its own snapshots, though it will compact its write ahead log. You must either
implement your own mechanism for retaining and removing snapshots, or configure the snapshot retention via the
start-zookeeper
script.
```
--snap_retain_count     The maximum number of snapshots the ZooKeeper process
                        will retain if purge_interval is greater than 0. The
                        default is 3.

--purge_interval        The number of hours the ZooKeeper process will wait
                        between purging its old snapshots. If set to 0, old
                        snapshots will never be purged. The default is 0.
```
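For example, to keep the three most recent snapshots and purge older ones every 12 hours (an illustrative setting, not a recommendation):

```yaml
        command:
        - sh
        - -c
        - "start-zookeeper \
          --snap_retain_count=3 \
          --purge_interval=12 \
          ...
```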
ZooKeeper is not sensitive to the order in which Pods are started. All Pods in the StatefulSet may be launched and
terminated in parallel. This is accomplished by setting the spec.podManagementPolicy to Parallel.
```yaml
apiVersion: apps/v1beta1
kind: StatefulSet
metadata:
  name: zk
spec:
  ...
  podManagementPolicy: Parallel
  ...
```
The Update Strategy for the StatefulSet is set to RollingUpdate. This enables the rolling update feature for
StatefulSets in Kubernetes 1.7, and it allows modifications to the StatefulSet to be propagated to its Pods.
```yaml
apiVersion: apps/v1beta1
kind: StatefulSet
metadata:
  name: zk
spec:
  ...
  updateStrategy:
    type: RollingUpdate
  ...
```
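With this strategy in place, re-applying a modified manifest (or patching the StatefulSet) triggers a rolling update. A sketch of applying a change and watching it complete; the manifest file name is a placeholder, and kubectl rollout status requires a kubectl version that supports StatefulSets:

```
# zookeeper.yaml is a placeholder name for your StatefulSet manifest.
bash$ kubectl apply -f zookeeper.yaml
bash$ kubectl rollout status statefulset/zk
```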
The zookeeper-ready
script can be used to check the liveness and readiness of the ZooKeeper process. The example below
demonstrates how to configure liveness and readiness probes for the Pods in the StatefulSet.
```yaml
readinessProbe:
  exec:
    command:
    - sh
    - -c
    - "zookeeper-ready 2181"
  initialDelaySeconds: 15
  timeoutSeconds: 5
livenessProbe:
  exec:
    command:
    - sh
    - -c
    - "zookeeper-ready 2181"
  initialDelaySeconds: 15
  timeoutSeconds: 5
```
An anti-affinity rule is specified to ensure that the ZooKeeper servers are spread across nodes in the Kubernetes cluster. This makes the ensemble resilient to node failures.
```yaml
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchExpressions:
        - key: "app"
          operator: In
          values:
          - zk
      topologyKey: "kubernetes.io/hostname"
```
The Log Level configuration may be modified via the log_level flag supplied to the start-zookeeper
script. However,
the location of the log output is not modifiable. The ZooKeeper process must be run in the foreground, and the log
information will be shipped to stdout. This is considered to be a best practice for containerized applications, and it
allows users to make use of the log rotation and retention infrastructure that already exists for K8s.
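Because the process logs to stdout, standard Kubernetes tooling is all that is needed to read the output. A sketch (the INFO value is just an example of a standard log level):

```
# Set the level by adding --log_level=INFO to the start-zookeeper command in
# the StatefulSet template, then read the output with kubectl logs.
bash$ kubectl logs zk-0
bash$ kubectl logs -f zk-0
```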
The zookeeper-metrics
script can be used to retrieve metrics from the ZooKeeper process and print them to stdout. A
recurring Kubernetes job can be used to collect these metrics and provide them to a collector.
```
bash$ kubectl exec zk-0 zookeeper-metrics
zk_version 3.4.10-1757313, built on 08/23/2016 06:50 GMT
zk_avg_latency 0
zk_max_latency 0
zk_min_latency 0
zk_packets_received 21
zk_packets_sent 20
zk_num_alive_connections 1
zk_outstanding_requests 0
zk_server_state follower
zk_znode_count 4
zk_watch_count 0
zk_ephemerals_count 0
zk_approximate_data_size 27
zk_open_file_descriptor_count 39
zk_max_file_descriptor_count 1048576
```
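A minimal sketch of such a recurring job as a CronJob. The API version depends on your cluster version, the image is a placeholder that must provide kubectl, and the job's service account needs permission to exec into the zk Pods:

```yaml
# Illustrative only: scrape metrics from zk-0 every five minutes and print them
# to the job Pod's stdout, where the cluster's log pipeline can pick them up.
apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: zk-metrics
spec:
  schedule: "*/5 * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
          - name: zk-metrics
            image: kubectl-image:latest    # placeholder; any image that provides kubectl
            command:
            - sh
            - -c
            - "kubectl exec zk-0 zookeeper-metrics"
```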