kontena / pharos-cluster

Pharos - The Kubernetes Distribution
https://k8spharos.dev/
Apache License 2.0

Defrag etcd automatically #888

Open jakolehm opened 5 years ago

jakolehm commented 5 years ago

What would you like to be added:

Defrag etcd automatically: https://coreos.com/etcd/docs/latest/op-guide/maintenance.html#defragmentation

Why is this needed:

Kubernetes already performs compaction automatically, but not defragmentation, which may block etcd for a longer period.
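
For reference, the manual operation such automation would wrap is the `etcdctl defrag` call described in the linked maintenance guide. This is only a sketch; the endpoint and certificate paths are illustrative, not the ones Pharos actually uses:

```sh
# Run a defrag against a single member over the v3 API.
# Endpoint and certificate paths are placeholders.
ETCDCTL_API=3 etcdctl defrag \
  --endpoints https://127.0.0.1:2379 \
  --cacert /etc/etcd/ca.pem \
  --cert /etc/etcd/client.pem \
  --key /etc/etcd/client-key.pem
```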

Timer commented 5 years ago

The snapshot count also seems exceptionally high at 100,000 -- the default (suggested) value looks to be 10,000: https://coreos.com/etcd/docs/latest/tuning.html#snapshots

This causes more disk utilization than necessary.
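
For illustration, if Pharos wanted to pin the snapshot count explicitly rather than rely on the built-in default, it would be a single flag on the etcd invocation. The other flags and paths below are placeholders, not what configure-etcd.sh actually writes:

```sh
# Sketch only: start etcd with an explicit --snapshot-count instead of the
# built-in default. Name, data dir and URLs are illustrative.
etcd \
  --name "$(hostname)" \
  --data-dir /var/lib/etcd \
  --snapshot-count 10000 \
  --listen-client-urls https://127.0.0.1:2379 \
  --advertise-client-urls https://127.0.0.1:2379
```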

jnummelin commented 5 years ago

The snapshot count also seems exceptionally high at 100,000

Pharos should be using the etcd default; we don't set that option at all: https://github.com/kontena/pharos-cluster/blob/master/lib/pharos/scripts/configure-etcd.sh

@Timer How did you get to 100,000 for snapshot count?

Having read some of the comments in K8s issues/PRs, I'm not at all convinced that we should run defrag periodically:

defrag is not super efficient and is not designed to run frequently. https://github.com/kubernetes/kubernetes/pull/45090#issuecomment-298067076

Just a reminder: defrag is a stop the world operation. If you have a db that is actually 1GB or so, defrag can freeze everything for 10+ seconds, especially when you only run 1 etcd node. https://github.com/kubernetes/kubernetes/pull/45090#issuecomment-306872463
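
For what it's worth, one way to gauge whether a defrag would even be worthwhile is to check the reported DB size first and only defrag when it has grown large. The endpoint and certificate paths below are illustrative:

```sh
# Show each endpoint's on-disk DB size (among other status fields) so we can
# decide whether a stop-the-world defrag is worth the pause.
ETCDCTL_API=3 etcdctl endpoint status \
  --endpoints https://127.0.0.1:2379 \
  --cacert /etc/etcd/ca.pem \
  --cert /etc/etcd/client.pem \
  --key /etc/etcd/client-key.pem \
  --write-out table
```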

Timer commented 5 years ago

@Timer How did you get to 100,000 for snapshot count? @jnummelin

I saw this in the logs when the etcd server was booting.

Having read some of the comments in K8s issues/PRs, I'm not at all convinced that we should run defrag periodically

"Stop the world" only applies to the control components, so all workloads should continue operating nominally.