Use memory storage for etcd

aojea commented 4 years ago

What would you like to be added:

Configure etcd storage in memory to improve the performance

Why is this needed:

etcd generates very high disk I/O, which can cause performance issues, especially when several kind clusters run on the same system: many processes end up writing to disk, adding latency and affecting other applications that use the same disk.

Since https://github.com/kubernetes-sigs/kind/pull/779, /var is no longer on the container filesystem, which improved performance. However, the etcd storage is still on disk, as we can see in the pod manifest:

  etcd-data:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/etcd
    HostPathType:  DirectoryOrCreate

Ideally, /var/lib/etcd/ should be in memory, since the clusters are meant to be created and destroyed and the data does not need to persist.

I have doubts about the best approach:

Should this be changed in kind, by creating a new tmpfs volume for etcd?
Can this be changed in kubeadm, so that we can mount etcd-data in memory or in another location of the node that is in memory?
...

NOTES

Accumulated etcd I/O (iotop -a):

26206 be/4 root          0.00 B    192.00 K  0.00 %  1.04 % etcd --advertise-client-urls=htt~e=/etc/kubernetes/pki/etcd/ca.crt
26196 be/4 root          0.00 B    224.00 K  0.00 %  0.98 % etcd --advertise-client-urls=htt~e=/etc/kubernetes/pki/etcd/ca.crt
26288 be/4 root          0.00 B    216.00 K  0.00 %  0.94 % etcd --advertise-client-urls=htt~e=/etc/kubernetes/pki/etcd/ca.crt
26249 be/4 root          0.00 B    180.00 K  0.00 %  0.88 % etcd --advertise-client-urls=htt~e=/etc/kubernetes/pki/etcd/ca.crt
26266 be/4 root          0.00 B     52.00 K  0.00 %  0.47 % etcd --advertise-client-urls=htt~e=/etc/kubernetes/pki/etcd/ca.crt
26187 be/4 root          0.00 B     52.00 K  0.00 %  0.42 % etcd --advertise-client-urls=htt~e=/etc/kubernetes/pki/etcd/ca.crt
26267 be/4 root          0.00 B     48.00 K  0.00 %  0.37 % etcd --advertise-client-urls=htt~e=/etc/kubernetes/pki/etcd/ca.crt
26192 be/4 root          0.00 B     60.00 K  0.00 %  0.36 % etcd --advertise-client-urls=htt~e=/etc/kubernetes/pki/etcd/ca.crt
26263 be/4 root          0.00 B     52.00 K  0.00 %  0.31 % etcd --advertise-client-urls=htt~e=/etc/kubernetes/pki/etcd/ca.crt
26261 be/4 root          0.00 B     64.00 K  0.00 %  0.28 % etcd --advertise-client-urls=htt~e=/etc/kubernetes/pki/etcd/ca.crt
19155 be/4 root          0.00 B      0.00 B  0.00 %  0.19 % [kworker/1:2]
26286 be/4 root          0.00 B     28.00 K  0.00 %  0.18 % etcd --advertise-client-urls=htt~e=/etc/kubernetes/pki/etcd/ca.crt
26289 be/4 root          0.00 B     32.00 K  0.00 %  0.16 % etcd --advertise-client-urls=htt~e=/etc/kubernetes/pki/etcd/ca.crt
  578 be/4 root          0.00 B      2.00 M  0.00 %  0.16 % [btrfs-transacti]
26268 be/4 root          0.00 B     28.00 K  0.00 %  0.11 % etcd --advertise-client-urls=htt~e=/etc/kubernetes/pki/etcd/ca.crt

aojea commented 4 years ago

/cc @BenTheElder @neolit123

neolit123 commented 4 years ago

Can this be changed in kubeadm, so that we can mount etcd-data in memory or in another location of the node that is in memory?

kubeadm passes --data-dir=/var/lib/etcd to etcd and mounts this directory using hostPath. we can just try:

        emptyDir:
          medium: Memory

but this means kubeadm init / join commands need to: 1) use phases to skip / customize the "manifests" phase or 2) deploy etcd, patch manifest, restart static pod

etcd generates very high disk I/O, which can cause performance issues, especially when several kind clusters run on the same system: many processes end up writing to disk, adding latency and affecting other applications that use the same disk.

k/k master just moved to 3.3.15, while 1.15 uses an older version. is this a regression? an idle cluster should not have high disk i/o.

if this disk i/o suddenly became a problem this should be in a k/k issue.

BenTheElder commented 4 years ago

Etcd is going to be writing all the constantly updated objects, no? (Eg node status)

It would be trivial to test kind with memory backed etcd by adjusting node creation, but I don't think you'd ever run a real cluster not on disk... 🤔

aojea commented 4 years ago

Etcd is going to be writing all the constantly updated objects, no? (Eg node status)

yeah, data needs to persist to disk to provide consistency

It would be trivial to test kind with memory backed etcd by adjusting node creation, but I don't think you'd ever run a real cluster not on disk... 🤔

Absolutely, real clusters must use disks; this is only meant for testing. My rationale is that these k8s clusters are ephemeral, so the etcd clusters don't need to "persist" data on disk.

Can this be patched with the kind config? It would be enough to pass a directory other than --data-dir=/var/lib/etcd

BenTheElder commented 4 years ago

You can test this more or less with no changes by making a tmpfs on the host and configuring it to mount there on a control plane.

You could also edit the kind control plane creation process to put a tmpfs here on the node

We should experiment, but I think we do eventually want durable etcd for certain classes of testing..

BenTheElder commented 4 years ago

Also worth pointing out:

our CI is backed by SSD
I'm not aware of any other cluster implementation not backing etcd with disk, including eg hack/local-up-cluster

aojea commented 4 years ago

yeah, for k8s CI it is not a big problem, but for users that run kind locally, it is. It took me a while to understand what was slowing down my system, until I found that my kind clusters were causing high latency on one of my disks. I just want to test and document the differences :)

aojea commented 4 years ago

ok, here is how to run etcd using memory storage for reference

Create the memory storage

sudo mkdir /tmp/etcd
sudo mount -t tmpfs tmpfs /tmp/etcd

Mount it on the control-plane nodes

kind: Cluster
apiVersion: kind.sigs.k8s.io/v1alpha3
nodes:
- role: control-plane
  extraMounts:
  - containerPath: /var/lib/etcd
    hostPath: /tmp/etcd
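
The host-side steps above can also be wrapped in two small helpers. This is only a sketch under assumptions: the names ETCD_TMPFS_DIR, ETCD_TMPFS_SIZE, RUN, etcd_tmpfs_up and etcd_tmpfs_down are made up for illustration and are not kind settings, and the 512m cap is an arbitrary starting point, not a measured requirement. The size= mount option limits how much RAM the tmpfs can consume.

```shell
#!/bin/sh
# Sketch: create and remove a size-capped tmpfs for the etcd data directory.
# All variable and function names here are illustrative (not part of kind).
# Set RUN=echo to print the commands instead of executing them.
ETCD_TMPFS_DIR="${ETCD_TMPFS_DIR:-/tmp/etcd}"
ETCD_TMPFS_SIZE="${ETCD_TMPFS_SIZE:-512m}"

etcd_tmpfs_up() {
  # Create the mount point, then mount a tmpfs capped at ETCD_TMPFS_SIZE on it.
  ${RUN:-} sudo mkdir -p "$ETCD_TMPFS_DIR" &&
  ${RUN:-} sudo mount -t tmpfs -o "size=$ETCD_TMPFS_SIZE" tmpfs "$ETCD_TMPFS_DIR"
}

etcd_tmpfs_down() {
  # Unmount the tmpfs; its contents (the etcd data) are lost at this point.
  ${RUN:-} sudo umount "$ETCD_TMPFS_DIR"
}
```

Running `RUN=echo etcd_tmpfs_up` shows the commands without touching the system. Call etcd_tmpfs_down only after deleting the cluster that mounts the directory.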

aojea commented 4 years ago

/reopen per conversation in slack https://kubernetes.slack.com/archives/CEKK1KTN2/p1570202642295000?thread_ts=1570196798.288800&cid=CEKK1KTN2

I'd like to find a way to make this easier to configure, mainly for people who want to use kind on their laptops and not in CI. etcd constantly writing directly to disk is not adding any benefit in this particular scenario.

aojea commented 4 years ago

You could also edit the kind control plane creation process to put a tmpfs here on the node

I think this will work

We should experiment, but I think we do eventually want durable etcd for certain classes of testing..

I was thinking more about this, and can't see the "durability" difference between using a folder inside the container or using a tmpfs volume for the etcd data dir, the data will be available as long as the container is alive, no?

However, etcd writing to a tmpfs volume will be a big performance improvement, at a cost of less memory available, of course

home/aojeagarcia/docker/volumes/5d2d2cab7dcb7c93b9a8a5f8591462caf4fbca5c332e663aa4628702b3d2dc50/_data/lib/etcd/member # du -sh *
1.5M    snap
245M    wal

neolit123 commented 4 years ago

However, etcd writing to a tmpfs volume will be a big performance improvement, at a cost of less memory available, of course

i'd be interested if this will prevent me from testing 3 CP setups with kind on my setup. it doesn't have RAM for 4 CPs :)

BenTheElder commented 4 years ago

I was thinking more about this, and can't see the "durability" difference between using a folder inside the container or using a tmpfs volume for the etcd data dir, the data will be available as long as the container is alive, no?

It's NOT a folder inside the container, it's on a volume.

When we fix kind to survive host reboots (and we will) then this will break it again.

It also will consume more RAM of course.

aojea commented 4 years ago

It's NOT a folder inside the container, it's on a volume.

https://github.com/kubernetes-sigs/kind/blob/master/pkg/internal/cluster/providers/docker/provision.go#L164-L169

I see it now :man_facepalming:

aojea commented 4 years ago

can this be causing timeouts in the CI with slow disks?

BenTheElder commented 4 years ago

https://github.com/kubernetes-sigs/kind/issues/928#issuecomment-541964546

^^ possibly for istio, doesn't look like Kubernetes CI is seeing timeouts at this point. That's not the pattern with the broken pipe.

Even for istio, I doubt it's "because they aren't doing this" but it could be "because they are otherwise using too much IOPs for the allocated disks" IIRC they are also on GCP PD-SSD which is quite fast.

BenTheElder commented 4 years ago

for CI I think the better pattern I want to try is to use a pool of PDs from some storage class to replace the emptyDir.

I've been mulling how we could do this and persist some of the images in a clean and sane way, but imo this is well out of scope for the kind project.

aojea commented 4 years ago

for CI I think the better pattern I want to try is to use a pool of PDs from some storage class to replace the emptyDir.

I've been mulling how we could do this and persist some of the images in a clean and sane way, but imo this is well out of scope for the kind project.

I think that this is only an issue for people using kind in their laptops or workstations, totally agree with you on the CI use case

fejta-bot commented 4 years ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle stale

fejta-bot commented 4 years ago

Stale issues rot after 30d of inactivity. Mark the issue as fresh with /remove-lifecycle rotten. Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle rotten

BenTheElder commented 4 years ago

did we wind up testing this in CI?

aojea commented 4 years ago

did we wind up testing this in CI?

nope, what option do you want to test in the CI, using etcd in memory?

BenTheElder commented 4 years ago

nope, what option do you want to test in the CI, using etcd in memory?

yeah, we should see how it actually performs

aojea commented 4 years ago

nope, what option do you want to test in the CI, using etcd in memory?

yeah, we should see how it actually performs

hehe, when I was working on Midonet, it used zookeeper as the source of truth, and the CI started to fly once we put it in memory. IIRC etcd and zookeeper need to flush the data to disk to guarantee consistency, which means a lot of IOPS; the improvement will be the difference in IOPS between memory and disk (SSD) ... that should be considerable

BenTheElder commented 4 years ago

theory is nice, measurements are better :-)

aojea commented 4 years ago

looking forward to it :smile:

fejta-bot commented 4 years ago

Rotten issues close after 30d of inactivity. Reopen the issue with /reopen. Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /close

k8s-ci-robot commented 4 years ago

@fejta-bot: Closing this issue.

aojea commented 4 years ago

/reopen /lifecycle frozen /assign

The goal is to do serious benchmarking, comparing runs with and without etcd on memory storage, to better understand the pros and cons.

The configuration to use etcd in memory, for reference:

 cat <<EOF > "${ARTIFACTS}/kind-config.yaml"
# config for 1 control plane node and 2 workers (necessary for conformance)
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
- role: worker
- role: worker
kubeadmConfigPatches:
- |
  kind: ClusterConfiguration
  metadata:
    name: config
  etcd:
    local:
      dataDir: "/tmp/lib/etcd"
EOF

k8s-ci-robot commented 4 years ago

@aojea: Reopened this issue.

BenTheElder commented 4 years ago

I still think this is a bad idea and conflicts with host reboot support.

Besides losing persistence you also consume more memory, and we're already allowing swap.

In CI the CI SSD should perform fairly well, locally it depends but memory tends to be more of an issue for users than disk.

aojea commented 4 years ago

I'm just curious about the difference and want to document it. I agree that this most likely is not going to be part of KIND, but it can make a difference for users whose CI has no SSD, for example.

I want to understand how much memory etcd allocates too :).

koxu1996 commented 4 years ago

@aojea Thank you for the snippet with tmpfs, now my cluster is running much smoother.

thavlik commented 4 years ago

kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
- role: worker
- role: worker
kubeadmConfigPatches:
- |
  kind: ClusterConfiguration
  metadata:
    name: config
  etcd:
    local:
      dataDir: "/tmp/lib/etcd"

Is this supposed to work still? It confers no performance increase for me. None of the manifests in this issue decreased the amount of time it took kind to create a cluster. The configs were accepted without errors, but there was no change in execution time.

I'm terribly interested in helping out with this feature. I've been working on https://github.com/midcontinentcontrols/kindest to assist with microservice development. Etcd initialization is a bottleneck with the dev workflow, and persistence is unnecessary.

aojea commented 4 years ago

None of the manifests in this issue decreased the amount of time it took kind to create a cluster

that's not the bottleneck when creating a cluster. This patch exists because etcd is very I/O intensive: if you are using slow disks, or a laptop with other apps running, you will notice the difference, but the time to create a cluster does not depend on it.

persistence is unnecessary.

well, for a CI or dev environment it may not be necessary, but any production cluster needs to persist the data 😅

BenTheElder commented 4 years ago

Persistence beyond host reboot was the most highly requested issue in the tracker, people do use kind outside of CI ...

For performance improvements to startup the most impact will be had improving the upstream bootstrapping / upstream component performance. You'll be hard pressed to find a kubeadm environment starting much faster than kind with the node image already downloaded...

Apiserver, kubeadm, kubelet etc. are all upstream and the majority of the boot time is spent on those things coming up.

thavlik commented 4 years ago

I've tried using tmpfs for Docker's data-root (silly, yes) and there is no performance benefit to that either, so I am wondering how exactly I should go about optimizing cluster creation. I am able to confirm that my tmpfs grows by about 1.2gb while my persistent disks are untouched by the cluster creation process. While the tmpfs grows, all cores are basically idle. Sometimes I will see a relevant process (e.g. kubeadm) jump to ~1% usage.

Any ideas? Obviously setting data-root is far from ideal. At this point I'm just trying to figure out how this all should behave.

BenTheElder commented 4 years ago

We've already tried this and taken nearly all of the obvious steps that don't require upstream changes. Boot time is very important to us.

As I said, the upstream Kubernetes components may be optimizable. The bootstrapping process with kubeadm is suspiciously long, but you'll have to track down what's slow yourself, we haven't gotten to this yet.

aojea commented 4 years ago

The bootstrapping process with kubeadm is suspiciously long

If security is not a concern, and if it is possible in kubeadm (I really don't know), you could skip the certificate generation ... Maybe it is possible to include some well-known certificate

warmchang commented 4 years ago

etcd has added an unsafe "--unsafe-no-fsync" flag that disables fsync.

FYI: https://github.com/etcd-io/etcd/pull/11946
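
For anyone who wants to try the flag with kind, a possible way to pass it through kubeadm is sketched below. It rests on assumptions: the etcd image used by the node must already ship --unsafe-no-fsync, and the kubeadm config API must accept etcd.local.extraArgs as a map (as in v1beta2/v1beta3). KIND_CONFIG is an illustrative variable name, not something kind reads.

```shell
#!/bin/sh
# Sketch: write a kind config that asks kubeadm to start etcd with
# --unsafe-no-fsync. Assumes extraArgs is a map and the etcd build has the flag.
# KIND_CONFIG is an illustrative name for the output path.
KIND_CONFIG="${KIND_CONFIG:-${TMPDIR:-/tmp}/kind-etcd-nofsync.yaml}"

cat > "$KIND_CONFIG" <<'EOF'
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
kubeadmConfigPatches:
- |
  kind: ClusterConfiguration
  etcd:
    local:
      extraArgs:
        unsafe-no-fsync: "true"
EOF

# Then create the cluster from it:
#   kind create cluster --config "$KIND_CONFIG"
```

As the flag name says, this trades durability for fewer disk syncs, so it only makes sense for throwaway clusters.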

BenTheElder commented 4 years ago

Yeah, we're very interested in that once it's available in kubeadm's etcd.

BenTheElder commented 3 years ago

Circling back because this came up again today: I experimented with tmpfs + the unsafe no fsync flag late last year and didn't see measurable improvements on my hardware (couple different dev machines). YMMV; this still doesn't seem to be a clear win even when persistence is not interesting, it depends on the usage and hardware.

aojea commented 3 years ago

for CIs like github actions there is a measurable difference when running the e2e test suite :)

dprotaso commented 2 years ago

for CIs like github actions there is a measurable difference when running the e2e test suite :)

Yeah - it's just another potential failure mode that would be nice to avoid

cnfatal commented 2 years ago

The config file below works well for running etcd in memory:

kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
kubeadmConfigPatches:
- |
  apiVersion: kubeadm.k8s.io/v1beta2
  kind: ClusterConfiguration
  etcd:
    local:
      dataDir: /tmp/etcd

The /tmp and /run directories in a kind node are mounted as tmpfs.

on podman : https://github.com/kubernetes-sigs/kind/blob/36f229f28eaa8a215ec76b2ce278be45a4590875/pkg/cluster/internal/providers/podman/provision.go#L195-L196

on docker:

https://github.com/kubernetes-sigs/kind/blob/5657682609bc9ae52c1adf7164fa05d394d8f9ca/pkg/cluster/internal/providers/docker/provision.go#L236-L237

aojea commented 1 year ago

as pointed out by Ben, we are going to take a performance hit because of etcd:

All single node v3.x clusters are affected. Fix is expected to come with a 4-10% performance degradation, making single node cluster performance more in line with multi-node clusters. No performance change is expected for multi-node clusters.

https://github.com/kubernetes/kubernetes/pull/112690

BenTheElder commented 1 year ago

This should work for all currently supported Kubernetes versions and is slightly terser:

kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
kubeadmConfigPatches:
- |
  kind: ClusterConfiguration
  etcd:
    local:
      dataDir: /tmp/etcd
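
To check that the data dir really ended up on tmpfs after creating a cluster with this config, something like the following sketch can help. Assumptions: the cluster uses the default name, so the node container is kind-control-plane; the node image provides GNU stat; etcd_fs_type and the DOCKER override are illustrative names, not part of kind.

```shell
#!/bin/sh
# Sketch: report the filesystem type backing etcd's data dir inside a kind node.
# DOCKER can be overridden (e.g. DOCKER=podman, or DOCKER=echo for a dry run).
# "kind-control-plane" assumes a cluster created with the default name.
etcd_fs_type() {
  node="${1:-kind-control-plane}"
  dir="${2:-/tmp/etcd}"
  # GNU stat: -f queries the filesystem, %T prints its type in human-readable form.
  ${DOCKER:-docker} exec "$node" stat -f -c %T "$dir"
}

# Usage, after `kind create cluster --config <file>`:
#   etcd_fs_type                      # default node and path
#   etcd_fs_type my-node /tmp/etcd    # explicit node container and path
```

If the directory is memory-backed, the command should report tmpfs; any other value suggests the patch was not applied to that node.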

BenTheElder commented 1 year ago

Maybe let's make a page to cover performance @aojea?

We have other related commonly discovered issues that are only in "known issues" currently. We could leave stub entries but move performance considerations to a new docs page that covers this technique + inotify limits etc.

I think the config to enable this is small enough to just document and it's too breaking to e.g. enable by default.

We can also suggest other tunable flags and host configs some of which kind shouldn't touch itself.

aojea commented 1 year ago

agree, these are recurrent questions, better to aggregate this information

kubernetes-sigs / kind

Use memory storage for etcd #845