kubernetes / kubeadm

Aggregator for issues filed against kubeadm

High etcd I/O with new (idle) kubeadm cluster #597

Closed: zatricky closed this issue 6 years ago

zatricky commented 6 years ago

What keywords did you search in kubeadm issues before filing this one?

various combinations of: kubernetes k8s etcd disk idle iops

Is this a BUG REPORT or FEATURE REQUEST?

BUG REPORT

Versions

kubeadm version: &version.Info{Major:"1", Minor:"8", GitVersion:"v1.8.5", GitCommit:"cce11c6a185279d037023e02ac5249e14daa22bf", GitTreeState:"clean", BuildDate:"2017-12-07T16:05:18Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"linux/amd64"}

Environment:

What happened?

Built new tiny cluster with calico networking:

- master: 1.5GB memory; 6x CPU cores; 20GB root; 100GB reserved for docker
- node: 2.5GB memory; 6x CPU cores; 20GB root; 100GB reserved for docker

Waited for all basic services to reach a stable Running state (about 30 minutes). Found the master at a loadavg of 1.0, and iotop shows etcd constantly reading and writing between 10kB/s and 50kB/s despite there being zero activity or cluster usage.
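For reference, a minimal sketch of how that measurement can be reproduced, assuming iotop is installed on the master and etcd is visible as a local process:

```
# Batch mode (-b), only processes doing I/O (-o), 5 samples 2s apart.
# Filter for etcd to see its read/write throughput per sample.
sudo iotop -b -o -n 5 -d 2 | grep -i etcd
```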

What you expected to happen?

If the storage were SSD-backed, the loadavg would likely be closer to nil, but I still feel this is an unacceptable situation. Once all basic services are in the Running state, the expectation is that an idle system would have loadavg and disk throughput both close to nil, with periodic maintenance-type activity as the only exception.

How to reproduce it (as minimally and precisely as possible)?

New basic kubeadm cluster with calico
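A hedged sketch of those repro steps; the pod CIDR is the one Calico's docs recommend, and the manifest URL is an assumption matching the v1.8/Calico v2.6 era (it has since moved):

```
# Sketch only: flags and the Calico manifest URL reflect the v1.8-era
# defaults and are assumptions; adjust for your versions.
kubeadm init --pod-network-cidr=192.168.0.0/16
kubectl apply -f https://docs.projectcalico.org/v2.6/getting-started/kubernetes/installation/hosted/kubeadm/1.6/calico.yaml
# Then leave the cluster idle and watch etcd I/O on the master.
```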

Anything else we need to know?

The backing storage is spindle-based, so it does not have SSD-type performance. I'm considering raising etcd's "--snapshot-count" parameter to 100 000 (the default, I believe, is 10 000). Given that there should be near-zero activity in the first place, though, I'm not sure that will make a difference. There are various resolved issues for etcd itself, but I currently believe this problem is more likely due to Kubernetes creating unnecessary work for etcd rather than etcd itself having a bug. I could be wrong, of course.

https://github.com/coreos/etcd/issues/3255
https://github.com/coreos/etcd/issues/4058
https://github.com/coreos/etcd/issues/2486
https://github.com/coreos/etcd/issues/2409
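For what it's worth, a sketch of how that flag change would be applied on a kubeadm master; the manifest path is the kubeadm default, and the flag value is the one discussed above:

```
# Sketch, not a tested recipe: kubeadm runs etcd as a static pod, so
# editing the manifest makes kubelet restart etcd with the new flag.
# Under the etcd container's command list, add:
#   - --snapshot-count=100000
sudo vi /etc/kubernetes/manifests/etcd.yaml
# After kubelet restarts the pod, confirm the flag was picked up:
ps aux | grep '[e]tcd' | grep -o -- '--snapshot-count=[0-9]*'
```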

luxas commented 6 years ago

ping @kubernetes/sig-scalability-bugs @wojtek-t @xiang90 @hongchaodeng @shyamjvs @porridge WDYT? I don't think this is kubeadm-specific, but let's try to figure out the root cause.

porridge commented 6 years ago

Unfortunately I have close to zero experience with etcd performance. 50kB/s does not strike me as very high. I think even a one-node cluster produces a constant stream of heartbeat updates that eventually turn into etcd writes. But I don't think that should happen every single second, so it does indeed seem somewhat high. A few questions, @zatricky:

In general, this does not seem sig-scalability related, as it's a tiny cluster and the node size seems sufficient to host all necessary addons. Unless answers to the above questions reveal some smoking gun, I'd ask sig-api-machinery since I think they own the etcd integration?
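One way to put numbers on the heartbeat-write theory is to sample etcd's own metrics. A sketch, assuming the v1.8-era kubeadm default of etcd listening without TLS on localhost (TLS-secured clusters would need the --cacert/--cert/--key flags pointing at /etc/kubernetes/pki/etcd); the metric names are the standard etcd v3 ones:

```
# Proposals committed ~= writes; WAL fsync count tracks disk syncs.
curl -s http://127.0.0.1:2379/metrics \
  | grep -E 'etcd_server_proposals_committed_total|etcd_disk_wal_fsync_duration_seconds_count'
# Sampling twice, a minute apart, gives an approximate writes-per-minute rate.
```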

shyamjvs commented 6 years ago

+1 to what @porridge wrote above. We need more specifics about what kinds of requests are being seen.
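One hedged way to gather those specifics is to watch the keys Kubernetes writes under its /registry prefix; the endpoint and absence of TLS flags below assume the v1.8-era kubeadm defaults:

```
# Stream every write the apiserver makes to etcd. On an "idle" cluster,
# expect a steady trickle of Node status and leader-election updates.
ETCDCTL_API=3 etcdctl --endpoints=http://127.0.0.1:2379 \
  watch --prefix /registry
```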

zatricky commented 6 years ago

Apologies for the delay, @porridge. Not sure why I didn't see a notification :-/

timothysc commented 6 years ago

This isn't related to the deployment side of kubeadm; there are constant heartbeats and status updates in the system, all of which are persisted to disk. Feel free to open an issue against the etcd repo if you have further comments.
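For illustration, those heartbeats are visible as periodic updates to each Node's condition timestamps (the kubelet default is to post node status roughly every 10 seconds, and each update is an etcd write); a sketch using standard kubectl:

```
# Print each node's name and the lastHeartbeatTime of its first condition.
# Re-run after a few seconds to watch the timestamp advance.
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.conditions[0].lastHeartbeatTime}{"\n"}{end}'
```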