@liggitt any thoughts?
a few questions:
@kubernetes/sig-scalability-bugs have we seen anything like this in our scale tests?
- are you running any extension apiservers (like service catalog or metrics server)?
Yes, we are running metrics-server
- what version of etcd are you running?
etcdctl version: 3.3.8
We don't see the memory spikes on a different cluster with ~40 namespaces running the same config.
/assign @cheftako
I'm getting something similar to this on k8s 1.10.4 (etcd 3.2.22) with no pods running beyond monitoring (Prometheus, metrics-server, kube-state-metrics, influxdb, grafana, heapster).
Based on previous comments, I removed metrics-server 0.3.0, and memory consumption dropped by about 100MB, but memory usage keeps growing, just much slower.
edit: wrong graph
We're load testing our cluster as well to figure out what levers to pull to be able to support 750 nodes. However, we are seeing similar issues, with the apiserver swallowing memory along with high response latency and a high number of dropped requests ... essentially, it isn't scaling :(
Cluster config:
Attached graphs:
- Memory usage
- CPU usage
- Number of nodes, pods, and containers (note: we actually had closer to 584 nodes at the peak)
- 99th and 95th percentile response latency
- Goroutines
- Open file descriptors
Our load can be bursty since we use this cluster to run client jobs. There could be a large influx of jobs at any given point, which can lead to bursts of pods and instances into the cluster. From previous load tests we have seen that CPU increases steadily with the burst of nodes and pods added, but then comes back down once the burst has been handled. However, we have never seen this issue with memory.
Any help would be appreciated!
@tuminoid Can you share metrics output from etcd?
@gyuho etcd metrics: https://gist.github.com/tuminoid/12e5cc36d9d866379553cedd1326438f
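For anyone gathering a similar dump, here is a minimal sketch of scraping etcd's Prometheus-format /metrics endpoint over TLS and keeping only the memory- and goroutine-related gauges. The endpoint address and certificate paths are assumptions copied from the example apiserver config later in this thread, so adjust them for your environment.

package main

import (
	"bufio"
	"crypto/tls"
	"crypto/x509"
	"fmt"
	"io/ioutil"
	"log"
	"net/http"
	"strings"
)

func main() {
	// Load the etcd CA and client keypair (paths assumed from the config in this thread).
	ca, err := ioutil.ReadFile("/etc/etcd/ssl/etcd-ca.pem")
	if err != nil {
		log.Fatal(err)
	}
	pool := x509.NewCertPool()
	pool.AppendCertsFromPEM(ca)
	cert, err := tls.LoadX509KeyPair("/etc/etcd/ssl/etcd.pem", "/etc/etcd/ssl/etcd-key.pem")
	if err != nil {
		log.Fatal(err)
	}
	client := &http.Client{Transport: &http.Transport{
		TLSClientConfig: &tls.Config{RootCAs: pool, Certificates: []tls.Certificate{cert}},
	}}

	// Fetch the metrics page and print only the gauges relevant to this issue.
	resp, err := client.Get("https://192.168.10.30:2379/metrics")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()
	sc := bufio.NewScanner(resp.Body)
	for sc.Scan() {
		line := sc.Text()
		if strings.HasPrefix(line, "process_resident_memory_bytes") ||
			strings.HasPrefix(line, "go_goroutines") ||
			strings.HasPrefix(line, "go_memstats_heap_inuse_bytes") {
			fmt.Println(line)
		}
	}
}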
I left this empty system running over the weekend (without metrics-server), and it actually seems to cap around 3GB. It's a single master/worker node for testing, with no load.
On another cluster, with 3 masters, without monitoring, but with our application running, there was linear growth to 10GB over 72 hours, then it purged 8GB and started creeping up again. In both cases, the apiserver did not crash.
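The purge-and-regrow pattern above can be heap memory the Go runtime held for a while and later returned to the OS rather than a plain leak. A rough standalone sketch (not apiserver code) of reading runtime.MemStats to tell heap actually in use apart from heap the runtime has already released back to the OS:

package main

import (
	"fmt"
	"runtime"
	"time"
)

func main() {
	// Periodically print heap statistics: HeapInuse is memory holding live
	// (or recently live) objects, HeapIdle is retained but unused heap, and
	// HeapReleased is the portion of idle heap already given back to the OS.
	for {
		var m runtime.MemStats
		runtime.ReadMemStats(&m)
		fmt.Printf("HeapInuse=%dMiB HeapIdle=%dMiB HeapReleased=%dMiB Sys=%dMiB\n",
			m.HeapInuse>>20, m.HeapIdle>>20, m.HeapReleased>>20, m.Sys>>20)
		time.Sleep(30 * time.Second)
	}
}

If HeapInuse stays flat while RSS grows, the growth is retained-but-unused heap; if HeapInuse itself keeps climbing, something is genuinely holding on to objects.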
And for reference, apiserver invocation:
command:
- /usr/local/bin/kube-apiserver
- --v=0
- --logtostderr=false
- --log-dir=/var/log/kubernetes/apiserver
- --allow-privileged=true
- --delete-collection-workers=4
- --repair-malformed-updates=false
- --apiserver-count=1
- --request-timeout=1m
- --event-ttl=8h
- --profiling=false
- --advertise-address=192.168.10.30
- --bind-address=192.168.10.30
- --secure-port=6443
- --insecure-port=0
- --service-cluster-ip-range=10.254.0.0/16
- --storage-backend=etcd3
- --etcd-servers=https://192.168.10.30:2379
- --etcd-cafile=/etc/etcd/ssl/etcd-ca.pem
- --etcd-certfile=/etc/etcd/ssl/etcd.pem
- --etcd-keyfile=/etc/etcd/ssl/etcd-key.pem
- --enable-admission-plugins=NamespaceLifecycle,LimitRanger,SecurityContextDeny,ServiceAccount,NodeRestriction,DefaultStorageClass,ResourceQuota,DefaultTolerationSeconds,AlwaysPullImages,DenyEscalatingExec
- --disable-admission-plugins=PersistentVolumeLabel
- --authorization-mode=RBAC,Node
- --service-account-key-file=/etc/kubernetes/ssl/serviceaccount.pem
- --service-account-lookup=true
- --client-ca-file=/etc/kubernetes/ssl/ca.pem
- --tls-cert-file=/etc/kubernetes/ssl/apiserver.pem
- --tls-private-key-file=/etc/kubernetes/ssl/apiserver-key.pem
- --kubelet-https=true
- --kubelet-certificate-authority=/etc/kubernetes/ssl/ca.pem
- --kubelet-client-certificate=/etc/kubernetes/ssl/apiserver.pem
- --kubelet-client-key=/etc/kubernetes/ssl/apiserver-key.pem
- --kubelet-timeout=15s
- --feature-gates=AdvancedAuditing=false
- --audit-log-format=legacy
- --audit-log-path=/var/log/kubernetes/audit/audit.log
- --audit-log-maxage=30
- --audit-log-maxbackup=10
- --audit-log-maxsize=100
- --requestheader-client-ca-file=/etc/kubernetes/ssl/ca.pem
- --requestheader-allowed-names=aggregator
- --requestheader-extra-headers-prefix=X-Remote-Extra-
- --requestheader-group-headers=X-Remote-Group
- --requestheader-username-headers=X-Remote-User
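Note that this config sets --profiling=false, which disables the apiserver's built-in /debug/pprof profiling endpoints; with profiling enabled, a heap profile is the most direct way to see where the memory is held. As a standalone sketch of what those standard Go pprof endpoints look like (plain net/http/pprof, not the apiserver's actual wiring):

package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* handlers on the default mux
)

func main() {
	// Serve the standard Go pprof endpoints on a local port. A heap profile can
	// then be inspected with:
	//   go tool pprof http://127.0.0.1:6060/debug/pprof/heap
	log.Fatal(http.ListenAndServe("127.0.0.1:6060", nil))
}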
We hit this same issue. I've been investigating it for a while now, and it seems that this is coming from apimachinery.
The following shows process RSS memory usage for one of our apiservers over the last two days:
The two on the left are upstream v1.11.1, the red spiky one is my experiment with a forced debug.FreeOSMemory() call once a minute, and the rightmost is v1.11.1 with this apimachinery patch.
We tried various v1.11.x releases earlier, but this problem appeared in all of them. Now, with this apimachinery patch, we see only a very gradual increase in memory usage, and so far I'm unable to clearly point out where that is coming from. The difference to the release is still big, so it seems that this patch is crucial for v1.11.x.
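For reference, a minimal sketch of the kind of once-a-minute forced release described above; this is an illustrative standalone loop, not the actual experiment or the apimachinery patch:

package main

import (
	"runtime/debug"
	"time"
)

func main() {
	// Once a minute, force a GC and ask the runtime to return as much freed
	// memory to the OS as possible.
	ticker := time.NewTicker(time.Minute)
	defer ticker.Stop()
	for range ticker.C {
		debug.FreeOSMemory()
	}
}

If RSS drops sharply after each call, as in the red graph, the memory was retained-but-unused heap rather than live objects, which points at allocation churn more than a classic leak.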
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale
/remove-lifecycle stale
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale
We can close this, right?
/close
@dims: Closing this issue.
Is this a BUG REPORT or FEATURE REQUEST?: /kind bug /sig api-machinery
What happened: After the v1.11 upgrade, API server memory started steadily increasing, eventually consuming all available resources. At the same time we observed goroutine and open fd counts going up.
Graphs: https://www.dropbox.com/sh/9fla8up2t70b80d/AAAnAmDydT5gF0OMz9rIfsvBa?dl=0
What you expected to happen: The API server to release chunks of memory after use. Possible memory leak in the API server process.
How to reproduce it:
Anything else we need to know?: The cluster has ~45k namespaces
Environment:
- Kubernetes version (kubectl version): Server Version: version.Info{Major:"1", Minor:"11", GitVersion:"v1.11.1", GitCommit:"b1b29978270dc22fecc592ac55d903350454310a", GitTreeState:"clean", BuildDate:"2018-07-17T18:43:26Z", GoVersion:"go1.10.3", Compiler:"gc", Platform:"linux/amd64"}
- Cloud provider: AWS
- OS: CoreOS-stable-1800.7.0
- Kernel (uname -a): Linux ip-10-155-64-206.ec2.internal 4.14.63-coreos #1 SMP Wed Aug 15 22:26:16 UTC 2018 x86_64 Intel(R) Xeon(R) Platinum 8175M CPU @ 2.50GHz GenuineIntel GNU/Linux
- Install tools: Terraform