Closed relyt0925 closed 2 years ago
Management cluster master components were on dedicated 16-core/64 GB machines, with three masters running, one per AZ. etcd was also on a dedicated 16-core/64 GB machine, with one instance running in each AZ.
The current ratio we have in production is 116 nodes per zone * 3 zones = 348 nodes (dedicated 16-core/64 GB machines) serving 440 clusters (cluster components have instances in multiple AZs), so that is about 1.26 clusters per node on average.
Currently at 54 hosted clusters
The growing lines are the kube-apiserver containers; the rest are all the other containers in the system.
However, things do appear to be stabilizing, and if anything the larger jumps correspond not to more hosted clusters but to adding more worker nodes into the cluster to hold the hosted control plane components.
An example of this is shown below:
That period of increase actually corresponds to when I scaled the cluster from a total of 120 nodes to 225 nodes. The rest of the graph is when I scaled from 82 to 175 hosted clusters; in that time I see no significant upward trend in API server memory utilization.
/assign @relyt0925
pprofresults_30plusGB_afterrestart.zip
^ these are all pprof results from a kube-apiserver running at 30+GB after a restart
^ The large spikes show when the restarts of the kube-apiservers occur. The largest spikes you see are for the following resources:
endpoints
configmaps
secrets
which makes sense, as these are all kubelet requests re-downloading their watched state after a reconnect.
Secrets then spikes dramatically (a rate of 1.4K watch requests per second, averaged over a minute), which seems very high.
After that initial flux, things seem to stabilize:
Nothing major on the controller side of things during this time:
Dropping --default-watch-cache-size to 33 significantly helps the load (default is 100)
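For reference, on OpenShift this flag can be set through the kube-apiserver operator's unsupportedConfigOverrides. The sketch below is only an illustration of the mechanism (the value 33 comes from the experiment above); unsupportedConfigOverrides is, as the name says, unsupported:

```yaml
# Illustrative only: unsupportedConfigOverrides bypasses normal operator management.
apiVersion: operator.openshift.io/v1
kind: KubeAPIServer
metadata:
  name: cluster
spec:
  unsupportedConfigOverrides:
    apiServerArguments:
      default-watch-cache-size:
        - "33"
```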
@sttts @deads2k any thoughts on optimal tuning of api-server to handle a thundering herd of re-establishing watch connections?
We added a patch in 4.8 to address the infinite concurrency of watchers in prior releases. We noticed it for clusters with many kubelets and pods trying to rewatch lots of secrets. The change charges watchers during the establishment of their watch connections, so it improves cases where many watchers all attempt to watch at the same time. It does not improve cases of wide fanout when an update to a single secret is observed by a large number of watchers.
The PR (linked to the upstream that landed in 1.22) is https://github.com/openshift/kubernetes/pull/773.
@tkashem if you have more comments.
thank you for this @deads2k !
any thoughts on optimal tuning of api-server to handle a thundering herd of re-establishing watch connections?
in addition to openshift/kubernetes#773, on the management cluster you can add a new priority & fairness rule to set a concurrency limit for these watch requests (watch initialization only).
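A rough sketch of what such a rule could look like is below. The API version matches 1.22-era clusters; the object names, concurrency shares, and queue sizes are illustrative assumptions, not tuned values. Since watch requests are charged during establishment (per the patch above), a dedicated priority level effectively caps how many kubelet watches can be (re)established concurrently:

```yaml
# Assumed names and numbers; tune against real load.
apiVersion: flowcontrol.apiserver.k8s.io/v1beta1
kind: PriorityLevelConfiguration
metadata:
  name: watch-initialization
spec:
  type: Limited
  limited:
    assuredConcurrencyShares: 10   # illustrative value
    limitResponse:
      type: Queue
      queuing:
        queues: 64
        queueLengthLimit: 50
        handSize: 6
---
apiVersion: flowcontrol.apiserver.k8s.io/v1beta1
kind: FlowSchema
metadata:
  name: kubelet-watches
spec:
  priorityLevelConfiguration:
    name: watch-initialization
  distinguisherMethod:
    type: ByUser
  matchingPrecedence: 1000
  rules:
    - subjects:
        - kind: Group
          group:
            name: system:nodes
      resourceRules:
        - verbs: ["watch"]
          apiGroups: [""]
          resources: ["secrets", "configmaps"]
          namespaces: ["*"]
```

Matching on the system:nodes group scopes the limit to kubelet traffic, so controller and user watches keep their normal priority levels.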
Also, the default values for max-requests-inflight and max-mutating-requests-inflight are too high for the management cluster. Since the management cluster is managed by us, I would recommend lowering these values as well (depending on the number of CPU cores the master nodes have):
https://github.com/openshift/cluster-kube-apiserver-operator/blob/master/bindata/assets/config/defaultconfig.yaml#L117-L121
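For illustration, an override would mirror the apiServerArguments shape in that file. The numbers below are placeholders to be sized against the masters' CPU count, not recommendations:

```yaml
apiServerArguments:
  max-requests-inflight:
    - "1200"   # placeholder: size against master CPU cores
  max-mutating-requests-inflight:
    - "600"    # placeholder
```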
Issues go stale after 90d of inactivity.
Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.
If this issue is safe to close now please do so with /close.
/lifecycle stale
/remove-lifecycle stale
Running performance tests with the updates as well.
https://github.com/openshift/multus-admission-controller/issues/40
We've also raised the above issue, which highlights instabilities following kube-apiserver restarts that are triggered by the multus admission controller running on each node in the management cluster.
/lifecycle stale
Stale issues rot after 30d of inactivity.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.
If this issue is safe to close now please do so with /close.
/lifecycle rotten
/remove-lifecycle stale
Rotten issues close after 30d of inactivity.
Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.
/close
@openshift-bot: Closing this issue.
After running with 440 clusters for a couple of days, our perf squad noticed that the API server of the management cluster is repeatedly getting OOMKilled: