openshift / hypershift

Hyperscale OpenShift - clusters with hosted control planes
https://hypershift-docs.netlify.app
Apache License 2.0

GA Operations: (Scalability) Optimize load induced on management cluster API Server #306

Closed: relyt0925 closed this issue 2 years ago

relyt0925 commented 3 years ago

After running with 440 hosted clusters for a couple of days, our perf squad noticed that the management cluster's API server is repeatedly getting OOMKilled.
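
For reference, a rough way to confirm the OOM kills (a sketch only, assuming oc access to the management cluster and the usual openshift-kube-apiserver pod labels):

```sh
# Print each kube-apiserver pod together with the last termination reason of its
# kube-apiserver container; "OOMKilled" indicates the kernel killed it for memory.
oc get pods -n openshift-kube-apiserver -l app=openshift-kube-apiserver \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[?(@.name=="kube-apiserver")].lastState.terminated.reason}{"\n"}{end}'
```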

relyt0925 commented 3 years ago

Management cluster master components were on dedicated 16-core, 64 GB machines, three masters in total (one per AZ). Etcd was also on dedicated 16-core, 64 GB machines, with one instance running in each AZ.

relyt0925 commented 3 years ago

The current ratio we have in production is 116 nodes per zone × 3 zones = 348 nodes (dedicated 16-core, 64 GB machines) hosting 440 clusters (cluster components have instances in multiple AZs).

So that is an average density of about 1.26 clusters per node (440 / 348).

relyt0925 commented 3 years ago

Currently at 54 hosted clusters: [Screenshot: 2021-07-30, 8:00:43 AM]

relyt0925 commented 3 years ago

The growing lines are the kube-apiserver containers; the rest are all the other containers in the system.

relyt0925 commented 3 years ago

However, things do appear to be stabilizing, and if anything the larger jumps correspond not to adding more hosted clusters but to adding more worker nodes into the management cluster to hold the hosted control plane components.

An example of this is shown below: [Screenshot: 2021-08-01, 5:15:37 PM]

relyt0925 commented 3 years ago

That period of increase actually corresponds to when I scaled the cluster from a total of 120 nodes to 225 nodes. The rest of the graph covers scaling from 82 to 175 HostedClusters; over that time I see no significant upward trend in API server memory utilization.

relyt0925 commented 3 years ago

/assign @relyt0925

relyt0925 commented 3 years ago

pprofresults_30plusGB_afterrestart.zip

^ these are all pprof results from a kube-apiserver running at 30+GB after a restart
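
In case it helps anyone reproduce this, a minimal sketch of how such profiles can be collected (assumes cluster-admin access and that API server profiling is enabled, which is the default):

```sh
# Grab a heap profile from whichever kube-apiserver instance serves the request,
# then summarize the top allocators locally (needs the Go toolchain).
oc get --raw /debug/pprof/heap > kube-apiserver-heap.pprof
go tool pprof -top kube-apiserver-heap.pprof
```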

relyt0925 commented 3 years ago

[Screenshot: 2021-08-05, 2:01:50 PM]

relyt0925 commented 3 years ago

^ The large spikes show when the kube-apiserver restarts occur. The largest spikes you see are for the following resources: endpoints, configmaps, and secrets.

This makes sense, as these appear to all be kubelet requests re-downloading that information after reconnecting.
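
A query along these lines gives a similar per-resource view of the watch traffic (a sketch; it assumes the standard apiserver_request_total metric is being scraped from the management cluster's kube-apiservers):

```promql
# Watch request rate by resource, averaged over the last minute (sketch query).
sum by (resource) (rate(apiserver_request_total{verb="WATCH"}[1m]))
```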

relyt0925 commented 3 years ago

[Screenshots: 2021-08-05, 2:03:47 PM through 2:04:45 PM]

relyt0925 commented 3 years ago

Secrets then spike extremely high (a rate of roughly 1.4K watch requests per second over a minute's window), which seems really large.

relyt0925 commented 3 years ago

After that flux, things seem to stabilize: [Screenshot: 2021-08-05, 2:07:07 PM]

relyt0925 commented 3 years ago

Nothing major on the controller side of things during this time: [Screenshot: 2021-08-05, 2:11:12 PM]

relyt0925 commented 3 years ago

Dropping --default-watch-cache-size to 33 significantly helps the load (the default is 100).
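
For context, a sketch of one way that flag can be applied on an OpenShift management cluster, via the kube-apiserver operator's unsupportedConfigOverrides (unsupported, as the name says; 33 is just the value used in this test):

```yaml
# Sketch only: sets --default-watch-cache-size on the cluster's kube-apiservers.
apiVersion: operator.openshift.io/v1
kind: KubeAPIServer
metadata:
  name: cluster
spec:
  unsupportedConfigOverrides:
    apiServerArguments:
      default-watch-cache-size:
      - "33"
```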

relyt0925 commented 3 years ago

[Screenshot: 2021-08-05, 10:49:41 PM]

derekwaynecarr commented 3 years ago

@sttts @deads2k any thoughts on optimal tuning of api-server to handle a thundering herd of re-establishing watch connections?

deads2k commented 3 years ago

> @sttts @deads2k any thoughts on optimal tuning of api-server to handle a thundering herd of re-establishing watch connections?

We added a patch in 4.8 to address the infinite concurrency of watchers in prior releases. We noticed it for clusters with many kubelets and pods trying to rewatch lots of secrets. The change charges watchers during the establishment of their watch connections, so it improves cases where many watchers all attempt to watch at the same time. It does not improve cases of wide fanout when an update to a single secret is observed by a large number of watchers.

The PR (linked to the upstream that landed in 1.22) is https://github.com/openshift/kubernetes/pull/773.

@tkashem if you have more comments.

relyt0925 commented 3 years ago

thank you for this @deads2k !

tkashem commented 3 years ago

> any thoughts on optimal tuning of api-server to handle a thundering herd of re-establishing watch connections?

In addition to openshift/kubernetes#773, on the management cluster you can add a new Priority & Fairness rule to set a concurrency limit for these watch requests (this applies to watch initialization only).
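
A hypothetical shape for such a rule (sketch only: the names and limits below are made up, and the flowcontrol API version depends on the release, v1beta1 on 4.8-era clusters), steering kubelet watch requests for secrets and configmaps into their own limited priority level:

```yaml
apiVersion: flowcontrol.apiserver.k8s.io/v1beta1
kind: PriorityLevelConfiguration
metadata:
  name: kubelet-watch-init            # hypothetical name
spec:
  type: Limited
  limited:
    assuredConcurrencyShares: 10      # illustrative concurrency limit
    limitResponse:
      type: Queue
      queuing:
        queues: 64
        handSize: 6
        queueLengthLimit: 50
---
apiVersion: flowcontrol.apiserver.k8s.io/v1beta1
kind: FlowSchema
metadata:
  name: kubelet-secret-configmap-watches   # hypothetical name
spec:
  priorityLevelConfiguration:
    name: kubelet-watch-init
  matchingPrecedence: 1000
  distinguisherMethod:
    type: ByUser
  rules:
  - subjects:
    - kind: Group
      group:
        name: system:nodes            # kubelets
    resourceRules:
    - verbs: ["watch"]
      apiGroups: [""]
      resources: ["secrets", "configmaps"]
      namespaces: ["*"]
```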

Also, the default values for max-requests-inflight and max-mutating-requests-inflight are too high for the management cluster. Since the management cluster is managed by us, I would recommend lowering these values as well (depending on the number of CPU cores the master nodes have): https://github.com/openshift/cluster-kube-apiserver-operator/blob/master/bindata/assets/config/defaultconfig.yaml#L117-L121
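
The same unsupportedConfigOverrides mechanism sketched earlier could carry these arguments as well; the numbers below are placeholders rather than values recommended in this thread (the linked defaultconfig.yaml shows the shipped defaults):

```yaml
apiVersion: operator.openshift.io/v1
kind: KubeAPIServer
metadata:
  name: cluster
spec:
  unsupportedConfigOverrides:
    apiServerArguments:
      max-requests-inflight:
      - "1500"    # placeholder; size to the master nodes' core count
      max-mutating-requests-inflight:
      - "500"     # placeholder
```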

openshift-bot commented 2 years ago

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close. Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

relyt0925 commented 2 years ago

/remove-lifecycle stale

relyt0925 commented 2 years ago

Running performance tests with the updates as well...

rgschofield commented 2 years ago

https://github.com/openshift/multus-admission-controller/issues/40

We've also raised the issue above, which highlights instabilities following kube-apiserver restarts, triggered by the multus admission controller running on each node in the management cluster.

[Screenshot: kas-restarts]

openshift-bot commented 2 years ago

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close. Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

openshift-bot commented 2 years ago

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten. Rotten issues close after an additional 30d of inactivity. Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten /remove-lifecycle stale

openshift-bot commented 2 years ago

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen. Mark the issue as fresh by commenting /remove-lifecycle rotten. Exclude this issue from closing again by commenting /lifecycle frozen.

/close

openshift-ci[bot] commented 2 years ago

@openshift-bot: Closing this issue.

In response to [this](https://github.com/openshift/hypershift/issues/306#issuecomment-1125289866):

> Rotten issues close after 30d of inactivity.
>
> Reopen the issue by commenting `/reopen`.
> Mark the issue as fresh by commenting `/remove-lifecycle rotten`.
> Exclude this issue from closing again by commenting `/lifecycle frozen`.
>
> /close

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.