project-codeflare / multi-cluster-app-dispatcher

Holistic job manager on Kubernetes
Apache License 2.0

[perf] MCAD constantly throttled #434

Open kpouget opened 1 year ago

kpouget commented 1 year ago

When looking at the MCAD logs, I see that it is constantly being throttled, and it seems to be requesting all the CRDs available in the cluster:

I0626 13:23:58.178716       1 request.go:591] Throttling request took 537.935392ms, request: GET:https://172.30.0.1:443/apis/monitoring.coreos.com/v1alpha1?timeout=32s
I0626 13:23:58.189164       1 request.go:591] Throttling request took 548.378517ms, request: GET:https://172.30.0.1:443/apis/operator.openshift.io/v1alpha1?timeout=32s
I0626 13:23:58.198620       1 request.go:591] Throttling request took 557.834296ms, request: GET:https://172.30.0.1:443/apis/scheduling.k8s.io/v1?timeout=32s
I0626 13:23:58.209032       1 request.go:591] Throttling request took 568.24236ms, request: GET:https://172.30.0.1:443/apis/imageregistry.operator.openshift.io/v1?timeout=32s
I0626 13:23:58.218479       1 request.go:591] Throttling request took 577.692789ms, request: GET:https://172.30.0.1:443/apis/serving.kserve.io/v1alpha1?timeout=32s
I0626 13:23:58.229013       1 request.go:591] Throttling request took 588.223945ms, request: GET:https://172.30.0.1:443/apis/network.openshift.io/v1?timeout=32s
I0626 13:23:58.238502       1 request.go:591] Throttling request took 597.708975ms, request: GET:https://172.30.0.1:443/apis/autoscaling.openshift.io/v1beta1?timeout=32s
I0626 13:23:58.249066       1 request.go:591] Throttling request took 608.272682ms, request: GET:https://172.30.0.1:443/apis/coordination.k8s.io/v1?timeout=32s
I0626 13:23:58.258451       1 request.go:591] Throttling request took 617.672599ms, request: GET:https://172.30.0.1:443/apis/helm.openshift.io/v1beta1?timeout=32s
I0626 13:23:58.268227       1 request.go:591] Throttling request took 627.428484ms, request: GET:https://172.30.0.1:443/apis/cloud.network.openshift.io/v1?timeout=32s
I0626 13:23:58.278857       1 request.go:591] Throttling request took 638.056671ms, request: GET:https://172.30.0.1:443/apis/node.k8s.io/v1?timeout=32s
I0626 13:23:58.288982       1 request.go:591] Throttling request took 648.181891ms, request: GET:https://172.30.0.1:443/apis/kubeflow.org/v1beta1?timeout=32s
I0626 13:23:58.298386       1 request.go:591] Throttling request took 657.590819ms, request: GET:https://172.30.0.1:443/apis/network.operator.openshift.io/v1?timeout=32s
I0626 13:23:58.308934       1 request.go:591] Throttling request took 668.127925ms, request: GET:https://172.30.0.1:443/apis/discovery.k8s.io/v1?timeout=32s
I0626 13:23:58.318269       1 request.go:591] Throttling request took 677.466823ms, request: GET:https://172.30.0.1:443/apis/cloudcredential.openshift.io/v1?timeout=32s
I0626 13:23:58.328651       1 request.go:591] Throttling request took 687.847126ms, request: GET:https://172.30.0.1:443/apis/operators.coreos.com/v2?timeout=32s
I0626 13:23:58.338012       1 request.go:591] Throttling request took 697.203084ms, request: GET:https://172.30.0.1:443/apis/flowcontrol.apiserver.k8s.io/v1beta2?timeout=32s
I0626 13:23:58.348316       1 request.go:591] Throttling request took 707.516076ms, request: GET:https://172.30.0.1:443/apis/performance.openshift.io/v2?timeout=32s
I0626 13:23:58.358629       1 request.go:591] Throttling request took 717.817759ms, request: GET:https://172.30.0.1:443/apis/operators.coreos.com/v1?timeout=32s
I0626 13:23:58.368962       1 request.go:591] Throttling request took 728.145512ms, request: GET:https://172.30.0.1:443/apis/flowcontrol.apiserver.k8s.io/v1beta1?timeout=32s
I0626 13:23:58.378359       1 request.go:591] Throttling request took 737.54763ms, request: GET:https://172.30.0.1:443/apis/migration.k8s.io/v1alpha1?timeout=32s
I0626 13:23:58.388688       1 request.go:591] Throttling request took 747.866993ms, request: GET:https://172.30.0.1:443/apis/config.openshift.io/v1?timeout=32s
I0626 13:23:58.398111       1 request.go:591] Throttling request took 757.287052ms, request: GET:https://172.30.0.1:443/apis/kfdef.apps.kubeflow.org/v1?timeout=32s
I0626 13:23:58.408403       1 request.go:591] Throttling request took 767.582294ms, request: GET:https://172.30.0.1:443/apis/machine.openshift.io/v1?timeout=32s
I0626 13:23:58.418845       1 request.go:591] Throttling request took 778.011698ms, request: GET:https://172.30.0.1:443/apis/apps.openshift.io/v1?timeout=32s
I0626 13:23:58.428329       1 request.go:591] Throttling request took 787.499978ms, request: GET:https://172.30.0.1:443/apis/machine.openshift.io/v1beta1?timeout=32s
I0626 13:23:58.438809       1 request.go:591] Throttling request took 797.974212ms, request: GET:https://172.30.0.1:443/apis/authorization.openshift.io/v1?timeout=32s
I0626 13:23:58.448058       1 request.go:591] Throttling request took 807.243089ms, request: GET:https://172.30.0.1:443/apis/kubeflow.org/v1alpha1?timeout=32s
I0626 13:23:58.458594       1 request.go:591] Throttling request took 817.752805ms, request: GET:https://172.30.0.1:443/apis/build.openshift.io/v1?timeout=32s
I0626 13:23:58.467979       1 request.go:591] Throttling request took 827.137372ms, request: GET:https://172.30.0.1:443/apis/oauth.openshift.io/v1?timeout=32s
I0626 13:23:58.478054       1 request.go:591] Throttling request took 837.214141ms, request: GET:https://172.30.0.1:443/apis/performance.openshift.io/v1?timeout=32s
I0626 13:23:58.488482       1 request.go:591] Throttling request took 847.641595ms, request: GET:https://172.30.0.1:443/apis/project.openshift.io/v1?timeout=32s
I0626 13:23:58.498475       1 request.go:591] Throttling request took 857.702755ms, request: GET:https://172.30.0.1:443/apis/codeflare.codeflare.dev/v1alpha1?timeout=32s
I0626 13:23:58.507913       1 request.go:591] Throttling request took 867.086812ms, request: GET:https://172.30.0.1:443/apis/kubeflow.org/v1?timeout=32s

This cannot be good for performance.
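
For context, these "Throttling request took ..." messages come from client-go's client-side rate limiter, whose defaults (QPS 5, burst 10) are quickly exhausted when every API group/version is listed. A minimal sketch, assuming a standard client-go rest.Config (the values below are illustrative, not MCAD's actual settings):

package main

import (
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// newClientset raises the client-side rate limits before building the clientset.
// QPS and Burst here are illustrative, not a recommendation for MCAD.
func newClientset(cfg *rest.Config) (*kubernetes.Clientset, error) {
	cfg.QPS = 50    // sustained requests per second before client-side throttling kicks in
	cfg.Burst = 100 // short bursts allowed above QPS
	return kubernetes.NewForConfig(cfg)
}

Raising the limits only hides the symptom, though; the number of discovery calls is the underlying issue.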

asm582 commented 1 year ago

Agreed, we need to remove unused controllers in MCAD. Here is a PR that we can revive: https://github.com/project-codeflare/multi-cluster-app-dispatcher/pull/277

asm582 commented 1 year ago

@astefanutti MCAD on the main branch already has some remediations for this problem; can you recommend what more could be improved?

astefanutti commented 1 year ago

@asm582 these messages seem to be caused by excessive usage of the discovery API, whose requests are being client-side throttled, even though the QPS and maximum burst limits have already been increased.

We could speculatively close this, given the large refactoring that has happened lately, but a quick search in the code points to the genericresource.go file, which calls the discovery API to map a generic resource's GVK to the corresponding API resource.
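
For illustration only (this is not the actual genericresource.go code), an uncached GVK lookup through the discovery API typically looks like the sketch below; restmapper.GetAPIGroupResources issues one GET per served group/version, which matches the request pattern in the log above:

package main

import (
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/discovery"
	"k8s.io/client-go/restmapper"
)

// resolveUncached maps a GVK to its API resource by re-enumerating the whole
// discovery API on every call.
func resolveUncached(dc discovery.DiscoveryInterface, gvk schema.GroupVersionKind) (schema.GroupVersionResource, error) {
	groupResources, err := restmapper.GetAPIGroupResources(dc) // one GET per served group/version
	if err != nil {
		return schema.GroupVersionResource{}, err
	}
	mapper := restmapper.NewDiscoveryRESTMapper(groupResources)
	mapping, err := mapper.RESTMapping(gvk.GroupKind(), gvk.Version)
	if err != nil {
		return schema.GroupVersionResource{}, err
	}
	return mapping.Resource, nil
}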

My suggestion would be to look at it more closely and consider putting a caching or rate-limiting mechanism in place for consuming the discovery API; a sketch of the caching approach follows.
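
A minimal sketch of the caching idea, assuming standard client-go building blocks (a memory-backed cached discovery client plus a deferred REST mapper), so repeated GVK lookups reuse cached group/version data instead of re-listing every API group:

package main

import (
	"k8s.io/apimachinery/pkg/api/meta"
	"k8s.io/client-go/discovery"
	cacheddiscovery "k8s.io/client-go/discovery/cached/memory"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/restmapper"
)

// newCachedMapper builds a RESTMapper that only hits the discovery API on
// cache misses, instead of on every GVK-to-resource lookup.
func newCachedMapper(cfg *rest.Config) (meta.RESTMapper, error) {
	dc, err := discovery.NewDiscoveryClientForConfig(cfg)
	if err != nil {
		return nil, err
	}
	cached := cacheddiscovery.NewMemCacheClient(dc) // caches discovery responses in memory
	return restmapper.NewDeferredDiscoveryRESTMapper(cached), nil
}

The deferred mapper can be Reset() periodically, or when a lookup misses, so that newly installed CRDs are still picked up.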