kyma-project / kyma

Kyma is an opinionated set of Kubernetes-based modular building blocks, including all necessary capabilities to develop and run enterprise-grade cloud-native applications.
https://kyma-project.io
Apache License 2.0

Optimize metrics footprint of prometheus #13258

Closed a-thaler closed 2 years ago

a-thaler commented 2 years ago

Description

The current Prometheus setup is non-scalable and has limits. These limits need to be determined and documented well, see also https://github.com/kyma-project/kyma/issues/10033.

However, we can already see that the base metrics footprint is far too big. Collecting that many time series as a baseline is too expensive and stresses the current setup unnecessarily.

A quote from somewhere: Typically, a 3-nodes Gardener cluster with Prometheus-Operator will generate ~40K active series. Depending on use cases, you might not actually use all of those metrics; for example, by only sending the series used in the default provided K8S dashboards, you'll be using ~8K active series instead of ~40K active series.

Goal

Reduce the footprint to a minimum by dropping all metrics not actively used in dashboards, rules or alerts.

Actions

After setting up a Kyma cluster and executing the tests, the query prometheus_tsdb_head_series already returns 80,000 series.

The following query shows how many metrics each job creates, although that does not correlate 1:1 with the number of time series:

count by (job) ({__name__=~".+"})

Here are the top 10 results:

scrape job metrics (~60,000)
apiserver 32480
istio-system/envoy-stats/0 14197
kubelet 7824
node-exporter 5012
kube-state-metrics 4184
ory-oathkeeper-maester-metrics 1577
eventing-controller-metrics 1242
logging-loki-headless 1212
addon-controller-metrics 1038
tracing-jaeger-metrics 870
848
api-gateway-metrics 833
ory-hydra-maester-metrics 823
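
A list like the one above could be reproduced with a small script against the Prometheus HTTP API. This is a minimal sketch, not the tooling used for this issue: the address http://localhost:9090 (for example after port-forwarding the Prometheus service) and the helper names `top_jobs`, `fetch` and `main` are assumptions. Series without a job label are reported under an empty job name, which would explain a row with a count but no name.

```python
import json
import urllib.parse
import urllib.request

# PromQL from this issue: number of series per scrape job.
QUERY = 'count by (job) ({__name__=~".+"})'


def top_jobs(response, limit=10):
    """Return (job, series_count) pairs from an instant-query response, largest first."""
    rows = []
    for sample in response["data"]["result"]:
        # Series without a job label end up under an empty job name.
        job = sample["metric"].get("job", "")
        # An instant-vector sample is [timestamp, "<value as string>"].
        rows.append((job, int(float(sample["value"][1]))))
    rows.sort(key=lambda row: row[1], reverse=True)
    return rows[:limit]


def fetch(base_url, query=QUERY):
    """Run an instant query against the Prometheus HTTP API (/api/v1/query)."""
    url = base_url + "/api/v1/query?" + urllib.parse.urlencode({"query": query})
    with urllib.request.urlopen(url, timeout=30) as resp:
        return json.load(resp)


def main(base_url="http://localhost:9090"):
    # base_url is an assumed address; adjust it to your cluster.
    for job, count in top_jobs(fetch(base_url)):
        print(f"{job or '(no job label)'}\t{count}")
```

Calling `main()` prints one line per job. Keeping the parsing in `top_jobs` separate from the HTTP call makes it possible to check the sorting against a saved response without a running cluster.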

Updated list after completing the actions below:

scrape job metrics (32,000)
kubelet 7691
istio-system/envoy-stats/0 6492
node-exporter 4802
apiserver 3669
kube-state-metrics 2629
logging-loki-headless 1304
helm-broker-etcd-stateful-client 778
serverless-controller-manager 630
507
monitoring-prometheus-istio-server 483
eventing-controller-metrics 293
monitoring-alertmanager 286
pilot 282
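
To see where the reduction came from, the two lists can be compared per job. The sketch below only uses the counts quoted in this issue and only covers jobs that appear in both lists. The function name `reductions` is an assumption for illustration.

```python
# Series counts per scrape job, copied from the two lists in this issue.
BEFORE = {
    "apiserver": 32480,
    "istio-system/envoy-stats/0": 14197,
    "kubelet": 7824,
    "node-exporter": 5012,
    "kube-state-metrics": 4184,
    "eventing-controller-metrics": 1242,
    "logging-loki-headless": 1212,
}
AFTER = {
    "kubelet": 7691,
    "istio-system/envoy-stats/0": 6492,
    "node-exporter": 4802,
    "apiserver": 3669,
    "kube-state-metrics": 2629,
    "logging-loki-headless": 1304,
    "eventing-controller-metrics": 293,
}


def reductions(before, after):
    """Return (job, before, after, saved, saved_percent) rows, largest saving first."""
    rows = []
    # Only jobs present in both lists can be compared.
    for job in before.keys() & after.keys():
        saved = before[job] - after[job]
        percent = round(100 * saved / before[job], 1)
        rows.append((job, before[job], after[job], saved, percent))
    rows.sort(key=lambda row: row[3], reverse=True)
    return rows


for job, old, new, saved, percent in reductions(BEFORE, AFTER):
    print(f"{job}: {old} -> {new} (saved {saved}, {percent}%)")
```

A negative saving means the job exposed more series in the second measurement than in the first.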

Getting the numbers down for these top 10 jobs is crucial. Some investigation results follow:

ghost commented 2 years ago

This issue has been automatically marked as stale due to the lack of recent activity. It will soon be closed if no further activity occurs. Thank you for your contributions.

a-thaler commented 2 years ago

We will stop this first round of optimizations here. Mainly the Istio metrics still have potential for improvement.