kyma-project / kyma

Kyma is an opinionated set of Kubernetes-based modular building blocks, including all necessary capabilities to develop and run enterprise-grade cloud-native applications.
https://kyma-project.io
Apache License 2.0
1.52k stars 405 forks

Reduce Prometheus memory footprint by reduced labelling and defensive querying #7036

Closed a-thaler closed 4 years ago

a-thaler commented 4 years ago

Description: For Prometheus, label cardinality is key. Because the default monitoring in Kyma scrapes every endpoint blindly and keeps all labels on all metrics by default, the whole setup consumes too much memory to be manageable.

To manage the memory footprint successfully, endpoints must be scraped more consciously, and queries should be optimized and cut off if they behave badly.
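One way to picture "more conscious" scraping is an allowlist applied to scraped samples, similar in spirit to a `keep` rule on `__name__` in Prometheus `metric_relabel_configs`. The sketch below is a plain-Python illustration only, not Kyma's actual configuration; the metric names, the `ALLOWED` pattern, and the `keep_allowed` helper are hypothetical.

```python
import re

# Hypothetical allowlist: only metrics whose name matches this pattern are kept.
ALLOWED = re.compile(r"http_requests_total|process_resident_memory_bytes")

# Hypothetical scraped samples as (metric name, labels) pairs.
scraped = [
    ("http_requests_total", {"code": "200", "pod": "a"}),
    ("http_requests_total", {"code": "500", "pod": "a"}),
    ("go_gc_duration_seconds", {"quantile": "0.5", "pod": "a"}),
    ("go_gc_duration_seconds", {"quantile": "0.9", "pod": "a"}),
    ("process_resident_memory_bytes", {"pod": "a"}),
]

def keep_allowed(samples, pattern):
    """Return only the samples whose metric name fully matches the pattern."""
    return [(name, labels) for name, labels in samples if pattern.fullmatch(name)]

kept = keep_allowed(scraped, ALLOWED)
print(len(scraped), "series scraped,", len(kept), "series stored")
```

Everything not on the allowlist is discarded before storage, so it never occupies memory in the time series database.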

Actions:

Reasons: Reduce the memory footprint to make Kyma more lightweight and to get more predictable memory consumption over time.

Attachments

hisarbalik commented 4 years ago

Storage usage before metric relabeling: the screenshot below shows the time series being scraped before relabeling. Prometheus uses over 100K time series. (Screenshot: Screen Shot 2020-02-21 at 10 40 02)

Storage usage after metric relabeling: the screenshot below shows the time series being scraped after relabeling. Prometheus now uses around 50K time series. (Screenshot: Screen Shot 2020-02-21 at 10 40 42)

The number of time series will drop dramatically once the labels collected and attached to scraped metrics are reduced as well; the current result reflects metric relabeling only.
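To illustrate why dropping labels shrinks the number of time series, here is a minimal sketch that counts unique label sets before and after removing one label. The sample data, the `pod_template_hash` label choice, and the `count_series` helper are assumptions made for illustration; they are not taken from the Kyma configuration.

```python
def count_series(samples, dropped_labels=()):
    """Count unique time series after removing the given label names."""
    unique = set()
    for name, labels in samples:
        remaining = tuple(
            sorted((k, v) for k, v in labels.items() if k not in dropped_labels)
        )
        unique.add((name, remaining))
    return len(unique)

# Hypothetical samples: one metric with 2 status codes x 3 hash values.
samples = [
    ("http_requests_total", {"code": code, "pod_template_hash": h})
    for code in ("200", "500")
    for h in ("abc", "def", "ghi")
]

before = count_series(samples)
after = count_series(samples, dropped_labels=("pod_template_hash",))
print(before, after)
```

Each label multiplies the series count by the number of distinct values it takes, so removing a high-cardinality label has an outsized effect. In a real Prometheus setup, care is needed that series remain uniquely labeled after a label is dropped.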

Application profiling is in progress; it will show memory and CPU consumption before and after metric relabeling.

The following memory profiles show Prometheus memory usage before and after optimization.

Before optimization: Prometheus uses around 900 MB of memory. (Image: heap-old2)

After optimization: Prometheus uses around 240 MB of memory. (Image: heap2)
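As a rough check of the figures reported in this thread, the relative reductions can be computed as below. The inputs are the approximate values quoted above ("over 100K" and "around 50K" time series, "around 900" and "around 240" MB), so the results are estimates only; `reduction_percent` is a helper introduced here for illustration.

```python
def reduction_percent(before, after):
    """Relative reduction from before to after, in percent."""
    return (before - after) / before * 100

# Approximate values quoted in the comments above.
series = reduction_percent(100_000, 50_000)
memory = reduction_percent(900, 240)
print(f"time series: about {series:.0f}% fewer, memory: about {memory:.0f}% lower")
```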

hisarbalik commented 4 years ago

A dedicated Kyma cluster with the new configuration is deployed at https://grafana.mon-test.berlin.shoot.canary.k8s-hana.ondemand.com//?orgId=1 and is available for reviewing the changes.

Please check all dashboards and alert rules related to your component and make sure they still work as expected. If any metrics are missing, please let us know the component or service-monitor name so we can add them to the new configuration.