Reduce prometheus memory footprint by reduced labelling and defensive querying

a-thaler commented 4 years ago

Description: For Prometheus, label cardinality is key. In Kyma, the default monitoring setup blindly scrapes every endpoint and applies all labels to all metrics by default, so the whole setup is not manageable from a memory perspective: it is very memory-hungry.

To manage the memory footprint successfully, endpoints must be scraped more deliberately, and queries should be optimized and cut off if they misbehave.
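To illustrate why label cardinality drives memory: each distinct combination of metric name and label values is a separate time series that Prometheus has to track. The minimal Python sketch below counts series before and after dropping one high-cardinality label. The metric and label names are made up for illustration and are not Kyma's actual metrics.

```python
def count_series(samples, drop_labels=()):
    """Count distinct time series: one per unique (metric name, label set)."""
    series = set()
    for name, labels in samples:
        kept = tuple(sorted((k, v) for k, v in labels.items() if k not in drop_labels))
        series.add((name, kept))
    return len(series)


# Hypothetical samples: 2 methods x 50 pod IDs for a single metric.
samples = [
    ("http_requests_total", {"method": method, "pod_id": f"pod-{i}"})
    for method in ("GET", "POST")
    for i in range(50)
]

print(count_series(samples))                          # 100
print(count_series(samples, drop_labels={"pod_id"}))  # 2
```

Dropping a single label with 50 values collapses 100 series into 2, which is the effect the actions below aim for.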

Actions:

Prometheus scrapes only the metrics that are used by a dashboard, recording rule, alert rule, Kiali, or Jaeger.
Only the labels that are actually needed are applied to metrics; in particular, check the default labelling introduced by the prometheus-operator.
Ensure that overly complex queries are stopped at some point.
Check whether any existing queries are complex enough to hit the new limit, and improve them.
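The first two actions amount to an allow-list for metric names plus removal of unneeded labels. In Prometheus this is normally expressed through metric relabeling in the scrape configuration (the `keep` and `labeldrop` actions); the Python sketch below only simulates that behaviour to show the intended effect. The metric names, label names, and regular expressions are hypothetical.

```python
import re

# Hypothetical allow-list: only metrics used by a dashboard, rule, or alert.
KEEP_METRICS = re.compile(r"http_requests_total|process_resident_memory_bytes")
# Hypothetical labels that no query needs.
DROP_LABELS = re.compile(r"pod_template_hash|controller_revision_hash")


def relabel(samples):
    """Keep allow-listed metrics and strip unneeded labels.

    Regexes are fully anchored (re.fullmatch), as in Prometheus relabeling.
    """
    result = []
    for name, labels in samples:
        if not KEEP_METRICS.fullmatch(name):
            continue  # metric is not used anywhere, so it is dropped
        kept = {k: v for k, v in labels.items() if not DROP_LABELS.fullmatch(k)}
        result.append((name, kept))
    return result


scraped = [
    ("http_requests_total", {"method": "GET", "pod_template_hash": "abc123"}),
    ("go_gc_duration_seconds", {"quantile": "0.5"}),
]
print(relabel(scraped))
# [('http_requests_total', {'method': 'GET'})]
```

For the query-related actions, Prometheus server flags such as `--query.timeout` and `--query.max-samples` are the usual way to bound query cost; suitable values for Kyma are not stated in this issue.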

Reasons: Reduce the memory footprint to make Kyma more lightweight and to make memory consumption more predictable over time.

Attachments

hisarbalik commented 4 years ago

Storage usage before metric relabeling: the screenshot shows the time series being scraped before relabeling. Prometheus uses over 100K time series. [Screenshot: Screen Shot 2020-02-21 at 10 40 02]

Storage usage after metric relabeling: the screenshot shows the time series being scraped after relabeling. Prometheus now uses around 50K time series. [Screenshot: Screen Shot 2020-02-21 at 10 40 42]

The number of time series will drop dramatically once the labels collected and attached to scraped metrics are also reduced; the current result reflects metric relabeling only.

Application profiling is in progress; it will show memory and CPU consumption before and after metric relabeling.

The following memory profiles show Prometheus memory usage before and after optimization.

Before optimization: Prometheus uses around 900 MB of memory. [Image: heap-old2]

After optimization: Prometheus uses around 240 MB of memory. [Image: heap2]
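As a quick sanity check on the figures above, the relative reductions can be computed as follows. This is only arithmetic on the approximate numbers quoted in this comment ("over 100K" and "around 50K" series, "around 900" and "around 240" MB), so the results are rough.

```python
def reduction_percent(before, after):
    """Relative reduction from `before` to `after`, in percent."""
    return (before - after) / before * 100


# Approximate figures quoted in this comment.
print(f"time series: {reduction_percent(100_000, 50_000):.0f}%")  # time series: 50%
print(f"heap memory: {reduction_percent(900, 240):.0f}%")         # heap memory: 73%
```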

hisarbalik commented 4 years ago

A dedicated Kyma cluster with the new configuration is deployed at https://grafana.mon-test.berlin.shoot.canary.k8s-hana.ondemand.com//?orgId=1 and is available for reviewing the changes.

Please check all dashboards and alert rules related to your component and make sure they still work as expected. If any metrics are missing, please let us know the component/service-monitor name so that we can add them to the new configuration.

kyma-project / kyma

Reduce prometheus memory footprint by reduced labelling and defensive querying #7036