Closed a-thaler closed 2 years ago
This issue has been automatically marked as stale due to the lack of recent activity. It will soon be closed if no further activity occurs. Thank you for your contributions.
We will stop after this first round of optimizations; mainly the Istio metrics still have potential for improvement.
Description
The current Prometheus setup is non-scalable and has limits; these limits need to be determined and documented well, see also https://github.com/kyma-project/kyma/issues/10033.
However, we can already see that the base metrics footprint is far too big. Collecting such an amount of timeseries as a baseline is too expensive and needlessly stresses the current setup.
A quote from somewhere:
Typically, a 3-node Gardener cluster with Prometheus-Operator will generate ~40K active series. Depending on use cases, you might not actually use all of those metrics; for example, by only sending the series used in the default provided K8S dashboards, you'll be using ~8K active series instead of ~40K active series.
Goal
Reduce the footprint to a minimum by dropping all metrics not actively used in dashboards, rules, or alerts.
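As an illustration of the dropping approach (not the exact Kyma configuration; the ServiceMonitor name, port, and selector here are made-up examples), unused series can be discarded at scrape time with `metricRelabelings` on a prometheus-operator ServiceMonitor, so they never reach the TSDB:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: envoy-stats            # hypothetical name for illustration
  namespace: istio-system
spec:
  selector:
    matchLabels:
      app: envoy-stats         # hypothetical selector
  endpoints:
    - port: http-envoy-prom    # hypothetical port name
      metricRelabelings:
        # Drop all envoy_* series before ingestion into the TSDB
        - sourceLabels: [__name__]
          regex: "envoy_.*"
          action: drop
```

Relabeling at scrape time is preferable to recording rules or retention tuning here, because dropped series consume neither memory in the head block nor disk.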
Actions
Setting up a Kyma cluster and executing the tests, there are already 80,000 series returned. The total can be checked with
`prometheus_tsdb_head_series`
and broken down per scrape job with
`count by (job) ({__name__=~".+"})`
Here the top 10 results (list updated after solving the actions below):
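For completeness, the top 10 can be computed directly with `topk`; these are standard PromQL sketches building on the queries above (the job name is the one from the investigation below):

```promql
# Top 10 scrape jobs by number of active series
topk(10, count by (job) ({__name__=~".+"}))

# Drill down into a single job to find the heaviest metric names
topk(10, count by (__name__) ({__name__=~".+", job="istio-system/envoy-stats/0"}))
```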
Getting the numbers down for those top 10 jobs is crucial here; the following are some investigation results:
- [ ] istio metrics: Using this query shows that for Istio the most problematic series are the buckets:
  `count by (__name__) ({__name__=~".+", job="istio-system/envoy-stats/0"})`
  Here, an idea could be to remove the response codes on the server side as they are not used in the dashboards, see also https://github.com/istio/istio/issues/21551. That could maybe be done using the new Telemetry API and dropping the labels from the REQUEST_DURATION metric as documented here: https://istio.io/latest/docs/reference/config/telemetry/#MetricSelector-IstioMetric, see https://github.com/kyma-project/kyma/pull/13662
- [x] envoy metrics: check if `envoy_cluster_upstream_cx_connect_ms_bucket` and the two others are used anywhere in dashboards. We dropped the full envoy metrics, see https://github.com/kyma-project/kyma/issues/13659
- [x] apiserver metrics: `apiserver_request_duration_seconds_bucket` within https://github.com/kyma-project/kyma/issues/13386. `apiserver_request_latencies_bucket` got removed from the apiserver metrics in b31ce5f, yet it is still used in some Kyma-specific rules (e.g. k8s.rules); hence remove any references, see https://github.com/kyma-project/kyma/pull/13758/files
- [x] controller metrics: reduced usage of `rest_client_request_latency_seconds_bucket` within https://github.com/kyma-project/kyma/issues/13386
- [x] istio sidecar metrics: investigate the metrics `istio_tcp_connections_closed_total` and `istio_tcp_connections_opened_total` coming from the Istio sidecar, to drop some labels in order to reduce cardinality, as over time they can increase the number of unique timeseries in the TSDB of prometheus-istio-server. => As the labels are all quite static and do not change in relation to each other, there seems to be no relevant improvement.
- [ ] The following metrics (mostly histogram buckets) can be improved as well: `istio_requests_total`, `istio_request_bytes_bucket`, `istio_request_duration_milliseconds_bucket`, `istio_response_bytes_bucket`, see https://github.com/kyma-project/kyma/pull/13662
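The Telemetry API approach mentioned above can be sketched as follows. This is a minimal example, assuming the default `prometheus` metrics provider; the tag to drop (`response_code`) is an illustration of the "remove server-side response codes" idea, not necessarily the final set of tags chosen in the linked PR:

```yaml
apiVersion: telemetry.istio.io/v1alpha1
kind: Telemetry
metadata:
  name: drop-unused-tags      # hypothetical name for illustration
  namespace: istio-system     # mesh-wide when applied in the root namespace
spec:
  metrics:
    - providers:
        - name: prometheus
      overrides:
        # Remove the response_code tag from the request duration histogram,
        # collapsing all per-status-code bucket series into one set
        - match:
            metric: REQUEST_DURATION
            mode: CLIENT_AND_SERVER
          tagOverrides:
            response_code:
              operation: REMOVE
```

Removing a tag multiplies down the series count: each histogram bucket otherwise exists once per distinct `response_code` value, so dropping the tag merges those series.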