Closed a-thaler closed 2 years ago
This issue has been automatically marked as stale due to the lack of recent activity. It will soon be closed if no further activity occurs. Thank you for your contributions.
We will stop after this first round of optimizations; mainly the Istio metrics still have potential for improvement.
Description
The current Prometheus setup is non-scalable and has limits; these limits need to be determined and documented well, see also https://github.com/kyma-project/kyma/issues/10033.
However, we can already see that the base metrics footprint is far too big. Collecting such an amount of timeseries as a baseline is too expensive and needlessly stresses the current setup.
A quote from somewhere:
Typically, a 3-node Gardener cluster with Prometheus-Operator will generate ~40K active series. Depending on use cases, you might not actually use all of those metrics; for example, by only sending the series used in the default provided K8S dashboards, you'll be using ~8K active series instead of ~40K active series.
Goal
Reduce the footprint to a minimum by dropping all metrics not actively used in dashboards, rules, or alerts.
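As an illustration of the dropping approach (not the exact Kyma configuration; the ServiceMonitor name, port, and selector here are made-up examples), unused series can be discarded at scrape time with `metricRelabelings` on a prometheus-operator ServiceMonitor, so they never reach the TSDB:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: envoy-stats            # hypothetical name for illustration
  namespace: istio-system
spec:
  selector:
    matchLabels:
      app: envoy-stats         # hypothetical selector
  endpoints:
    - port: http-envoy-prom    # hypothetical port name
      metricRelabelings:
        # Drop all envoy_* series before ingestion into the TSDB
        - sourceLabels: [__name__]
          regex: "envoy_.*"
          action: drop
```

Relabeling at scrape time is preferable to recording rules or retention tuning here, because dropped series consume neither memory in the head block nor disk.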
Actions
Setting up a Kyma cluster and executing the tests, there are already 80,000 series returned. The total can be checked with
`prometheus_tsdb_head_series`
and broken down per scrape job with
`count by (job) ({__name__=~".+"})`
Here the top 10 results (list updated after solving the actions below):
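For completeness, the top 10 can be computed directly with `topk`; these are standard PromQL sketches building on the queries above (the job name is the one from the investigation below):

```promql
# Top 10 scrape jobs by number of active series
topk(10, count by (job) ({__name__=~".+"}))

# Drill down into a single job to find the heaviest metric names
topk(10, count by (__name__) ({__name__=~".+", job="istio-system/envoy-stats/0"}))
```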
Getting the numbers down for those top 10 jobs is crucial here; the following are some investigation results:
- [ ] istio metrics: Using this query shows that for Istio the most problematic series are the buckets:
  `count by (__name__) ({__name__=~".+", job="istio-system/envoy-stats/0"})`
  Here, an idea could be to remove the response codes on the server side as they are not used in the dashboards, see also https://github.com/istio/istio/issues/21551. That could maybe be done using the new Telemetry API and dropping the labels from the REQUEST_DURATION metric as documented here: https://istio.io/latest/docs/reference/config/telemetry/#MetricSelector-IstioMetric, see https://github.com/kyma-project/kyma/pull/13662
- [x] envoy metrics: check if `envoy_cluster_upstream_cx_connect_ms_bucket` and the two others are used anywhere in dashboards. We dropped the full envoy metrics, see https://github.com/kyma-project/kyma/issues/13659
- [x] apiserver metrics: `apiserver_request_duration_seconds_bucket` within https://github.com/kyma-project/kyma/issues/13386. `apiserver_request_latencies_bucket` got removed from the apiserver metrics in b31ce5f, yet it is still used in some Kyma-specific rules (e.g. k8s.rules); hence remove any references, see https://github.com/kyma-project/kyma/pull/13758/files
- [x] controller metrics: reduced usage of `rest_client_request_latency_seconds_bucket` within https://github.com/kyma-project/kyma/issues/13386
- [x] istio sidecar metrics: investigate the metrics `istio_tcp_connections_closed_total` and `istio_tcp_connections_opened_total` coming from the Istio sidecar, to drop some labels in order to reduce cardinality, as over time they can increase the number of unique timeseries in the TSDB of prometheus-istio-server. => As the labels are all quite static and do not change in relation to each other, there seems to be no relevant improvement.
- [ ] The following metrics (mostly histogram buckets) can be improved as well: `istio_requests_total`, `istio_request_bytes_bucket`, `istio_request_duration_milliseconds_bucket`, `istio_response_bytes_bucket`, see https://github.com/kyma-project/kyma/pull/13662
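The Telemetry API approach mentioned above can be sketched as follows. This is a minimal example, assuming the default `prometheus` metrics provider; the tag to drop (`response_code`) is an illustration of the "remove server-side response codes" idea, not necessarily the final set of tags chosen in the linked PR:

```yaml
apiVersion: telemetry.istio.io/v1alpha1
kind: Telemetry
metadata:
  name: drop-unused-tags      # hypothetical name for illustration
  namespace: istio-system     # mesh-wide when applied in the root namespace
spec:
  metrics:
    - providers:
        - name: prometheus
      overrides:
        # Remove the response_code tag from the request duration histogram,
        # collapsing all per-status-code bucket series into one set
        - match:
            metric: REQUEST_DURATION
            mode: CLIENT_AND_SERVER
          tagOverrides:
            response_code:
              operation: REMOVE
```

Removing a tag multiplies down the series count: each histogram bucket otherwise exists once per distinct `response_code` value, so dropping the tag merges those series.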