knative / serving

Kubernetes-based, scale-to-zero, request-driven compute
https://knative.dev/docs/serving/
Apache License 2.0

Metrics cardinality is too high #11248

Closed skonto closed 2 years ago

skonto commented 3 years ago

/area monitoring

When using Prometheus, a standard principle is to keep metric cardinality low; low cardinality is a key concept in monitoring in general and a key design principle in the latest standards. Although Prometheus has made steps toward more flexibility in the past, current versions still effectively force you to limit the number of ingested time series in order to control memory, and the memory consumption Knative metrics impose is not low. One way to estimate the memory is using this calculator.

Right now we have a lot of metrics that use the revision name, config name, pod name and namespace name as labels. To mention a few: the activator (request_latencies) and autoscaler (reconciler) time series have a complexity of #histogram_buckets × #revisions × #namespaces, and the webhook emits similar histogram metrics whose cardinality depends on the number of kinds and namespaces. To understand the scale: if we use 30 buckets (aggregated from several histograms), 100 services and 50 namespaces, this means 150K time series from one pod. We have several pods, and Eventing is not even in the picture yet, where we have high cardinality due to event_type, filter_type etc. In the calculator above, 1M time series under specific assumptions needs around 4GB of memory; given the number of pods we use, we can easily reach that number. We already face this downstream. Here is a sample status report for the top series on Prometheus when using 100 services:

[image: Prometheus status report of the top time series]
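To make that multiplication concrete, here is a minimal, hypothetical sketch using the OpenCensus Go API that knative.dev/pkg/metrics builds on; the measure name, tag keys and bucket boundaries below are illustrative assumptions, not Knative's exact definitions:

```go
package main

import (
	"log"

	"go.opencensus.io/stats"
	"go.opencensus.io/stats/view"
	"go.opencensus.io/tag"
)

var (
	// Illustrative measure, similar in shape to the activator's request_latencies.
	latencyMs = stats.Float64("request_latencies", "request latency", stats.UnitMilliseconds)

	// Fine-grained tag keys: every distinct combination of values
	// becomes its own set of time series.
	nsKey  = tag.MustNewKey("namespace_name")
	revKey = tag.MustNewKey("revision_name")
	cfgKey = tag.MustNewKey("configuration_name")
	podKey = tag.MustNewKey("pod_name")
)

func main() {
	// Exported to Prometheus, each unique (namespace, revision, config, pod)
	// combination contributes one series per bucket plus _sum and _count.
	// With the numbers above (30 buckets, 100 services, 50 namespaces) that is
	// roughly 30 x 100 x 50 = 150,000 series from a single scrape target,
	// before pod restarts introduce fresh pod_name values.
	err := view.Register(&view.View{
		Name:        "request_latencies",
		Description: "request latency distribution",
		Measure:     latencyMs,
		TagKeys:     []tag.Key{nsKey, revKey, cfgKey, podKey},
		Aggregation: view.Distribution(
			1, 2, 5, 10, 20, 50, 100, 200, 500, 1000,
			2000, 5000, 10000, 20000, 50000, // illustrative boundaries only
		),
	})
	if err != nil {
		log.Fatal(err)
	}
}
```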

Also note that we haven't taken into consideration the scenario where a pod name changes due to a restart (which can happen easily). A Prometheus instance is not meant to serve only Knative, so in general we should tune our metrics API. I propose we limit our labels to the namespace level, not per revision; logging, not metrics, should be used to understand the behavior of individual services. We also need to reconsider the histograms in the webhook and controller cases, since buckets make cardinality explode.
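As a rough sketch of that proposal (same hypothetical names and OpenCensus API as above), keeping only the namespace tag key collapses the series count and makes it independent of revision churn and pod restarts:

```go
package main

import (
	"log"

	"go.opencensus.io/stats"
	"go.opencensus.io/stats/view"
	"go.opencensus.io/tag"
)

var (
	latencyMs = stats.Float64("request_latencies", "request latency", stats.UnitMilliseconds)
	nsKey     = tag.MustNewKey("namespace_name") // the only tag key kept
)

func main() {
	// Namespace-level aggregation: series ~= #buckets x #namespaces,
	// e.g. 30 x 50 = 1,500 instead of 150,000, and unaffected by
	// revision churn or pod restarts.
	err := view.Register(&view.View{
		Name:        "request_latencies",
		Description: "request latency, aggregated per namespace",
		Measure:     latencyMs,
		TagKeys:     []tag.Key{nsKey}, // no revision_name, configuration_name or pod_name
		Aggregation: view.Distribution(1, 2, 5, 10, 20, 50, 100, 200, 500, 1000),
	})
	if err != nil {
		log.Fatal(err)
	}
}
```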

What version of Knative?

All versions

Expected Behavior

Metrics should have low cardinality.

Actual Behavior

An excessive number of time series is created.

Steps to Reproduce the Problem

Create a moderate number of namespaces and ksvcs.

/cc @evankanderson @mattmoor @markusthoemmes

evankanderson commented 3 years ago

Is part of this that Prometheus doesn't have a native histogram mechanism?

In my experience, histograms (compared with mean/max) are very useful for understanding the frequency of outliers and (for example) bimodal distributions on events.

I'm slightly confused by this math:

Right now we have a lot of metrics that use the revision name, config name, pod name and namespace name as labels. To mention a few: the activator (request_latencies) and autoscaler (reconciler) time series have a complexity of #histogram_buckets × #revisions × #namespaces, and the webhook emits similar histogram metrics whose cardinality depends on the number of kinds and namespaces. To understand the scale: if we use 30 buckets (aggregated from several histograms), 100 services and 50 namespaces, this means 150K time series from one pod.

Are you suggesting 100 services each in 50 namespaces (i.e. a total of 5000 services)? Having a single process manage 5000 services does seem a bit high.

30 buckets also seems a bit high -- if we assume 3 points per power of 10 (1, 2, 5, 10), then 30 buckets would support a range of 10^10 (1us to 3 hours, for example). If we constrain things to 1ms to 1 minute, then we end up with the following buckets: 1, 2, 5, 10, 20, 50, 100, 200, 500, 1000, 2000, 5000, 10000, 20000, 50000, infinity (16 buckets)
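For reference, a hedged sketch of how such a boundary list would look with the OpenCensus Distribution aggregation that Knative's Go metrics code builds on (the variable name is hypothetical; the boundaries are just the example above): fifteen explicit upper bounds yield sixteen buckets, the last being the implicit overflow bucket.

```go
package main

import "go.opencensus.io/stats/view"

// 15 explicit upper bounds (in milliseconds) produce 16 buckets when exported,
// the last being the implicit overflow bucket (50000, +Inf].
var coarseLatencyDistribution = view.Distribution(
	1, 2, 5, 10, 20, 50, 100, 200, 500,
	1000, 2000, 5000, 10000, 20000, 50000,
)

func main() {
	_ = coarseLatencyDistribution // attached to a view's Aggregation field elsewhere
}
```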

I agree that we should probably be thoughtful about our cardinality. In this case, activator latency is probably interesting for Revisions which could receive a request (i.e. which have a Route targeting them). At a given moment, I'd expect the ratio between "live Revision" and "Service" to be 2x or less over a large population, but over time the number of historical Revisions which aren't currently emitting data will grow; I'm not sure how Prometheus handles that type of cardinality.

Short answer:

skonto commented 3 years ago

@evankanderson In our original issue we had 17 (webhook buckets) × 54 (namespaces) × 32 (kinds) = 29,376 time series, which was a bit too much compared to the API server's series at that point in time.

Are you suggesting 100 services each in 50 namespaces (i.e. a total of 5000 services)? Having a single process manage 5000 services does seem a bit high.

There are buckets from different histograms, e.g. reconciler (5) and activator latency (25). This is a rough estimate.

If you lower the cardinality here, then it should not be a problem. The rule of thumb is not to have more than 10 distinct time series per metric, at least for systems that don't support high cardinality (Prometheus is a major one). Google Cloud has its own limits too (200K time series per resource). I am wondering what your experience has been so far, e.g. for revisions in practice.

We don't have a "map" or "budget" for metrics output; my preference would be to have something like the OTel collector provide filtering for cases where our default metrics are too expensive; this could either be dropping buckets (e.g. 1,10,100 instead of 1,2,5,10) or dropping whole timeseries.

This is super critical because, for example, in Eventing there is a big issue when a service goes up and down (as here), since there is a tag called name that specifies the target service:

[images: screenshots of the per-target-name time series]

This continuously creates new time series, which is obviously not OK.

About the aggregation level: yes, we need to discuss it, or at least capture it correctly in the new metrics API, but I hope to come up with a fix in the meantime because I suspect a lot of users will face this issue.

skonto commented 3 years ago

@vaikas wdyt?

github-actions[bot] commented 3 years ago

This issue is stale because it has been open for 90 days with no activity. It will automatically close after 30 more days of inactivity. Reopen the issue with /reopen. Mark the issue as fresh by adding the comment /remove-lifecycle stale.