Open ringerc opened 2 months ago
Two notes here (not meant to be conclusive for the bigger topic as a whole):
Thanks very much @beorn7 . That makes sense.
Really glad to see possible improvements in info-metric usability coming, though the current proposal's hardcoded assumption that the join-key labels are `job`,`instance` severely limits its utility.
Left a comment here: https://github.com/prometheus/prometheus/pull/14495#issuecomment-2392624864 . The proposed feature has functionality that would help solve problems with staleness and enumeration-style metric querying described in https://github.com/prometheus/prometheus/issues/11132 too.
It just occurred to me that a simple approach to handling the metric repetition might be to implement this as a Prometheus query proxy, which could be deployed as a sidecar on Prometheus workloads in kube. Like https://github.com/prometheus-community/prom-label-proxy, but parsing the PromQL and expanding placeholder expressions instead.
That might be a reasonable way to explore the feasibility of this. I don't expect to have time to bang out a prototype for it in a hurry but it looks like a fun project.
Also related: https://github.com/prometheus/prometheus/issues/13625 since the repetitive nature of PromQL becomes a particular problem when emulating left joins.
See also related https://github.com/prometheus/prometheus/issues/6146 for a request for a let/where expression that would help with some of the repetition. And the MetricsQL WITH templates feature (https://victoriametrics.com/promql/expand-with-exprs) .
It'd make sense to implement server-side query fragments by providing a means of "loading" a query fragment like a WITH template:

```
WITH saved_query(my_canned_query)
my_metric
  * on (join_labels) group_left(info_labels) my_canned_query
```
Proposal
Problem
PromQL is a verbose language, especially when there's widespread use of label matching on info-metrics, which is necessary to control TSDB index size and Prometheus memory use. This tends to lead to alert rules, monitoring applications, and dashboards containing a lot of very verbose boilerplate PromQL for common tasks like "add labels discovered from `kube_pod_labels` to the result timeseries".

The resulting repetitive PromQL scattered across multiple app configurations, dashboards, etc. makes it very difficult to evolve and change metric labels, rename time-series, and so on. A mistake or bug in a PromQL query often gets cargo-culted across into hundreds of other queries.
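The boilerplate pattern in question typically looks something like this sketch (kube-state-metrics exposes Kubernetes pod labels as `label_*` labels on `kube_pod_labels`; the metric and label names here are otherwise illustrative):

```promql
some_app_metric
  * on (namespace, pod)
    group_left (label_team, label_environment)
  kube_pod_labels
```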
Prometheus appears to lack any server-side way of storing and re-using such query fragments without materialising them into the TSDB as concrete time-series - which somewhat defeats the purpose of maintaining and joining on info-metrics.
Proposal
I propose the addition of server-side reusable PromQL query fragments. These fragments would be defined in Prometheus's configuration file and re-read on server configuration reload, much like recording rules and alerting rules.
The simplest way to implement these might be as "non-recording rules" - a name that is macro-expanded at PromQL parse time into the configured expression.
These rules would optionally be parameterised with named arguments, which expand into
$variables
inside the rule, with an optional default if the value is not supplied. The arguments will be supplied using the existing selector syntax. This is necessary because PromQL's query executor lacks filter-condition push-down capability so label filters usually have to be repeated across the selectors of all metrics before label-matching is done.Example
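A minimal sketch of how a parameterised "non-recording rule" and its expansion might behave (the `non_recording_rules` key, `params` block, and `$filter` syntax are all hypothetical):

```yaml
# Hypothetical configuration sketch; none of these key names exist today.
non_recording_rules:
  - name: pods_on_node
    params:
      filter: '{}'          # default used when no argument is supplied
    expr: kube_pod_info{$filter}
```

Invoked as `pods_on_node{node="node-1"}`, this would macro-expand at PromQL parse time into `kube_pod_info{node="node-1"}`.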
Consider an inventory query that exposes an application health metric and associates it with a set of relevant labels identifying the workload.
For this purpose I'll use the `cnpg_collector_up` metric from CloudNativePG, but any metric would do really.

This metric is to be enriched with additional Prometheus labels obtained from Kubernetes `Pod` labels and metadata, such as the kube node the workload is running on. This requires a large amount of PromQL boilerplate. That boilerplate must also be qualified with repeated filters that restrict the subset of time-series fed into the label match, to avoid exceeding Prometheus's available memory and OOMing the entire Prometheus server (or, if a federated tool like Thanos is used, to ensure proper tenant selection and routing). This means that a bunch of label selectors get repeated throughout the query too. For example purposes I'll use a `kube_cluster` label.

Workload labels that come from the underlying k8s cluster resources (in this case via kube-state-metrics) include, on `Pod`: `cnpg_cluster_id`, `cnpg_instance_id`, `cnpg_instance_role`, `someapp_project_id`, `someapp_resource_id`, `pgd_group`, `pgd_cluster`.
Example:
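The original example isn't reproduced here, but a sketch of the kind of query meant would be (join labels and selector filters are assumptions):

```promql
cnpg_collector_up{kube_cluster="cluster-a"}
  * on (kube_cluster, namespace, pod)
    group_left (cnpg_cluster_id, cnpg_instance_role,
                someapp_project_id, someapp_resource_id,
                pgd_group, pgd_cluster)
  kube_pod_labels{kube_cluster="cluster-a"}
```

Note how the `kube_cluster` filter has to be repeated on both selectors to keep the inputs to the label match narrow.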
It's very verbose, but it's not that bad, until you then have applications and dashboards that want to query and filter other metrics based on these workload labels.
If I have a set of 10 dashboard queries that all want to be able to filter the displayed set on the values of `cnpg_cluster_id`, `someapp_project_id`, `someapp_resource_id`, and/or `someapp_replica_source_cluster_id`, these long info-metric label-matching expressions start to propagate across the codebase, often into alert rules, multiple other apps, etc.

This could be factored out into a recording rule, but it's still verbose to join on even if it's only one time-series. Updates to it are delayed. And it generates a lot of new entries in the TSDB's inverted indexes, as well as costing extra memory at query time. All to duplicate data that's already in the TSDB. So expanding these sorts of expressions into recording rules isn't particularly desirable.
This proposal seeks to turn the above into something like a reusable expression in a configured rule, and a simplified query that references it, e.g. this "view" definition in Prometheus configuration:
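Since the original definition isn't shown here, the following is only a hedged sketch of what such a configuration might look like; the `views` key, the `$matchers` parameter expansion, and the join labels are all hypothetical:

```yaml
# Hypothetical "view" definition; no such configuration exists today.
views:
  - name: cnpg_up_with_workload_labels
    params:
      matchers: ''        # default: no extra label matchers
    expr: |
      cnpg_collector_up{$matchers}
        * on (kube_cluster, namespace, pod)
          group_left (cnpg_cluster_id, someapp_project_id,
                      someapp_resource_id, pgd_group, pgd_cluster)
        kube_pod_labels{$matchers}
```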
would be invoked like this in PromQL:
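For instance, assuming a hypothetical view named `cnpg_up_with_workload_labels` that is referenced like a metric, with label matchers passed through as arguments to every underlying selector:

```promql
cnpg_up_with_workload_labels{kube_cluster="cluster-a", someapp_project_id="1234"}
```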
Now if I want to define alert rules, additional dashboards, etc, that filter on or display those same workload labels it becomes a simple, reusable job to collect them and attach the workload metadata needed for users to understand which workload is affected - though it might get very memory-expensive if the PromQL executor has no means of doing a nested-loop join:
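An alert rule reusing such a view might then look like this sketch (the view name and labels are illustrative):

```yaml
groups:
  - name: cnpg-alerts
    rules:
      - alert: CNPGInstanceDown
        # Hypothetical view reference; expands to the full join at parse time.
        expr: cnpg_up_with_workload_labels{kube_cluster="cluster-a"} == 0
        for: 5m
        annotations:
          summary: >-
            CNPG instance down for project {{ $labels.someapp_project_id }}
            (cluster {{ $labels.cnpg_cluster_id }})
```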
Potential problems
The Prometheus query executor does not appear (based on docs and on my reading so far) to support the use of different strategies for label matching, re-ordering of label matching (joins) for efficiency, nor predicate push-down into metric selectors and subqueries. These features are found in most relational database execution engines but there doesn't seem to be any equivalent in Prometheus.
Without some kind of dynamic query planning it's necessary to hand-tune the order of expressions in individual queries to minimise the width of data that must be materialized in memory before label-matching is performed. So incautious use of these proposed views could make the existing problems with managing Prometheus OOMs on large queries worse. This can be managed somewhat by careful manual selection of join order in the view, and liberal use of explicit filters in the selectors of all metrics consumed in a given view expression.
The resource use issues are compounded by the apparent lack of any sort of admission control or query-level resource limitation in Prometheus. It has sample count limits (`storage.remote.read-sample-limit` and `query.max-samples`), but no memory accounting for the labels associated with those samples, so "wide" samples can blow out memory while staying within the limit, while reasonable queries on "narrow" samples could fail because of the sample count limit. It will try to run everything until a memory allocation fails, then OOM or panic, forcing a restart of the entire Prometheus instance if available RAM is exhausted. Depending on deployment architecture this may interrupt sample ingestion. So memory-hungry queries can accidentally DoS the Prometheus service. The proposed views feature might make it easier to accidentally write memory-hungry queries, so careful guidelines for its use would be required.

Alternatives
User-defined golang functions
Per https://github.com/prometheus/prometheus/issues/4522 there's discussion of supporting user-defined functions in PromQL. These could provide an alternative extension point. But there's no supported, robust way to do this, and it'd likely be a much more verbose and hard-to-maintain way to handle simple query reuse.
Use recording rules
Using a regular recording rule for these purposes can work, but it wastes TSDB storage on more unnecessary info-metric samples - and more importantly, the resulting wide, high-cardinality info-metrics have a large impact on the TSDB's inverted index size and thus retrieval efficiency. Smaller indexes good, bigger indexes bad.
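For comparison, the recording-rule version of the same join might look like this sketch (rule and label names illustrative); every evaluation writes the joined, wide series back into the TSDB:

```yaml
groups:
  - name: workload-labels
    interval: 1m
    rules:
      - record: workload:cnpg_collector_up:labeled
        expr: |
          cnpg_collector_up
            * on (kube_cluster, namespace, pod)
              group_left (cnpg_cluster_id, someapp_project_id, someapp_resource_id)
            kube_pod_labels
```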
Prometheus does not appear to support the creation of user-defined indexes for subsets of data, nor does it have a query planner that could make use of such indexes if they existed. So there's no way to make a separate set of indexes for specific wide metrics and avoid "polluting" the main indexes with their entries.
Recording rules also introduce delays for visibility of the data, which a simple expansion system would not.
And recording rules' definitions do not change retroactively (without complex manual backfilling activity) so there's a long delay between when a rule is updated and when the changes have taken effect on a reasonable look-back window into the data of interest. Queries must be carefully written to account for this.
Use PromQL subqueries
A PromQL subquery is a bit like an inline recording rule, but it must still be written out in full in every query that uses it, so it doesn't help with reuse across queries.

`with` or `let` statements

https://github.com/prometheus/prometheus/issues/6146 proposes a "with" or "let" syntax to allow part of a query to be factored out and re-used, whether it's a scalar or a complex subquery or group expression.
This would be very valuable, but would not solve the sharing problem across queries, such as in alert rules. It would however go very well with this proposed feature, as one might write

```
let some_common_query = server_side_query('some_common_query_fragment')
```

as a way of loading it and sharing the syntax.

See also
Server-side views, functions, macros, user-defined functions, and other extension points:
Resource control: