prometheus / prometheus

Server-side stored views, functions or macros for simplifying repetitive queries #14984

Open ringerc opened 2 months ago

ringerc commented 2 months ago

Proposal

Problem

PromQL is a verbose language, especially when there's widespread use of label matching on info-metrics, which is necessary to control TSDB index size and Prometheus memory use. This tends to lead to alert rules, monitoring applications, and dashboards containing a lot of very verbose boilerplate PromQL for common tasks like "add labels discovered from kube_pod_labels to the result timeseries".

The resulting repetitive PromQL scattered across multiple app configurations, dashboards, etc. makes it very difficult to evolve and change metric labels, rename time-series, and so on. A mistake or bug in a PromQL query often gets cargo-culted into hundreds of other queries.

Prometheus appears to lack any server-side way of storing and re-using such query fragments without materialising them into the TSDB as concrete time-series - which somewhat defeats the purpose of maintaining and joining on info-metrics.

Proposal

I propose the addition of server-side reusable PromQL query fragments. These fragments would be defined in Prometheus's configuration file and re-read on server configuration reload, much like recording rules and alerting rules.

The simplest way to implement these might be as "non-recording rules" - a name that is macro-expanded at PromQL parse time into the configured expression.

These rules would optionally be parameterised with named arguments, which expand into $variables inside the rule, with an optional default used when a value is not supplied. The arguments would be supplied using the existing selector syntax. This is necessary because PromQL's query executor lacks filter-condition push-down, so label filters usually have to be repeated across the selectors of all metrics before label matching is done.
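
As a minimal sketch of the idea (the rule kind, argument names, and syntax below are purely illustrative, not an existing Prometheus feature): a query containing my_view{env="prod"} would be rewritten at PromQL parse time into the configured expr with $env bound to "prod", and env falling back to its default when the invoking selector omits it.

groups:
  - name: views
    rules:
      - view: my_view            # hypothetical new rule kind
        args:
          env:
            default: ".*"        # used when the invoking selector omits env
        expr: |
          sum by (service) (http_requests_total{env=$env})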

Example

Consider an inventory query that exposes an application health metric and associates it with a set of relevant labels identifying the workload.

For this purpose I'll use the cnpg_collector_up metric from CloudNativePG, but any metric would do really.

This metric is to be enriched with additional Prometheus labels obtained from Kubernetes Pod labels and metadata, such as the kube node the workload is running on. This requires a large amount of PromQL boilerplate, and that boilerplate must also be qualified with repeated filters to restrict the subset of time-series fed into the label match, to avoid exceeding Prometheus's available memory and OOMing the entire Prometheus server (or, if a federated tool like Thanos is used, to ensure proper tenant selection and routing). This means a bunch of label selectors get repeated throughout the query too. For example purposes I'll use a kube_cluster label.

Workload labels that come from the underlying k8s cluster resources (in this case via kube-state-metrics) include the CNPG cluster/instance identifiers and application IDs from kube_pod_labels, the hosting node and pod IP from kube_pod_info, the container image from kube_pod_container_info, and the replica-source cluster id from kube_pod_annotations.

Example:

# This aggregation drops unwanted labels, since PromQL lacks a proper label_drop(...) function to drop non-cardinal labels that would error on non-unique series instead of summing them.
sum without(endpoint,instance,job,postgresql,role,prometheus,cluster,container,uid) (
    # This is the actual metric we're interested in
    cnpg_collector_up{kube_cluster="EXAMPLE_KUBE"}
    # join on kube_pod_labels for project-id, PGD info, etc
    * on (uid)
    group_left(cnpg_cluster_id,cnpg_instance_id,cnpg_instance_role,someapp_project_id,someapp_resource_id,pgd_group,pgd_cluster)
    # enrich with kube Pod label info from kube-state-metrics pod info metric.
    # note that the group by (...) expression repeats the labels used in both the on (...) join key and
    # the subject labels in group_left(...). This protects against churn when labels that aren't of
    # interest are added to or change on the info-metric. It's probably safe to write
    # group without(container,instance,job)
    # in this case, but it's better to make the query robust:
    group by (uid, cnpg_cluster_id,cnpg_instance_id,cnpg_instance_role,someapp_project_id,someapp_resource_id,pgd_group,pgd_cluster) (
        kube_pod_labels{kube_cluster="EXAMPLE_KUBE"}
    )
    # join on kube_pod_info for the node hosting the pod and the pod ip address. Data from kube-state-metrics pod info metric.
    * on (uid)
    group_left(pod_ip,node)
    group by (uid, pod_ip, node) (
        kube_pod_info{kube_cluster="EXAMPLE_KUBE"}
    )
    # join on kube_pod_container_info for the container image. Note that we join on container_id too; we could
    # instead filter for {container="postgres"}, but joining on the container_id as well is safer and guaranteed
    # to give a unique result. Data from the kube-state-metrics container info metric.
    * on (uid,container_id)
    group_left(image_spec,image_id)
    group by (uid,container_id,image_spec,image_id) (
        kube_pod_container_info{kube_cluster="EXAMPLE_KUBE"}
    )
    # join on kube_pod_annotations for the someapp remote replica cluster id, if any,
    # which kube-state-metrics projects from the Pod's annotations
    * on (uid)
    group_left(someapp_replica_source_cluster_id)
    group by (uid,someapp_replica_source_cluster_id) (
        kube_pod_annotations{kube_cluster="EXAMPLE_KUBE"}
    )
)

It's very verbose, but it's not that bad on its own - until you then have applications and dashboards that want to query and filter other metrics based on these workload labels.

If I have a set of 10 dashboard queries that all want to be able to filter the displayed set on the values of cnpg_cluster_id, someapp_project_id, someapp_resource_id, and/or someapp_replica_source_cluster_id, these long info-metric label-matching expressions start to propagate across the codebase - often into alert rules, multiple other apps, etc.

This could be factored out into a recording rule, but it's still verbose to join on even if it's only one time-series. Updates to it are delayed. And it generates a lot of new entries in the TSDB's inverted indexes, as well as costing extra memory at query time. All to duplicate data that's already in the TSDB. So expanding these sorts of expressions into recording rules isn't particularly desirable.

This proposal seeks to turn the above into something like a reusable expression in a configured rule, and a simplified query that references it, e.g. this "view" definition in Prometheus configuration:

groups:
  - name: view_example
    rules:
      - view: kube_pod_workload_info        # <--- new "view" rule kind
        args:                 # <-- Takes named arguments with defaults
          kube_cluster:
            default: ".*"
            description: Filter by k8s cluster name. Strongly recommended.
            required: false # args may be marked required: true, in which case omitting them at invocation makes expansion fail with an error
          uid:
            default: ".*"
            description: Filter by pod UID
          pod:
            # "default: .*" will be the implied default if omitted
            description: Filter by pod name
          namespace:
            description: Filter by kube pod namespace
          node:
            description: Filter by kube node name
          cnpg_cluster_id:
            description: Filter by CNPG cluster ID
          container:
            description: Filter by pod container name. Required to ensure unique matching of kube_pod_container_info because PromQL doesn't support many-to-many joins.
            required: true
        expr: |                 # <----  Expression that will be expanded into the invoking query and have arguments substituted
            group by (kube_cluster, namespace, pod, uid, cnpg_cluster_id, cnpg_instance_id, cnpg_instance_role, someapp_project_id, someapp_resource_id, pgd_group, pgd_cluster) (
                # Initial data filtering including cnpg_cluster_id happens here
                kube_pod_labels{kube_cluster=$kube_cluster, namespace=$namespace, cnpg_cluster_id=$cnpg_cluster_id, uid=$uid, pod=$pod}
            )
            # Enrich with pod IP and pod kube node name
            * on (kube_cluster, uid)
            group_left(pod_ip, node)
            group by (kube_cluster, uid, pod_ip, node) (
                # Node filter applied here if set
                # Other filters are repeated here because PromQL won't do predicate push-down, so filtering in the
                # metric selector reduces the amount of data materialized into memory before label matching and thus
                # reduces the amount of RAM that must be allocated to Prometheus to stop it OOMing on large queries.
                kube_pod_info{kube_cluster=$kube_cluster, namespace=$namespace, uid=$uid, pod=$pod, node=$node}
            )
            # Enrich with pod container id, image id, image spec
            * on (kube_cluster, uid)
            group_left(image_spec, image_id, container, container_id)
            group by (kube_cluster, uid, image_spec, image_id, container, container_id) (
                # container filter for container name is applied here
                # It is required because PromQL can't do many-to-many joins, and non-unique matches by container would
                # generate multiple timeseries.
                kube_pod_container_info{kube_cluster=$kube_cluster, namespace=$namespace, pod=$pod, uid=$uid, container=$container}
            )
            # enrich with someapp_replica_source_cluster_id
            * on (kube_cluster, uid)
            group_left(someapp_replica_source_cluster_id)
            group by (kube_cluster, uid, someapp_replica_source_cluster_id) (
                kube_pod_annotations{kube_cluster=$kube_cluster, namespace=$namespace, pod=$pod, uid=$uid}
            )

The view would then be invoked like this in PromQL:

sum without(endpoint,instance,job,uid,container) (
    # This is the actual metric we're interested in
    cnpg_collector_up{kube_cluster="EXAMPLE_KUBE"}
    # enrich with workload metadata for inventory
    * on (uid)
    group_left(namespace, cnpg_cluster_id, cnpg_instance_id, cnpg_instance_role, someapp_project_id, someapp_resource_id, pgd_group, pgd_cluster)
    kube_pod_workload_info{kube_cluster="EXAMPLE_KUBE"}     # <-- Expands to the "kube_pod_workload_info" view from the config, with parameters bound and expanded
)

Now if I want to define alert rules, additional dashboards, etc. that filter on or display those same workload labels, it becomes a simple, reusable job to collect them and attach the workload metadata that users need to understand which workload is affected - though it might get very memory-expensive if the PromQL executor has no means of doing a nested-loop join:

# Find down postgres instances across all kube clusters
(cnpg_collector_up{} == 0)
* on (kube_cluster, uid)
  group_left(kube_cluster, namespace, cnpg_cluster_id, cnpg_instance_id, cnpg_instance_role, someapp_project_id, someapp_resource_id, pgd_group, pgd_cluster)
  kube_pod_workload_info{}

# get cpu usage for a specific cnpg cluster workload's pods
container_cpu_usage_seconds_total
* on (kube_cluster, uid)
  group_left(kube_cluster, namespace, cnpg_cluster_id, cnpg_instance_id, cnpg_instance_role, someapp_project_id, someapp_resource_id, pgd_group, pgd_cluster)
  kube_pod_workload_info{cnpg_cluster_id="$some_cluster_id_here"}

Potential problems

The Prometheus query executor does not appear (based on the docs and my reading so far) to support different strategies for label matching, re-ordering of label matches (joins) for efficiency, or predicate push-down into metric selectors and subqueries. These features are found in most relational database execution engines, but there doesn't seem to be any equivalent in Prometheus.

Without some kind of dynamic query planning it's necessary to hand-tune the order of expressions in individual queries to minimise the width of data that must be materialized in memory before label-matching is performed. So incautious use of these proposed views could make the existing problems with managing Prometheus OOMs on large queries worse. This can be managed somewhat by careful manual selection of join order in the view, and liberal use of explicit filters in the selectors of all metrics consumed in a given view expression.

The resource use issues are compounded by the apparent lack of any sort of admission control or query-level resource limitation in Prometheus. It has sample count limits (--storage.remote.read-sample-limit and --query.max-samples), but no memory accounting for the labels associated with those samples, so queries over "wide" series can exhaust memory while staying under the limit, while reasonable queries over "narrow" series can fail because of the sample count limit. Prometheus will try to run everything until a memory allocation fails, then OOM or panic, forcing a restart of the entire Prometheus instance if available RAM is exhausted. Depending on deployment architecture this may interrupt sample ingestion, so memory-hungry queries can accidentally DoS the Prometheus service. The proposed views feature might make it easier to accidentally write memory-hungry queries, so careful guidelines for its use would be required.

Alternatives

User-defined golang functions

Per https://github.com/prometheus/prometheus/issues/4522 there's discussion of supporting user-defined functions in PromQL. These could provide an alternative extension point, but there's no supported, robust way to do this today, and it'd likely be a much more verbose and harder-to-maintain way to handle simple query reuse.

Use recording rules

Using a regular recording rule for these purposes can work, but it wastes TSDB storage on more unnecessary info-metric samples - and more importantly, the resulting wide, high-cardinality info-metrics have a large impact on the TSDB's inverted index size and thus retrieval efficiency. Smaller indexes good, bigger indexes bad.

Prometheus does not appear to support the creation of user-defined indexes for subsets of data, nor does it have a query planner that could make use of such indexes if they existed. So there's no way to make a separate set of indexes for specific wide metrics and avoid "polluting" the main indexes with their entries.

Recording rules also introduce delays for visibility of the data, which a simple expansion system would not.

And recording rules' definitions do not change retroactively (without complex manual backfilling activity) so there's a long delay between when a rule is updated and when the changes have taken effect on a reasonable look-back window into the data of interest. Queries must be carefully written to account for this.
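
For comparison, here is a sketch of the recording-rule form of the join above (the group name, interval, and recorded series name are mine, and the expression is abridged from the earlier example). Every evaluation writes another set of these wide, high-cardinality series into the TSDB:

groups:
  - name: workload_info_recording
    interval: 1m
    rules:
      - record: kube_pod:workload_info:join
        expr: |
          group by (kube_cluster, namespace, pod, uid, cnpg_cluster_id, cnpg_instance_id, cnpg_instance_role, someapp_project_id, someapp_resource_id, pgd_group, pgd_cluster) (
              kube_pod_labels
          )
          * on (kube_cluster, uid)
          group_left(pod_ip, node)
          group by (kube_cluster, uid, pod_ip, node) (
              kube_pod_info
          )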

Use PromQL subqueries

A PromQL subquery is a bit like an inline recording rule: it lets an inner expression be evaluated over a range at a chosen resolution without materialising anything into the TSDB. But it still has to be written out in full in every query that uses it, so it doesn't help with re-use across queries.
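
For example, this standard subquery (generic metric name, not from the setup above) computes the peak 5-minute request rate over the last hour inline, at 1m resolution, where a recording rule would otherwise have been needed:

# Highest 5m request rate seen over the last hour, evaluated at 1m resolution
max_over_time(rate(http_requests_total[5m])[1h:1m])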

with or let statements

https://github.com/prometheus/prometheus/issues/6146 proposes a "with" or "let" syntax to allow part of a query to be factored out and re-used, whether it's a scalar or a complex subquery or group expression.

This would be very valuable, but would not solve the sharing problem across queries, such as in alert rules. It would, however, go very well with this proposed feature, as one might write let some_common_query = server_side_query('some_common_query_fragment') as a way of loading a server-side fragment and sharing the syntax.
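
A purely hypothetical sketch of how the two could combine (neither let nor server_side_query exists today; the names are illustrative only):

# Bind the stored server-side fragment once, then reuse it within the query
let workload = server_side_query('kube_pod_workload_info')
(cnpg_collector_up == 0)
  * on (uid)
  group_left(cnpg_cluster_id, someapp_project_id, someapp_resource_id)
  workload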

See also

Server-side views, functions, macros, user-defined functions, and other extension points:

Resource control:

beorn7 commented 1 month ago

Two notes here (not meant to be conclusive for the bigger topic as a whole):

ringerc commented 1 month ago

Thanks very much @beorn7 . That makes sense.

Really glad to see possible improvements in info-metric usability coming, though the current proposal's hardcoded assumption that the join-key labels are job,instance severely limits its utility. I left a comment here: https://github.com/prometheus/prometheus/pull/14495#issuecomment-2392624864 . The proposed feature also has functionality that would help solve the staleness and enumeration-style metric querying problems described in https://github.com/prometheus/prometheus/issues/11132.

It just occurred to me that a simple approach to handling the repetition might be to implement this as a Prometheus query proxy - which could be deployed as a sidecar on Prometheus workloads in kube. Like https://github.com/prometheus-community/prom-label-proxy, but parsing the PromQL and expanding placeholder expressions instead.

That might be a reasonable way to explore the feasibility of this. I don't expect to have time to bang out a prototype for it in a hurry but it looks like a fun project.
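
Sketched at the query level (the view name and expansion are taken from the example earlier in this issue; the proxy itself is hypothetical), the proxy would accept the short form and forward the expanded form to Prometheus:

# Query sent to the proxy:
cnpg_collector_up * on (uid) group_left(node) kube_pod_workload_info{kube_cluster="EXAMPLE_KUBE"}

# Query the proxy would forward to Prometheus after expansion (abridged to a single join for brevity):
cnpg_collector_up * on (uid) group_left(node) (
    group by (uid, node) (
        kube_pod_info{kube_cluster="EXAMPLE_KUBE"}
    )
)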

ringerc commented 3 weeks ago

Also related: https://github.com/prometheus/prometheus/issues/13625 since the repetitive nature of PromQL becomes a particular problem when emulating left joins.

ringerc commented 2 weeks ago

See also the related https://github.com/prometheus/prometheus/issues/6146, a request for a let/with expression that would help with some of the repetition, and the MetricsQL WITH templates feature (https://victoriametrics.com/promql/expand-with-exprs).

It'd make sense to implement server-side query fragments by providing a means of "loading" a query fragment like a WITH template, e.g.:

WITH saved_query(my_canned_query)
my_metric
* on (join_labels) group_left(info_labels) my_canned_query