grafana / alloy

OpenTelemetry Collector distribution with programmable pipelines
https://grafana.com/oss/alloy
Apache License 2.0

Some Synced PrometheusRules are missing external labels #356

Open rknightion opened 1 year ago

rknightion commented 1 year ago

What's wrong?

When making use of mimir.rules.kubernetes to sync PrometheusRules, some rules are synced with the expected external_labels applied during the remote write step, whereas others are completely missing these external_labels.

As a result, our eventual groupings in Mimir Alertmanager miss some of these rules.

So far the main difference I've noticed between the two sets of rules is that "built-in" rules (from kubernetes-mixin, for example) seem to retain the labels, whereas ones created by our own Helm charts or by other upstream projects do not.

These labels do not appear in any of the PrometheusRule CRDs themselves, so I assume they are being added by the remote-write capability and that, if there is a bug, it's that the external_labels aren't being applied to all synced rules.

Steps to reproduce

Set up mimir.rules.kubernetes to sync all PrometheusRules, along with prometheus.remote_write external labels:

mimir.rules.kubernetes "default" {
  address = "https://mimir-gw"
  mimir_namespace_prefix = "cluster-name"
  tenant_id = ""
  basic_auth {
    username = ""
    password = ""
  }
}

prometheus.remote_write "grafana_cloud_prometheus" {
  endpoint {
    url = nonsensitive(local.file.prometheus_host.content) + "/api/v1/push"
    headers = { "X-Scope-OrgID" = local.file.prometheus_tenantid.content }

    basic_auth {
      username = local.file.prometheus_username.content
      password = local.file.prometheus_password.content
    }

  }
  external_labels = {
    customer = "Internal",
    envtype = "prod",
    mimircluster = "pop-prod",
    product = "MyProduct",
    cluster = "pop-prod",
  }
}
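
With this configuration, series pushed via prometheus.remote_write should carry the external labels, which can be confirmed with a query of roughly this shape against Mimir (hypothetical example; it assumes the up metric is among the written series):

# series matching this should also carry customer, envtype, mimircluster and product
up{cluster="pop-prod"}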

When it is run, some synced rules have the external labels and some do not. In the screenshot from the alert grouping page, the alerts at the top are ungrouped despite coming from the same clusters as the grouped alerts. The labels customer, envtype, mimircluster, product and cluster have been added to the synced alerts in the bottom half but not to the ones at the top. I think this difference indicates either a bug or a divergence in behaviour that I can't find documented.

Screenshot 2023-09-08 at 12:41:31 (alert grouping page)

System information

EKS 1.24

Software version

v0.35.2 provided by the k8s-monitoring helm chart

Configuration

No response

Logs

No errors in logs

hainenber commented 1 year ago

Can you help check whether the metrics that trigger the external_labels-missing alerts also lack those labels? If not, could you add another prometheus.scrape component that sends the scraped metrics to the prometheus.remote_write one?

IMO, the "built-in" metrics are possibly being sent through the prometheus.remote_write block and hence inherit the external_labels. The labels you're seeing are probably coming from the triggering metrics, as described in Prometheus's alerting rules documentation.
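
Something along these lines, as a minimal sketch (the component name and target address are hypothetical placeholders):

// hypothetical scrape component; replace the target with your real endpoint
prometheus.scrape "example_app" {
  targets    = [{"__address__" = "example-app.default.svc:8080"}]
  forward_to = [prometheus.remote_write.grafana_cloud_prometheus.receiver]
}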

rknightion commented 1 year ago

@hainenber I've checked the underlying metrics that triggered both sets of alerts and they all seem to have all of the external_labels (which is one of the things confusing me: if the underlying metrics have the labels, I would have expected the alerts to have them as well).

rfratto commented 1 year ago

@rknightion Can you provide one of the alert definitions for one of the alerts where you do not see the expected external_labels?

If the alert rule is aggregating away the external_labels, then they wouldn't appear in the alerts.
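
For example (a hypothetical rule, not taken from your cluster), an expression like the one below aggregates away every label, so the resulting alert would carry none of the external_labels even though the underlying series have them:

# sum() without a "by" clause drops all labels, including cluster, customer, etc.
sum(rate(http_requests_total{status=~"5.."}[5m])) > 10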

github-actions[bot] commented 1 year ago

This issue has not had any activity in the past 30 days, so the needs-attention label has been added to it. If the opened issue is a bug, check to see if a newer release fixed your issue. If it is no longer relevant, please feel free to close this issue. The needs-attention label signals to maintainers that something has fallen through the cracks. No action is needed by you; your issue will be kept open and you do not have to respond to this comment. The label will be removed the next time this job runs if there is new activity. Thank you for your contributions!

rfratto commented 6 months ago

Hi there :wave:

On April 9, 2024, Grafana Labs announced Grafana Alloy, the spiritual successor to Grafana Agent and the final form of Grafana Agent flow mode. As a result, Grafana Agent has been deprecated and will only receive bug and security fixes until its end-of-life around November 1, 2025.

To make things easier for maintainers, we're in the process of migrating all issues tagged variant/flow to the Grafana Alloy repository to have a single home for tracking issues. This issue is likely something we'll want to address in both Grafana Alloy and Grafana Agent, so just because it's being moved doesn't mean we won't address the issue in Grafana Agent :)