grafana / alloy

OpenTelemetry Collector distribution with programmable pipelines
https://grafana.com/oss/alloy
Apache License 2.0

mimir.rules.kubernetes keeps deleting and re-creating rules on Grafana Cloud #200

Open Lp-Francois opened 6 months ago

Lp-Francois commented 6 months ago

What's wrong?

I am using the mimir.rules.kubernetes block with the latest version of Grafana Agent (Flow mode), docker.io/grafana/agent:v0.40.3.

It uploads my PrometheusRule to the remote Mimir instance on Grafana Cloud, but in the UI I can see my alerts constantly being deleted and then recreated, alternating between states. Here are 3 screenshots:

[Screenshots taken 2024-03-29 at 08:53:05, 08:53:29, and 08:54:31, showing the alert rules alternating between present and deleted in the Grafana Cloud UI]

Steps to reproduce

  1. Install the agent using Helm in a Kubernetes cluster
  2. Use this in the values.yaml:
        extraConfig: |-
          // documentation: https://grafana.com/docs/agent/latest/flow/reference/components/mimir.rules.kubernetes/
          mimir.rules.kubernetes "default" {
            // the secret needs to be referenced by a remote.kubernetes.secret block (done by the config in externalServices)
            address = nonsensitive(remote.kubernetes.secret.logs_service.data["MIMIR_ADDRESS"])
            basic_auth {
              username = nonsensitive(remote.kubernetes.secret.logs_service.data["MIMIR_TENANT_ID"])
              password = remote.kubernetes.secret.logs_service.data["MIMIR_API_KEY"]
            }
          }
  3. Add a PrometheusRule containing several alerts in a newly created namespace:
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  labels:
    role: alert-rules
  name: my-api-prometheus
  namespace: pr-xxxx-yyyy
spec:
  groups:
    - name: alerts-my-api
      rules:
        - alert: BlackboxProbeFailed
          annotations:
            description: Service my-api is down for more than 2 minutes.
            summary: my-api API is down!
          expr: probe_success{service="my-api"} == 0
          for: 2m
          labels:
            service: my-api
            severity: warning
        - alert: KubernetesPodCrashLooping
          annotations:
            description: Pod {{ $labels.namespace }}/{{ $labels.pod }} is crash looping.
            summary: Kubernetes pod crash looping (instance {{ $labels.instance }})
          expr: |-
            increase(
              kube_pod_container_status_restarts_total{pod=~"my-api.*", namespace="pr-xxxx-yyyy"}[1m]
            ) > 3
          for: 2m
          labels:
            service: my-api
            severity: warning

System information

Agent is running on Linux amd64 t3a.medium (AWS - EKS)

Software version

agent:v0.40.3

Configuration

No response

Logs

ts=2024-03-29T08:43:00.958144643Z level=info msg="finished node evaluation" controller_id="" node_id=prometheus.scrape.kubelet duration=319.949µs
ts=2024-03-29T08:43:00.95898935Z level=info msg="finished node evaluation" controller_id="" node_id=discovery.relabel.cadvisor duration=2.616273ms
ts=2024-03-29T08:43:00.959479126Z level=info msg="finished node evaluation" controller_id="" node_id=prometheus.scrape.cadvisor duration=428.004µs
ts=2024-03-29T08:43:06.164666761Z level=info msg="added rule group" component=mimir.rules.kubernetes.default namespace=agent/pr-2646-unify-docker-postgres-in-1/authentication-prometheus/f71bd154-6cd1-4749-b25e-0b1a3eb5ecbf group=alerts-authentication
ts=2024-03-29T08:43:35.99228665Z level=info msg="rejoining peers" peers=10-0-13-9.grafana-k8s-monitoring-grafana-agent-cluster.monitoring.svc.cluster.local.:80,10-0-13-9.grafana-k8s-monitoring-grafana-agent-cluster.monitoring.svc.cluster.local.:80,10-0-13-9.grafana-k8s-monitoring-grafana-agent-cluster.monitoring.svc.cluster.local.:80,10-0-13-9.grafana-k8s-monitoring-grafana-agent-cluster.monitoring.svc.cluster.local.:80
ts=2024-03-29T08:43:36.092533519Z level=info msg="added rule group" component=mimir.rules.kubernetes.default namespace=agent/pr-2646-unify-docker-postgres-in-1/authentication-prometheus/f71bd154-6cd1-4749-b25e-0b1a3eb5ecbf group=alerts-authentication
ts=2024-03-29T08:44:06.137868933Z level=info msg="added rule group" component=mimir.rules.kubernetes.default namespace=agent/pr-2646-unify-docker-postgres-in-1/authentication-prometheus/f71bd154-6cd1-4749-b25e-0b1a3eb5ecbf group=alerts-authentication
ts=2024-03-29T08:44:35.992238997Z level=info msg="rejoining peers" peers=10-0-13-9.grafana-k8s-monitoring-grafana-agent-cluster.monitoring.svc.cluster.local.:80,10-0-13-9.grafana-k8s-monitoring-grafana-agent-cluster.monitoring.svc.cluster.local.:80,10-0-13-9.grafana-k8s-monitoring-grafana-agent-cluster.monitoring.svc.cluster.local.:80,10-0-13-9.grafana-k8s-monitoring-grafana-agent-cluster.monitoring.svc.cluster.local.:80
ts=2024-03-29T08:44:36.074239084Z level=info msg="added rule group" component=mimir.rules.kubernetes.default namespace=agent/pr-2646-unify-docker-postgres-in-1/authentication-prometheus/f71bd154-6cd1-4749-b25e-0b1a3eb5ecbf group=alerts-authentication
ts=2024-03-29T08:44:40.957206227Z level=info msg="finished node evaluation" controller_id="" node_id=discovery.relabel.kubelet duration=585.428µs
ts=2024-03-29T08:44:40.957614201Z level=info msg="finished node evaluation" controller_id="" node_id=prometheus.scrape.kubelet duration=280.44µs
ts=2024-03-29T08:44:40.958029334Z level=info msg="finished node evaluation" controller_id="" node_id=discovery.relabel.cadvisor duration=1.418136ms
ts=2024-03-29T08:44:40.958610323Z level=info msg="finished node evaluation" controller_id="" node_id=prometheus.scrape.cadvisor duration=475.486µs
ts=2024-03-29T08:44:46.237538954Z level=info msg="processing event" component=mimir.rules.kubernetes.default type=resource-changed key=pr-2646-unify-docker-postgres-in-1/authentication-prometheus
ts=2024-03-29T08:44:46.340554683Z level=info msg="updated rule group" component=mimir.rules.kubernetes.default namespace=agent/pr-2646-unify-docker-postgres-in-1/authentication-prometheus/f71bd154-6cd1-4749-b25e-0b1a3eb5ecbf group=alerts-authentication
ts=2024-03-29T08:45:06.096739738Z level=info msg="added rule group" component=mimir.rules.kubernetes.default namespace=agent/pr-2646-unify-docker-postgres-in-1/authentication-prometheus/f71bd154-6cd1-4749-b25e-0b1a3eb5ecbf group=alerts-authentication
ts=2024-03-29T08:45:35.992486801Z level=info msg="rejoining peers" peers=10-0-13-9.grafana-k8s-monitoring-grafana-agent-cluster.monitoring.svc.cluster.local.:80,10-0-13-9.grafana-k8s-monitoring-grafana-agent-cluster.monitoring.svc.cluster.local.:80,10-0-13-9.grafana-k8s-monitoring-grafana-agent-cluster.monitoring.svc.cluster.local.:80,10-0-13-9.grafana-k8s-monitoring-grafana-agent-cluster.monitoring.svc.cluster.local.:80
ts=2024-03-29T08:45:36.112461898Z level=info msg="added rule group" component=mimir.rules.kubernetes.default namespace=agent/pr-2646-unify-docker-postgres-in-1/authentication-prometheus/f71bd154-6cd1-4749-b25e-0b1a3eb5ecbf group=alerts-authentication
ts=2024-03-29T08:46:05.957054716Z level=info msg="finished node evaluation" controller_id="" node_id=discovery.relabel.kubelet duration=635.101µs
ts=2024-03-29T08:46:05.957656296Z level=info msg="finished node evaluation" controller_id="" node_id=prometheus.scrape.kubelet duration=501.917µs
ts=2024-03-29T08:46:05.958030929Z level=info msg="finished node evaluation" controller_id="" node_id=discovery.relabel.cadvisor duration=1.640294ms
ts=2024-03-29T08:46:05.958501184Z level=info msg="finished node evaluation" controller_id="" node_id=prometheus.scrape.cadvisor duration=265.988µs
ts=2024-03-29T08:46:06.100316322Z level=info msg="added rule group" component=mimir.rules.kubernetes.default namespace=agent/pr-2646-unify-docker-postgres-in-1/authentication-prometheus/f71bd154-6cd1-4749-b25e-0b1a3eb5ecbf group=alerts-authentication
ts=2024-03-29T08:46:35.992353439Z level=info msg="rejoining peers" peers=10-0-13-9.grafana-k8s-monitoring-grafana-agent-cluster.monitoring.svc.cluster.local.:80,10-0-13-9.grafana-k8s-monitoring-grafana-agent-cluster.monitoring.svc.cluster.local.:80,10-0-13-9.grafana-k8s-monitoring-grafana-agent-cluster.monitoring.svc.cluster.local.:80,10-0-13-9.grafana-k8s-monitoring-grafana-agent-cluster.monitoring.svc.cluster.local.:80
ts=2024-03-29T08:46:36.079171221Z level=info msg="added rule group" component=mimir.rules.kubernetes.default namespace=agent/pr-2646-unify-docker-postgres-in-1/authentication-prometheus/f71bd154-6cd1-4749-b25e-0b1a3eb5ecbf group=alerts-authentication
ts=2024-03-29T08:47:06.093045379Z level=info msg="added rule group" component=mimir.rules.kubernetes.default namespace=agent/pr-2646-unify-docker-postgres-in-1/authentication-prometheus/f71bd154-6cd1-4749-b25e-0b1a3eb5ecbf group=alerts-authentication
ts=2024-03-29T08:47:35.992322813Z level=info msg="rejoining peers" peers=10-0-13-9.grafana-k8s-monitoring-grafana-agent-cluster.monitoring.svc.cluster.local.:80,10-0-13-9.grafana-k8s-monitoring-grafana-agent-cluster.monitoring.svc.cluster.local.:80,10-0-13-9.grafana-k8s-monitoring-grafana-agent-cluster.monitoring.svc.cluster.local.:80,10-0-13-9.grafana-k8s-monitoring-grafana-agent-cluster.monitoring.svc.cluster.local.:80
ts=2024-03-29T08:47:36.107329395Z level=info msg="added rule group" component=mimir.rules.kubernetes.default namespace=agent/pr-2646-unify-docker-postgres-in-1/authentication-prometheus/f71bd154-6cd1-4749-b25e-0b1a3eb5ecbf group=alerts-authentication
rfratto commented 6 months ago

Hi there :wave:

On April 9, 2024, Grafana Labs announced Grafana Alloy, the spiritual successor to Grafana Agent and the final form of Grafana Agent Flow mode. As a result, Grafana Agent has been deprecated and will only be receiving bug and security fixes until its end-of-life around November 1, 2025.

To make things easier for maintainers, we're in the process of migrating all issues tagged variant/flow to the Grafana Alloy repository to have a single home for tracking issues. This issue is likely something we'll want to address in both Grafana Alloy and Grafana Agent, so just because it's being moved doesn't mean we won't address the issue in Grafana Agent :)

Lp-Francois commented 6 months ago

Okay thanks for your message @rfratto :)


juupas commented 4 months ago

We are observing this same thing happening with our Alloy and self-hosted Mimir.

rfratto commented 4 months ago

Sorry for the delay on an update. Clustered Alloy instances are usually the source of this issue: multiple Alloy instances end up fighting over which one should be writing the rules. With the 1.1 release of Alloy, mimir.rules.kubernetes is clustering-aware and avoids this issue:

Alloy version 1.1 and higher supports clustered mode in this component. When you use this component as part of a cluster of Alloy instances, only a single instance from the cluster will update rules using the Mimir API.

This fix will be backported to Grafana Agent in the near future.

If you are not using clustering, double-check that there aren't multiple Alloy instances running and synchronizing the same PrometheusRule resources with Mimir.
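
For reference, here is a minimal sketch of what a clustered deployment could look like with the grafana/alloy Helm chart. The key names (alloy.clustering.enabled, controller.type, controller.replicas) are assumptions about that chart's values layout, so verify them against the chart version you deploy; the point is only that on Alloy 1.1 and later, enabling clustering lets a single elected instance perform the rule sync even when several replicas run the same mimir.rules.kubernetes block.

alloy:
  clustering:
    # Assumed key: lets the replicas form a cluster so that only one of them
    # syncs rules through the Mimir ruler API (Alloy >= 1.1).
    enabled: true
controller:
  type: statefulset
  replicas: 3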

juupas commented 3 months ago

I'm not using clustering, and there is only one Alloy instance, with 2 separate mimir.rules.kubernetes components configured.

Alloy v1.1.1 in use.

The Mimir ruler logs contain entries like these:

ts=2024-06-17T12:52:08.732120906Z caller=ruler.go:564 level=info msg="syncing rules" reason=periodic
ts=2024-06-17T12:56:58.749657873Z caller=ruler.go:564 level=info msg="syncing rules" reason=api-change
ts=2024-06-17T12:56:58.952421482Z caller=mapper.go:166 level=info msg="updating rule file" file=/data/anonymous/alloy%2Fmonitoring%2Fmimir-alerts%2Fb0a4da42-9f74-4ff7-876c-5ee63ba12173
ts=2024-06-17T12:56:58.954207552Z caller=mapper.go:166 level=info msg="updating rule file" file=/data/anonymous/alloy%2Fmonitoring%2Fmimir-rules%2Fade2d112-54fe-4a69-865d-9a67eef2f6ad
ts=2024-06-17T12:57:08.749366526Z caller=ruler.go:564 level=info msg="syncing rules" reason=api-change
ts=2024-06-17T13:01:47.142018894Z caller=spanlogger.go:109 method=API.ListRules user=anonymous level=info msg="no rule groups found" userID=anonymous
ts=2024-06-17T13:01:58.749370709Z caller=ruler.go:564 level=info msg="syncing rules" reason=api-change
ts=2024-06-17T13:01:58.920564371Z caller=mapper.go:166 level=info msg="updating rule file" file=/data/anonymous/alloy%2Fmonitoring%2Floki-alerts%2Fc2cd8835-9964-4834-84bd-e01211dfb7c8
ts=2024-06-17T13:01:58.920796737Z caller=mapper.go:166 level=info msg="updating rule file" file=/data/anonymous/alloy%2Fmonitoring%2Floki-rules%2F269bcf08-7829-4efa-a45f-2fdefc2f37ac
ts=2024-06-17T13:02:57.830975946Z caller=ruler.go:564 level=info msg="syncing rules" reason=periodic
ts=2024-06-17T13:06:58.749411084Z caller=ruler.go:564 level=info msg="syncing rules" reason=api-change
ts=2024-06-17T13:06:58.942997428Z caller=mapper.go:166 level=info msg="updating rule file" file=/data/anonymous/alloy%2Fmonitoring%2Fmimir-alerts%2Fb0a4da42-9f74-4ff7-876c-5ee63ba12173
ts=2024-06-17T13:06:58.944327752Z caller=mapper.go:166 level=info msg="updating rule file" file=/data/anonymous/alloy%2Fmonitoring%2Fmimir-rules%2Fade2d112-54fe-4a69-865d-9a67eef2f6ad
ts=2024-06-17T13:11:58.749848971Z caller=ruler.go:564 level=info msg="syncing rules" reason=api-change
ts=2024-06-17T13:11:58.923523066Z caller=mapper.go:166 level=info msg="updating rule file" file=/data/anonymous/alloy%2Fmonitoring%2Floki-alerts%2Fc2cd8835-9964-4834-84bd-e01211dfb7c8
ts=2024-06-17T13:11:58.923779667Z caller=mapper.go:166 level=info msg="updating rule file" file=/data/anonymous/alloy%2Fmonitoring%2Floki-rules%2F269bcf08-7829-4efa-a45f-2fdefc2f37ac

iarlyy commented 3 months ago

@rfratto I am experiencing the same.

I disabled clustering and set the StatefulSet replicas to 1, but Alloy keeps recreating the rules:

ts=2024-07-01T14:27:14.081313222Z level=info msg="removed rule group" component_path=/ component_id=mimir.rules.kubernetes.grafana_mimir namespace=alloy/default/<redacted>/7cc51093-3400-4e49-bb15-910a5b0e2076 group=<redacted>
ts=2024-07-01T14:27:14.211113442Z level=info msg="added rule group" component_path=/ component_id=mimir.rules.kubernetes.grafana_mimir namespace=alloy/default/<redacted>/7cc51093-3400-4e49-bb15-910a5b0e2076 group=<redacted>

I am running 1.1.x:

alloy, version v1.1.0 (branch: HEAD, revision: cf46a1491)
  build user:       root@buildkitsandbox
  build date:       2024-05-14T21:07:39Z
  go version:       go1.22.3
  platform:         linux/amd64
  tags:             netgo,builtinassets,promtail_journal_enabled

iarlyy commented 3 months ago

Alright, I think I found what is causing this never-ending loop of rule recreation:

I have Alloy installed in multiple clusters and enabled mimir.rules.kubernetes in all of them, but they all communicate with a central Mimir ruler (single tenant).

I noticed that cluster A's Alloy deletes cluster B's rules and vice versa, and each instance then recreates only the rules that exist in its own local state.

https://github.com/grafana/alloy/blob/5d7b707eafe3096e1e477cda600fac8e976f4734/internal/component/loki/rules/kubernetes/events.go#L105

Is there a correct configuration for this setup when not using multiple tenants?

rfratto commented 3 months ago

@56quarters ^ Do the Mimir folks have any opinions about how this should be handled from clients?

56quarters commented 3 months ago

I believe the mimir_namespace_prefix option is intended to fix the case where you have multiple clusters, each with its own Alloy setup, talking to a single central Mimir.

In your case @iarlyy I think you'd want something like this:

Cluster A config:

mimir.rules.kubernetes "local" {
    address = "mimir:8080"
    tenant_id = "whatever"
    mimir_namespace_prefix = "alloy-a"
}

Cluster B config:

mimir.rules.kubernetes "local" {
    address = "mimir:8080"
    tenant_id = "whatever"
    mimir_namespace_prefix = "alloy-b"
}

This would ensure the Alloy instances (Alloys?) for each cluster are making changes to different sets of rules in Mimir.
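
Applied to the extraConfig from the original report, the change would look roughly like the sketch below. The prefix value is only an illustrative placeholder, and each cluster would use its own; the secret references are unchanged from the issue description.

extraConfig: |-
  mimir.rules.kubernetes "default" {
    address = nonsensitive(remote.kubernetes.secret.logs_service.data["MIMIR_ADDRESS"])
    // Pick a prefix unique to this cluster; "alloy-cluster-a" is a placeholder.
    mimir_namespace_prefix = "alloy-cluster-a"
    basic_auth {
      username = nonsensitive(remote.kubernetes.secret.logs_service.data["MIMIR_TENANT_ID"])
      password = remote.kubernetes.secret.logs_service.data["MIMIR_API_KEY"]
    }
  }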

iarlyy commented 3 months ago

@56quarters I figured that out yesterday, and it solved my issue :).

Thanks for looking into it.

juupas commented 1 month ago

As I already mentioned, I only have 1 Alloy instance running non-clustered, but adding a different "mimir_namespace_prefix" to each of the "mimir.rules.kubernetes" blocks gets rid of the constant delete/recreate cycle.

Thanks @56quarters for suggesting this!
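
For the single-instance, two-component setup described above, the same idea applies: give each mimir.rules.kubernetes block its own prefix so they manage disjoint rule namespaces in Mimir. A rough sketch, with placeholder addresses, tenant, and prefix values:

mimir.rules.kubernetes "first" {
  address   = "mimir:8080"
  tenant_id = "whatever"
  // A distinct prefix per component keeps one block from deleting the
  // rule groups written by the other.
  mimir_namespace_prefix = "alloy-first"
}

mimir.rules.kubernetes "second" {
  address   = "mimir:8080"
  tenant_id = "whatever"
  mimir_namespace_prefix = "alloy-second"
}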