grafana / mimir

Grafana Mimir provides horizontally scalable, highly available, multi-tenant, long-term storage for Prometheus.
https://grafana.com/oss/mimir/
GNU Affero General Public License v3.0

Config is not imported Prometheus/Alertmanager #9085

Open flowramps opened 3 weeks ago

flowramps commented 3 weeks ago

Describe the bug

I can't load the configuration for the Mimir Alertmanager. According to the documentation, it should be loaded with mimirtool alertmanager load.

To reproduce

Steps to reproduce the behavior:

  1. I port-forwarded the service - kubectl port-forward svc/mimir-dev-alertmanager-headless 8080:8080 -n mimir-distributed-dev
  2. Commands used for reproduction,

Doc - https://grafana.com/docs/mimir/latest/references/architecture/components/alertmanager/

mimirtool alertmanager get --address=http://127.0.0.1:8080/alertmanager --id="anonymous"
mimirtool alertmanager get --address=http://127.0.0.1:8080/alertmanager --id="eks-dev"
mimirtool alertmanager get --address=http://127.0.0.1:8080/alertmanager --id="eks-exemplo"

mimirtool alertmanager get --address=http://127.0.0.1:8080/alertmanager/api/v2/alerts --id="anonymous"
mimirtool alertmanager get --address=http://127.0.0.1:8080/alertmanager/api/v2/alerts --id="eks-dev"
mimirtool alertmanager get --address=http://127.0.0.1:8080/alertmanager/api/v2/alerts --id="eks-exemplo"

mimirtool alertmanager get --address=http://127.0.0.1:8080/alertmanager/api/v2/rules --id="anonymous"
mimirtool alertmanager get --address=http://127.0.0.1:8080/alertmanager/api/v2/rules --id="eks-dev"
mimirtool alertmanager get --address=http://127.0.0.1:8080/alertmanager/api/v2/rules --id="eks-exemplo"
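Note that the commands above append the UI path (/alertmanager) and API paths (/api/v2/...) to --address. As far as I can tell from the mimirtool documentation, --address should be the base URL of the service and mimirtool adds the API paths itself, so a working invocation would look more like this sketch (port and tenant ID taken from the steps above):

```shell
# Sketch, not verified against this cluster: point --address at the
# service root; do not append /alertmanager or /api/v2/... paths.
mimirtool alertmanager get \
  --address=http://127.0.0.1:8080 \
  --id=anonymous
```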


  3. How do I find out which tenant IDs exist?
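On the tenant question: if multi-tenancy is not enabled, everything is stored under the synthetic tenant "anonymous". One way to list tenants that actually hold data (an assumption on my part, based on the "user" label that many Mimir ingester metrics carry) is to query through the read path:

```shell
# Hypothetical sketch: list tenant IDs via the "user" label on ingester
# metrics. Assumes a port-forward to the query-frontend on 8081, e.g.:
#   kubectl port-forward svc/mimir-dev-query-frontend 8081:8080 -n mimir-distributed-dev
# (the service name is a guess based on the other service names here)
curl -s -H 'X-Scope-OrgID: anonymous' \
  --data-urlencode 'query=count by (user) (cortex_ingester_active_series)' \
  'http://127.0.0.1:8081/prometheus/api/v1/query'
```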

Expected behavior

mimirtool alertmanager get should return the current configuration, and mimirtool alertmanager load should apply a new one.

Environment

Additional Context

  1. How can I load the configuration?

  2. How can I list the existing tenants?

  3. The rule groups defined in values.yaml are not imported:

 prometheusRule:
    annotations: {}
    enabled: true
    groups:
    - name: mimir_dev_alerts
      rules:
      - alert: MimirIngesterUnhealthy
        annotations:
          message: Mimir cluster {{ $labels.cluster }}/{{ $labels.namespace }} has
            {{ printf "%f" $value }} unhealthy ingester(s).
          runbook_url: https://grafana.com/docs/mimir/latest/operators-guide/mimir-runbooks/#mimiringesterunhealthy
        expr: |
          min by (cluster, namespace) (cortex_ring_members{state="Unhealthy", name="ingester"}) > 0
        for: 15m
        labels:
          severity: critical
      - alert: MimirRequestErrors
        annotations:
          message: |
            The route {{ $labels.route }} in {{ $labels.cluster }}/{{ $labels.namespace }} is experiencing {{ printf "%.2f" $value }}% errors.
          runbook_url: https://grafana.com/docs/mimir/latest/operators-guide/mimir-runbooks/#mimirrequesterrors
        expr: |
          # The following 5xx errors considered as non-error:
          # - 529: used by distributor rate limiting (using 529 instead of 429 to let the client retry)
          # - 598: used by GEM gateway when the client is very slow to send the request and the gateway times out reading the request body
          (
            sum by (cluster, namespace, job, route) (rate(cortex_request_duration_seconds_count{status_code=~"5..",status_code!~"529|598",route!~"ready|debug_pprof"}[1m]))
            /
            sum by (cluster, namespace, job, route) (rate(cortex_request_duration_seconds_count{route!~"ready|debug_pprof"}[1m]))
          ) * 100 > 1
        for: 15m
        labels:
          severity: critical
      - alert: MimirRequestLatency
        annotations:
          message: |
            {{ $labels.job }} {{ $labels.route }} is experiencing {{ printf "%.2f" $value }}s 99th percentile latency.
          runbook_url: https://grafana.com/docs/mimir/latest/operators-guide/mimir-runbooks/#mimirrequestlatency
        expr: |
          cluster_namespace_job_route:cortex_request_duration_seconds:99quantile{route!~"metrics|/frontend.Frontend/Process|ready|/schedulerpb.SchedulerForFrontend/FrontendLoop|/schedulerpb.SchedulerForQuerier/QuerierLoop|debug_pprof"}
             >
          2.5
        for: 15m
        labels:
          severity: warning
      - alert: MimirInconsistentRuntimeConfig
        annotations:
          message: |
            An inconsistent runtime config file is used across cluster {{ $labels.cluster }}/{{ $labels.namespace }}.
          runbook_url: https://grafana.com/docs/mimir/latest/operators-guide/mimir-runbooks/#mimirinconsistentruntimeconfig
        expr: |
          count(count by(cluster, namespace, job, sha256) (cortex_runtime_config_hash)) without(sha256) > 1
        for: 1h
        labels:
          severity: critical
      - alert: MimirBadRuntimeConfig
        annotations:
          message: |
            {{ $labels.job }} failed to reload runtime config.
          runbook_url: https://grafana.com/docs/mimir/latest/operators-guide/mimir-runbooks/#mimirbadruntimeconfig
        expr: |
          # The metric value is reset to 0 on error while reloading the config at runtime.
          cortex_runtime_config_last_reload_successful == 0
        for: 5m
        labels:
          severity: critical
      - alert: MimirFrontendQueriesStuck
        annotations:
          message: |
            There are {{ $value }} queued up queries in {{ $labels.cluster }}/{{ $labels.namespace }} {{ $labels.job }}.
          runbook_url: https://grafana.com/docs/mimir/latest/operators-guide/mimir-runbooks/#mimirfrontendqueriesstuck
        expr: |
          sum by (cluster, namespace, job) (min_over_time(cortex_query_frontend_queue_length[1m])) > 0
        for: 5m
        labels:
          severity: critical
      - alert: MimirSchedulerQueriesStuck
        annotations:
          message: |
            There are {{ $value }} queued up queries in {{ $labels.cluster }}/{{ $labels.namespace }} {{ $labels.job }}.
          runbook_url: https://grafana.com/docs/mimir/latest/operators-guide/mimir-runbooks/#mimirschedulerqueriesstuck
        expr: |
          sum by (cluster, namespace, job) (min_over_time(cortex_query_scheduler_queue_length[1m])) > 0
        for: 7m
        labels:
          severity: critical
      - alert: MimirCacheRequestErrors
        annotations:
          message: |
            The cache {{ $labels.name }} used by Mimir {{ $labels.cluster }}/{{ $labels.namespace }} is experiencing {{ printf "%.2f" $value }}% errors for {{ $labels.operation }} operation.
          runbook_url: https://grafana.com/docs/mimir/latest/operators-guide/mimir-runbooks/#mimircacherequesterrors
        expr: |
          (
            sum by(cluster, namespace, name, operation) (
              rate(thanos_memcached_operation_failures_total[1m])
              or
              rate(thanos_cache_operation_failures_total[1m])
            )
            /
            sum by(cluster, namespace, name, operation) (
              rate(thanos_memcached_operations_total[1m])
              or
              rate(thanos_cache_operations_total[1m])
            )
          ) * 100 > 5
        for: 5m
        labels:
          severity: warning
      - alert: MimirIngesterRestarts
        annotations:
          message: Mimir {{ $labels.pod }} in {{ $labels.cluster }}/{{ $labels.namespace
            }} has restarted {{ printf "%.2f" $value }} times in the last 30 mins.
          runbook_url: https://grafana.com/docs/mimir/latest/operators-guide/mimir-runbooks/#mimiringesterrestarts
        expr: |
          (
            sum by(cluster, namespace, pod) (
              increase(kube_pod_container_status_restarts_total{container=~"(ingester|mimir-write)"}[30m])
            )
            >= 2
          )
          and
          (
            count by(cluster, namespace, pod) (cortex_build_info) > 0
          )
        labels:
          severity: warning
      - alert: MimirKVStoreFailure
        annotations:
          message: |
            Mimir {{ $labels.pod }} in  {{ $labels.cluster }}/{{ $labels.namespace }} is failing to talk to the KV store {{ $labels.kv_name }}.
          runbook_url: https://grafana.com/docs/mimir/latest/operators-guide/mimir-runbooks/#mimirkvstorefailure
        expr: |
          (
            sum by(cluster, namespace, pod, status_code, kv_name) (rate(cortex_kv_request_duration_seconds_count{status_code!~"2.+"}[1m]))
            /
            sum by(cluster, namespace, pod, status_code, kv_name) (rate(cortex_kv_request_duration_seconds_count[1m]))
          )
          # We want to get alerted only in case there's a constant failure.
          == 1
        for: 5m
        labels:
          severity: critical
      - alert: MimirMemoryMapAreasTooHigh
        annotations:
          message: '{{ $labels.job }}/{{ $labels.pod }} has a number of mmap-ed areas
            close to the limit.'
          runbook_url: https://grafana.com/docs/mimir/latest/operators-guide/mimir-runbooks/#mimirmemorymapareastoohigh
        expr: |
          process_memory_map_areas{job=~".*/(ingester.*|cortex|mimir|mimir-write.*|store-gateway.*|cortex|mimir|mimir-backend.*)"} / process_memory_map_areas_limit{job=~".*/(ingester.*|cortex|mimir|mimir-write.*|store-gateway.*|cortex|mimir|mimir-backend.*)"} > 0.8
        for: 5m
        labels:
          severity: critical
      - alert: MimirIngesterInstanceHasNoTenants
        annotations:
          message: Mimir ingester {{ $labels.pod }} in {{ $labels.cluster }}/{{ $labels.namespace
            }} has no tenants assigned.
          runbook_url: https://grafana.com/docs/mimir/latest/operators-guide/mimir-runbooks/#mimiringesterinstancehasnotenants
        expr: |
          (min by(cluster, namespace, pod) (cortex_ingester_memory_users) == 0)
          and on (cluster, namespace)
          # Only if there are more timeseries than would be expected due to continuous testing load
          (
            ( # Classic storage timeseries
              sum by(cluster, namespace) (cortex_ingester_memory_series)
              /
              max by(cluster, namespace) (cortex_distributor_replication_factor)
            )
            or
            ( # Ingest storage timeseries
              sum by(cluster, namespace) (
                max by(ingester_id, cluster, namespace) (
                  label_replace(cortex_ingester_memory_series,
                    "ingester_id", "$1",
                    "pod", ".*-([0-9]+)$"
                  )
                )
              )
            )
          ) > 100000
        for: 1h
        labels:
          severity: warning
      - alert: MimirRulerInstanceHasNoRuleGroups
        annotations:
          message: Mimir ruler {{ $labels.pod }} in {{ $labels.cluster }}/{{ $labels.namespace
            }} has no rule groups assigned.
          runbook_url: https://grafana.com/docs/mimir/latest/operators-guide/mimir-runbooks/#mimirrulerinstancehasnorulegroups
        expr: |
          # Alert on ruler instances in microservices mode that have no rule groups assigned,
          min by(cluster, namespace, pod) (cortex_ruler_managers_total{pod=~"(.*mimir-)?ruler.*"}) == 0
          # but only if other ruler instances of the same cell do have rule groups assigned
          and on (cluster, namespace)
          (max by(cluster, namespace) (cortex_ruler_managers_total) > 0)
          # and there are more than two instances overall
          and on (cluster, namespace)
          (count by (cluster, namespace) (cortex_ruler_managers_total) > 2)
        for: 1h
        labels:
          severity: warning
      - alert: MimirIngestedDataTooFarInTheFuture
        annotations:
          message: Mimir ingester {{ $labels.pod }} in {{ $labels.cluster }}/{{ $labels.namespace
            }} has ingested samples with timestamps more than 1h in the future.
          runbook_url: https://grafana.com/docs/mimir/latest/operators-guide/mimir-runbooks/#mimiringesteddatatoofarinthefuture
        expr: |
          max by(cluster, namespace, pod) (
              cortex_ingester_tsdb_head_max_timestamp_seconds - time()
              and
              cortex_ingester_tsdb_head_max_timestamp_seconds > 0
          ) > 60*60
        for: 5m
        labels:
          severity: warning
      - alert: MimirStoreGatewayTooManyFailedOperations
        annotations:
          message: Mimir store-gateway in {{ $labels.cluster }}/{{ $labels.namespace
            }} is experiencing {{ $value | humanizePercentage }} errors while doing
            {{ $labels.operation }} on the object storage.
          runbook_url: https://grafana.com/docs/mimir/latest/operators-guide/mimir-runbooks/#mimirstoregatewaytoomanyfailedoperations
        expr: |
          sum by(cluster, namespace, operation) (rate(thanos_objstore_bucket_operation_failures_total{component="store-gateway"}[1m])) > 0
        for: 5m
        labels:
          severity: warning
      - alert: MimirRingMembersMismatch
        annotations:
          message: |
            Number of members in Mimir ingester hash ring does not match the expected number in {{ $labels.cluster }}/{{ $labels.namespace }}.
          runbook_url: https://grafana.com/docs/mimir/latest/operators-guide/mimir-runbooks/#mimirringmembersmismatch
        expr: |
          (
            avg by(cluster, namespace) (sum by(cluster, namespace, pod) (cortex_ring_members{name="ingester",job=~".*/(ingester.*|cortex|mimir|mimir-write.*)",job!~".*/(ingester.*-partition)"}))
            != sum by(cluster, namespace) (up{job=~".*/(ingester.*|cortex|mimir|mimir-write.*)",job!~".*/(ingester.*-partition)"})
          )
          and
          (
            count by(cluster, namespace) (cortex_build_info) > 0
          )
        for: 15m
        labels:
          component: ingester
          severity: warning
    - name: mimir_dev_instance_limits_alerts
      rules:
      - alert: MimirIngesterReachingSeriesLimit
        annotations:
          message: |
            Ingester {{ $labels.job }}/{{ $labels.pod }} has reached {{ $value | humanizePercentage }} of its series limit.
          runbook_url: https://grafana.com/docs/mimir/latest/operators-guide/mimir-runbooks/#mimiringesterreachingserieslimit
        expr: |
          (
              (cortex_ingester_memory_series / ignoring(limit) cortex_ingester_instance_limits{limit="max_series"})
              and ignoring (limit)
              (cortex_ingester_instance_limits{limit="max_series"} > 0)
          ) > 0.8
        for: 3h
        labels:
          severity: warning
      - alert: MimirIngesterReachingSeriesLimit
        annotations:
          message: |
            Ingester {{ $labels.job }}/{{ $labels.pod }} has reached {{ $value | humanizePercentage }} of its series limit.
          runbook_url: https://grafana.com/docs/mimir/latest/operators-guide/mimir-runbooks/#mimiringesterreachingserieslimit
        expr: |
          (
              (cortex_ingester_memory_series / ignoring(limit) cortex_ingester_instance_limits{limit="max_series"})
              and ignoring (limit)
              (cortex_ingester_instance_limits{limit="max_series"} > 0)
          ) > 0.9
        for: 5m
        labels:
          severity: critical
      - alert: MimirIngesterReachingTenantsLimit
        annotations:
          message: |
            Ingester {{ $labels.job }}/{{ $labels.pod }} has reached {{ $value | humanizePercentage }} of its tenant limit.
          runbook_url: https://grafana.com/docs/mimir/latest/operators-guide/mimir-runbooks/#mimiringesterreachingtenantslimit
        expr: |
          (
              (cortex_ingester_memory_users / ignoring(limit) cortex_ingester_instance_limits{limit="max_tenants"})
              and ignoring (limit)
              (cortex_ingester_instance_limits{limit="max_tenants"} > 0)
          ) > 0.7
        for: 5m
        labels:
          severity: warning
      - alert: MimirIngesterReachingTenantsLimit
        annotations:
          message: |
            Ingester {{ $labels.job }}/{{ $labels.pod }} has reached {{ $value | humanizePercentage }} of its tenant limit.
          runbook_url: https://grafana.com/docs/mimir/latest/operators-guide/mimir-runbooks/#mimiringesterreachingtenantslimit
        expr: |
          (
              (cortex_ingester_memory_users / ignoring(limit) cortex_ingester_instance_limits{limit="max_tenants"})
              and ignoring (limit)
              (cortex_ingester_instance_limits{limit="max_tenants"} > 0)
          ) > 0.8
        for: 5m
        labels:
          severity: critical
      - alert: MimirReachingTCPConnectionsLimit
        annotations:
          message: |
            Mimir instance {{ $labels.job }}/{{ $labels.pod }} has reached {{ $value | humanizePercentage }} of its TCP connections limit for {{ $labels.protocol }} protocol.
          runbook_url: https://grafana.com/docs/mimir/latest/operators-guide/mimir-runbooks/#mimirreachingtcpconnectionslimit
        expr: |
          cortex_tcp_connections / cortex_tcp_connections_limit > 0.8 and
          cortex_tcp_connections_limit > 0
        for: 5m
        labels:
          severity: critical
      - alert: MimirDistributorReachingInflightPushRequestLimit
        annotations:
          message: |
            Distributor {{ $labels.job }}/{{ $labels.pod }} has reached {{ $value | humanizePercentage }} of its inflight push request limit.
          runbook_url: https://grafana.com/docs/mimir/latest/operators-guide/mimir-runbooks/#mimirdistributorreachinginflightpushrequestlimit
        expr: |
          (
              (cortex_distributor_inflight_push_requests / ignoring(limit) cortex_distributor_instance_limits{limit="max_inflight_push_requests"})
              and ignoring (limit)
              (cortex_distributor_instance_limits{limit="max_inflight_push_requests"} > 0)
          ) > 0.8
        for: 5m
        labels:
          severity: critical
    - name: mimir_dev-rollout-alerts
      rules:
      - alert: MimirRolloutStuck
        annotations:
          message: |
            The {{ $labels.rollout_group }} rollout is stuck in {{ $labels.cluster }}/{{ $labels.namespace }}.
          runbook_url: https://grafana.com/docs/mimir/latest/operators-guide/mimir-runbooks/#mimirrolloutstuck
        expr: |
          (
            max without (revision) (
              sum without(statefulset) (label_replace(kube_statefulset_status_current_revision, "rollout_group", "$1", "statefulset", "(.*?)(?:-zone-[a-z])?"))
                unless
              sum without(statefulset) (label_replace(kube_statefulset_status_update_revision, "rollout_group", "$1", "statefulset", "(.*?)(?:-zone-[a-z])?"))
            )
              *
            (
              sum without(statefulset) (label_replace(kube_statefulset_replicas, "rollout_group", "$1", "statefulset", "(.*?)(?:-zone-[a-z])?"))
                !=
              sum without(statefulset) (label_replace(kube_statefulset_status_replicas_updated, "rollout_group", "$1", "statefulset", "(.*?)(?:-zone-[a-z])?"))
            )
          ) and (
            changes(sum without(statefulset) (label_replace(kube_statefulset_status_replicas_updated, "rollout_group", "$1", "statefulset", "(.*?)(?:-zone-[a-z])?"))[15m:1m])
              ==
            0
          )
          * on(cluster, namespace) group_left max by(cluster, namespace) (cortex_build_info)
        for: 30m
        labels:
          severity: warning
          workload_type: statefulset
      - alert: MimirRolloutStuck
        annotations:
          message: |
            The {{ $labels.rollout_group }} rollout is stuck in {{ $labels.cluster }}/{{ $labels.namespace }}.
          runbook_url: https://grafana.com/docs/mimir/latest/operators-guide/mimir-runbooks/#mimirrolloutstuck
        expr: |
          (
            sum without(deployment) (label_replace(kube_deployment_spec_replicas, "rollout_group", "$1", "deployment", "(.*?)(?:-zone-[a-z])?"))
              !=
            sum without(deployment) (label_replace(kube_deployment_status_replicas_updated, "rollout_group", "$1", "deployment", "(.*?)(?:-zone-[a-z])?"))
          ) and (
            changes(sum without(deployment) (label_replace(kube_deployment_status_replicas_updated, "rollout_group", "$1", "deployment", "(.*?)(?:-zone-[a-z])?"))[15m:1m])
              ==
            0
          )
          * on(cluster, namespace) group_left max by(cluster, namespace) (cortex_build_info)
        for: 30m
        labels:
          severity: warning
          workload_type: deployment
      - alert: RolloutOperatorNotReconciling
        annotations:
          message: |
            Rollout operator is not reconciling the rollout group {{ $labels.rollout_group }} in {{ $labels.cluster }}/{{ $labels.namespace }}.
          runbook_url: https://grafana.com/docs/mimir/latest/operators-guide/mimir-runbooks/#rolloutoperatornotreconciling
        expr: |
          max by(cluster, namespace, rollout_group) (time() - rollout_operator_last_successful_group_reconcile_timestamp_seconds) > 600
        for: 5m
        labels:
          severity: critical
    - name: mimir_dev-provisioning
      rules:
      - alert: MimirAllocatingTooMuchMemory
        annotations:
          message: |
            Instance {{ $labels.pod }} in {{ $labels.cluster }}/{{ $labels.namespace }} is using too much memory.
          runbook_url: https://grafana.com/docs/mimir/latest/operators-guide/mimir-runbooks/#mimirallocatingtoomuchmemory
        expr: |
          (
            # We use RSS instead of working set memory because of the ingester's extensive usage of mmap.
            # See: https://github.com/grafana/mimir/issues/2466
            container_memory_rss{container=~"(ingester|mimir-write|mimir-backend)"}
              /
            ( container_spec_memory_limit_bytes{container=~"(ingester|mimir-write|mimir-backend)"} > 0 )
          )
          # Match only Mimir namespaces.
          * on(cluster, namespace) group_left max by(cluster, namespace) (cortex_build_info)
          > 0.65
        for: 15m
        labels:
          severity: warning
      - alert: MimirAllocatingTooMuchMemory
        annotations:
          message: |
            Instance {{ $labels.pod }} in {{ $labels.cluster }}/{{ $labels.namespace }} is using too much memory.
          runbook_url: https://grafana.com/docs/mimir/latest/operators-guide/mimir-runbooks/#mimirallocatingtoomuchmemory
        expr: |
          (
            # We use RSS instead of working set memory because of the ingester's extensive usage of mmap.
            # See: https://github.com/grafana/mimir/issues/2466
            container_memory_rss{container=~"(ingester|mimir-write|mimir-backend)"}
              /
            ( container_spec_memory_limit_bytes{container=~"(ingester|mimir-write|mimir-backend)"} > 0 )
          )
          # Match only Mimir namespaces.
          * on(cluster, namespace) group_left max by(cluster, namespace) (cortex_build_info)
          > 0.8
        for: 15m
        labels:
          severity: critical
    - name: ruler_alerts
      rules:
    #labels:
    #  release: prometheus
    mimirAlerts: true
    mimirRules: true
    namespace: mimir-distributed-dev
  serviceMonitor:
    enabled: true
    #labels:
    #  release: prometheus
metadata-cache:
  enabled: true
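If this prometheusRule section follows the mimir-distributed chart's metamonitoring convention (my assumption), it renders PrometheusRule custom resources for the Prometheus Operator; nothing in it is pushed into Mimir's own ruler. Rule groups have to be loaded into the ruler separately, along these lines:

```shell
# Sketch: rule groups are not applied to the ruler from values.yaml;
# load them explicitly. Run the port-forward in a separate terminal.
kubectl port-forward svc/mimir-dev-ruler 8080:8080 -n mimir-distributed-dev
mimirtool rules load rules.yaml --address=http://127.0.0.1:8080 --id=anonymous
```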
flowramps commented 2 weeks ago

Updates

I was able to load rules and alerts with the following commands.

  1. Load the rules into the ruler (port-forward in one terminal, load in another):

kubectl port-forward svc/mimir-dev-ruler 8080:8080 -n mimir-distributed-dev
mimirtool rules load rules.yaml --address=http://127.0.0.1:8080/ --id="anonymous"


  2. Load the Alertmanager configuration:

kubectl port-forward svc/mimir-dev-alertmanager-headless 8080:8080 -n mimir-distributed-dev
mimirtool alertmanager load alertmanager-config.yaml alerts.yaml --address=http://127.0.0.1:8080/ --id="anonymous"

Attention!

I can access the API and see the rules I created, and one of them has an active alert, but the alerts never arrive in my Alertmanager!
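One thing worth checking for the alerts-not-arriving symptom: the ruler only forwards firing alerts to the Alertmanager URL it is configured with. A sketch, assuming the mimir-distributed Helm chart and the release/service names seen above (all of these values are assumptions to adapt):

```shell
# Assumption: point the ruler at the built-in multi-tenant Alertmanager
# via the chart's structuredConfig; release name and URL are guesses.
helm upgrade mimir-dev grafana/mimir-distributed -n mimir-distributed-dev \
  --reuse-values \
  --set-string mimir.structuredConfig.ruler.alertmanager_url=http://mimir-dev-alertmanager-headless.mimir-distributed-dev.svc:8080/alertmanager
```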


mimirtool alertmanager load alertmanager-config.yaml alerts.yaml

  1. Another point I couldn't figure out: how can I validate the configuration after applying the alerts.yaml file?

  2. I can view the alertmanager-config.yaml file in the config UI and see that it was applied.
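On validating after applying: the load commands have read counterparts, so one way to confirm what the cluster actually stored (same port-forwards and tenant ID as above):

```shell
# Read back the stored Alertmanager config and templates for a tenant.
mimirtool alertmanager get --address=http://127.0.0.1:8080 --id=anonymous
# List the rule groups the ruler holds for the same tenant.
mimirtool rules list --address=http://127.0.0.1:8080 --id=anonymous
```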


If anyone knows how I should proceed, I would be very grateful; it would complete my configuration and my understanding of the ecosystem.