grafana / loki

Like Prometheus, but for logs.
https://grafana.com/loki
GNU Affero General Public License v3.0

Loki alarms not triggered to alertmanager with Monolithic Installation and StorageType filesystem #10480

Open murand78 opened 1 year ago

murand78 commented 1 year ago

Describe the bug: Loki alarms are not triggered to Alertmanager with a monolithic installation and storage type filesystem.

To Reproduce: Install Loki 2.8 with Helm chart version 5.15.0 and the following values.yaml:

  global:
    dnsService: "rke2-coredns-rke2-coredns"

  loki:
    auth_enabled: false
    commonConfig:
      replication_factor: 1
    storage:
      type: filesystem
    rulerConfig:
      query_stats_enabled: true
      alertmanager_url: http://rancher-monitoring-alertmanager.cattle-monitoring-system:9093
      enable_alertmanager_v2: true
      enable_api: true
      ring:
        kvstore:
          store: inmemory
      rule_path: /tmp/scratch
      storage:
        local:
          directory: /rules
        type: local
  minio:
    enabled: false
  gateway:
    replicas: 1
  singleBinary:
    replicas: 1
    persistence:
      enabled: true
      size: 20Gi
    extraVolumes:
    - name: rules
      configMap:
        name: loki-alerting-rules
        defaultMode: 420
    extraVolumeMounts:
    - name: rules
      mountPath: /rules

ConfigMap configured for test alerts:

apiVersion: v1
kind: ConfigMap
metadata:
  name: loki-alerting-rules
data:
  loki-alerting-rules.yaml: |-
    groups:
      - name: cnpg
        rules:
          - alert: TestAlertOnLog
            annotations:
              title: "Test Alert On Log"
            expr: >-
              count_over_time( {namespace="loki"} |= "executing query" [1m]) > 5
            for: 5m
            labels:
              category: logs
              severity: critical

Expected behavior: An alert is sent to Alertmanager. The loki-logging-0 pod logs show that the query is executed correctly, but the call to Alertmanager is never made. This is a test query; exploring the Loki logs shows that the metric is always above the threshold, so the alert should always be firing. If I break the Alertmanager config with an invalid URL, no error message appears in the logs.

Switching the storage type to s3 (with MinIO) makes everything work: the alert is correctly sent to Alertmanager after updating the values with:

  loki:
    storage:
      type: s3
  minio:
    enabled: true

The installation in Simple Scalable deployment mode also works as expected.

Environment:

Screenshots, Promtail config, or terminal output:

Logs from loki-logging-0 with filesystem storage (alerts not firing):

level=info ts=2023-09-06T14:15:14.662010664Z caller=table.go:334 msg="finished handing over table loki_index_19605"
level=info ts=2023-09-06T14:15:14.662050108Z caller=table.go:318 msg="handing over indexes to shipper loki_index_19606"
level=info ts=2023-09-06T14:15:14.662057636Z caller=table.go:334 msg="finished handing over table loki_index_19606"
level=info ts=2023-09-06T14:15:29.15159034Z caller=engine.go:218 component=ruler org_id=..data msg="executing query" type=instant query="(count_over_time({namespace=\"loki\"} |= \"executing query\"[1m]) > 5)" query_hash=2653152980
level=info ts=2023-09-06T14:15:29.153498641Z caller=metrics.go:152 component=ruler org_id=..data latency=fast query="(count_over_time({namespace=\"loki\"} |= \"executing query\"[1m]) > 5)" query_hash=2653152980 query_type=metric range_type=instant length=0s start_delta=3.880611ms end_delta=3.880806ms step=0s duration=1.81196ms status=200 limit=0 returned_lines=0 throughput=8.4MB total_bytes=15kB lines_per_second=68986 total_lines=125 total_entries=0 store_chunks_download_time=0s queue_time=0s splits=0 shards=0 cache_chunk_req=0 cache_chunk_hit=0 cache_chunk_bytes_stored=0 cache_chunk_bytes_fetched=0 cache_chunk_download_time=0s cache_index_req=0 cache_index_hit=0 cache_index_download_time=0s cache_result_req=0 cache_result_hit=0 cache_result_download_time=0s
level=info ts=2023-09-06T14:15:52.358602204Z caller=engine.go:218 component=ruler org_id=..2023_09_06_12_52_05.466622597 msg="executing query" type=instant query="(count_over_time({namespace=\"loki\"} |= \"executing query\"[1m]) > 5)" query_hash=2653152980
level=info ts=2023-09-06T14:15:52.360213236Z caller=metrics.go:152 component=ruler org_id=..2023_09_06_12_52_05.466622597 latency=fast query="(count_over_time({namespace=\"loki\"} |= \"executing query\"[1m]) > 5)" query_hash=2653152980 query_type=metric range_type=instant length=0s start_delta=3.552265ms end_delta=3.552454ms step=0s duration=1.520008ms status=200 limit=0 returned_lines=0 throughput=10MB total_bytes=15kB lines_per_second=82236 total_lines=125 total_entries=0 store_chunks_download_time=0s queue_time=0s splits=0 shards=0 cache_chunk_req=0 cache_chunk_hit=0 cache_chunk_bytes_stored=0 cache_chunk_bytes_fetched=0 cache_chunk_download_time=0s cache_index_req=0 cache_index_hit=0 cache_index_download_time=0s cache_result_req=0 cache_result_hit=0 cache_result_download_time=0s
level=info ts=2023-09-06T14:16:14.645610146Z caller=table_manager.go:134 msg="uploading tables"
level=info ts=2023-09-06T14:16:14.645647555Z caller=index_set.go:86 msg="uploading table loki_index_19605"
level=info ts=2023-09-06T14:16:14.645655489Z caller=index_set.go:107 msg="finished uploading table loki_index_19605"

Logs from loki-logging-0 with s3 storage (alert firing):

level=info ts=2023-09-06T14:51:29.154485819Z caller=metrics.go:152 component=ruler org_id=..data latency=fast query="(count_over_time({namespace=\"loki\"} |= \"executing query\"[1m]) > 5)" query_hash=2653152980 query_type=metric range_type=instant length=0s start_delta=4.867371ms end_delta=4.867594ms step=0s duration=2.865717ms status=200 limit=0 returned_lines=0 throughput=172MB total_bytes=493kB lines_per_second=368145 total_lines=1055 total_entries=1 store_chunks_download_time=0s queue_time=0s splits=0 shards=0 cache_chunk_req=0 cache_chunk_hit=0 cache_chunk_bytes_stored=0 cache_chunk_bytes_fetched=0 cache_chunk_download_time=0s cache_index_req=0 cache_index_hit=0 cache_index_download_time=0s cache_result_req=0 cache_result_hit=0 cache_result_download_time=0s
level=info ts=2023-09-06T14:51:45.646006254Z caller=engine.go:218 component=ruler org_id=..2023_09_06_14_41_21.649995638 msg="executing query" type=instant query="(count_over_time({namespace=\"loki\"} |= \"executing query\"[1m]) > 5)" query_hash=2653152980
level=info ts=2023-09-06T14:51:45.652275694Z caller=metrics.go:152 component=ruler org_id=..2023_09_06_14_41_21.649995638 latency=fast query="(count_over_time({namespace=\"loki\"} |= \"executing query\"[1m]) > 5)" query_hash=2653152980 query_type=metric range_type=instant length=0s start_delta=7.762955ms end_delta=7.763146ms step=0s duration=6.170386ms status=200 limit=0 returned_lines=0 throughput=25MB total_bytes=153kB lines_per_second=65474 total_lines=404 total_entries=1 store_chunks_download_time=0s queue_time=0s splits=0 shards=0 cache_chunk_req=0 cache_chunk_hit=0 cache_chunk_bytes_stored=0 cache_chunk_bytes_fetched=0 cache_chunk_download_time=0s cache_index_req=0 cache_index_hit=0 cache_index_download_time=0s cache_result_req=0 cache_result_hit=0 cache_result_download_time=0s
level=info ts=2023-09-06T14:51:45.957641925Z caller=table_manager.go:134 msg="uploading tables"
level=info ts=2023-09-06T14:51:45.95767799Z caller=index_set.go:86 msg="uploading table loki_index_19606"
level=info ts=2023-09-06T14:51:45.957686445Z caller=index_set.go:107 msg="finished uploading table loki_index_19606"
level=info ts=2023-09-06T14:51:45.95769316Z caller=index_set.go:185 msg="cleaning up unwanted indexes from table loki_index_19606"

igor-borisoglebski commented 9 months ago

On monolithic setups, try this config:

values.yaml

loki:
  read:
    extraVolumeMounts:
      - name: rules
        mountPath: "/var/loki/rulestorage/fake"
    extraVolumes:
      - name: rules
        configMap:
          name: loki-alerting-rules

This will mount the data from the ConfigMap we created into the pods at /var/loki/rulestorage/fake in a file named rules.yaml. The reason fake is in the path is that this is the instance ID when running in single-tenancy mode. The Loki docs do not explain this, but it will not work without fake in the path.

Source: https://dbadbadba.com/blog/finocchiaro-loki
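
A possible adaptation of this suggestion to the single-binary values from the original report, offered as an untested sketch rather than a confirmed fix. It assumes the ruler's local rule storage expects files under `<directory>/<tenant ID>/`, and that the tenant ID is `fake` when `auth_enabled: false`. The `org_id=..data` and `org_id=..2023_09_06_...` values in the logs above are consistent with that assumption: they look like the hidden directories Kubernetes creates inside a ConfigMap mount being read as tenant names.

```yaml
# Untested sketch: only the mountPath differs from the values in the report.
loki:
  auth_enabled: false            # single-tenant mode; tenant ID assumed to be "fake"
  rulerConfig:
    storage:
      type: local
      local:
        directory: /rules        # ruler is assumed to read /rules/<tenant ID>/<file>
singleBinary:
  extraVolumes:
    - name: rules
      configMap:
        name: loki-alerting-rules
  extraVolumeMounts:
    - name: rules
      mountPath: /rules/fake     # assumption: mount one level down, in the tenant directory
```

With this layout the rule file from the ConfigMap would land at /rules/fake/loki-alerting-rules.yaml, so the only directory directly under /rules is the tenant directory.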