grafana / loki

Like Prometheus, but for logs.
https://grafana.com/loki
GNU Affero General Public License v3.0

enable_alertmanager_discovery doesn't work with Loki scalable Helm Chart and HA Alertmanager #6141

Open joe-alford opened 2 years ago

joe-alford commented 2 years ago

I initially posted this in the cortex repo, but they have declined ownership.

Describe the bug
Using Kubernetes: when using the enable_alertmanager_discovery (-ruler.alertmanager-discovery) flag, we see the following line of code get hit erroneously:

https://github.com/cortexproject/cortex/blob/cd786078a220ca0e6f9bcd510ed8170e457bc2f8/pkg/ruler/notifier.go#L110

This only happens when the URL is in the 'correct' SRV DNS format: in that case the URL is treated as an empty string and the check fails. If we pass in an 'invalid' URL, the check behaves as expected and reports the actual host.
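This looks like a consequence of how Go's net/url parses a scheme-less value: Host comes back empty, so the SRV-format check (which, judging by the error message, is roughly a substring test for "_tcp." on url.Host) sees "". A minimal sketch illustrating this, using the hostname from the config below; the exact check in notifier.go may differ:

package main

import (
	"fmt"
	"net/url"
	"strings"
)

func main() {
	// Scheme-less value, as set via alertmanager_url in the chart: net/url
	// puts everything into Path and leaves Host empty, so the ruler's
	// SRV-format check sees "" -- matching the (is "") in the error below.
	u, _ := url.Parse("_http-web._tcp.kube-prometheues-stack-kub-alertmanager.kube-prometheus-stack.svc.cluster.local")
	fmt.Printf("host=%q path=%q\n", u.Host, u.Path)

	// With a scheme prepended, Host is populated and a check along the
	// lines of strings.Contains(host, "_tcp.") would pass.
	u2, _ := url.Parse("dns://_http-web._tcp.kube-prometheues-stack-kub-alertmanager.kube-prometheus-stack.svc.cluster.local")
	fmt.Printf("host=%q srv-like=%v\n", u2.Host, strings.Contains(u2.Host, "_tcp."))
}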

To Reproduce
Steps to reproduce the behavior: deploy the HelmRelease file below and you will get the errors shown below.

With the URL in the correct SRV format, the following error is generated, but only on the read pod. The check being hit is https://github.com/cortexproject/cortex/blob/master/pkg/ruler/notifier.go#L110

kubectl describe helmreleases.helm.toolkit.fluxcd.io -n loki loki
Name:         loki
Namespace:    loki
Labels:       kustomize.toolkit.fluxcd.io/name=apps
              kustomize.toolkit.fluxcd.io/namespace=flux-system
...
        Ruler:
          alertmanager_url:               _http-web._tcp.kube-prometheues-stack-kub-alertmanager.kube-prometheus-stack.svc.cluster.local
          enable_alertmanager_discovery:  true
          enable_api:                     true
          rule_path:                      /tmp/scratch
          Storage:
            Local:
              Directory:  /rules
            Type:         local

which gives this error:

kubectl logs -n loki loki-loki-simple-scalable-read-0
level=info ts=2022-05-10T09:30:30.7647066Z caller=main.go:106 msg="Starting Loki" version="(version=2.5.0, branch=HEAD, revision=2d9d0ee23)"
level=info ts=2022-05-10T09:30:30.7652983Z caller=server.go:260 http=[::]:3100 grpc=[::]:9095 msg="server listening on addresses"
level=info ts=2022-05-10T09:30:30.7663564Z caller=memberlist_client.go:394 msg="Using memberlist cluster node name" name=loki-loki-simple-scalable-read-0-b9f14106
level=warn ts=2022-05-10T09:30:30.7688221Z caller=experimental.go:20 msg="experimental feature in use" feature="In-memory (FIFO) cache"
level=info ts=2022-05-10T09:30:30.7690877Z caller=table_manager.go:358 msg="loading local table index_19121"
level=info ts=2022-05-10T09:30:30.7692321Z caller=table_manager.go:358 msg="loading local table index_19122"
level=info ts=2022-05-10T09:30:30.7693359Z caller=shipper_index_client.go:111 msg="starting boltdb shipper in 1 mode"
level=info ts=2022-05-10T09:30:30.7699475Z caller=worker.go:112 msg="Starting querier worker using query-scheduler and scheduler ring for addresses"
level=error ts=2022-05-10T09:30:30.7700485Z caller=log.go:100 msg="error running loki" err="when alertmanager-discovery is on, host name must be of the form _portname._tcp.service.fqdn (is \"\")\nerror initialising module: ruler\ngithub.com/grafana/dskit/modules.(*Manager).initModule\n\t/src/loki/vendor/github.com/grafana/dskit/modules/modules.go:108\ngithub.com/grafana/dskit/modules.(*Manager).InitModuleServices\n\t/src/loki/vendor/github.com/grafana/dskit/modules/modules.go:78\ngithub.com/grafana/loki/pkg/loki.(*Loki).Run\n\t/src/loki/pkg/loki/loki.go:339\nmain.main\n\t/src/loki/cmd/loki/main.go:108\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:255\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1581"

For additional context, with the URL in the 'wrong' format, we get the below error:

kubectl describe helmreleases.helm.toolkit.fluxcd.io -n loki loki
Name:         loki
Namespace:    loki
Labels:       kustomize.toolkit.fluxcd.io/name=apps
              kustomize.toolkit.fluxcd.io/namespace=flux-system
...
        Ruler:
          alertmanager_url:  http://kube-prometheues-stack-kub-alertmanager.kube-prometheus-stack.svc.cluster.local:9093/
          enable_api:        true
          rule_path:         /tmp/scratch
          Storage:
            Local:
              Directory:  /rules
            Type:         local

which gives the following error for the loki-read pod:

kubectl logs -n loki loki-loki-simple-scalable-read-0
level=info ts=2022-05-10T09:19:51.2584656Z caller=main.go:106 msg="Starting Loki" version="(version=2.5.0, branch=HEAD, revision=2d9d0ee23)"
level=info ts=2022-05-10T09:19:51.2589196Z caller=server.go:260 http=[::]:3100 grpc=[::]:9095 msg="server listening on addresses"
level=info ts=2022-05-10T09:19:51.2595947Z caller=memberlist_client.go:394 msg="Using memberlist cluster node name" name=loki-loki-simple-scalable-read-0-f838eb28
level=warn ts=2022-05-10T09:19:51.2625326Z caller=experimental.go:20 msg="experimental feature in use" feature="In-memory (FIFO) cache"
level=info ts=2022-05-10T09:19:51.2629938Z caller=table_manager.go:358 msg="loading local table index_19121"
level=info ts=2022-05-10T09:19:51.2636747Z caller=table_manager.go:358 msg="loading local table index_19122"
level=info ts=2022-05-10T09:19:51.264368Z caller=shipper_index_client.go:111 msg="starting boltdb shipper in 1 mode"
level=error ts=2022-05-10T09:19:51.2646458Z caller=log.go:100 msg="error running loki" err="when alertmanager-discovery is on, host name must be of the form _portname._tcp.service.fqdn (is \"kube-prometheues-stack-kub-alertmanager.kube-prometheus-stack.svc.cluster.local:9093\")\nerror initialising module: ruler\ngithub.com/grafana/dskit/modules.(*Manager).initModule\n\t/src/loki/vendor/github.com/grafana/dskit/modules/modules.go:108\ngithub.com/grafana/dskit/modules.(*Manager).InitModuleServices\n\t/src/loki/vendor/github.com/grafana/dskit/modules/modules.go:78\ngithub.com/grafana/loki/pkg/loki.(*Loki).Run\n\t/src/loki/pkg/loki/loki.go:339\nmain.main\n\t/src/loki/cmd/loki/main.go:108\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:255\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1581"

Workaround

Build out a list of alertmanager targets manually with the following:

alertmanager_url: kube-prometheues-stack-kub-alertmanager-0.kube-prometheus-stack.svc.cluster.local, kube-prometheues-stack-kub-alertmanager-1.kube-prometheus-stack.svc.cluster.local, kube-prometheues-stack-kub-alertmanager-2.kube-prometheus-stack.svc.cluster.local

Expected behavior
The URL is parsed as provided and is not treated as an empty string.

Environment:

Additional Context
Helm Release (the full file is included, but the relevant settings are alertmanager_url and enable_alertmanager_discovery):

apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: loki
  namespace: loki
spec:
  chart:
    spec:
      chart: loki-simple-scalable
      version: 0.4.0
      sourceRef:
        kind: HelmRepository
        name: our-private-repo
        namespace: flux-system
  interval: 1m
  values:
    serviceMonitor:
      enabled: true
    gateway:
      image:
          repository: nginxinc/nginx-unprivileged
      service:
        port: 3100
      nginxConfig:
        serverSnippet: |
          location ~ /loki/api/v1/alerts.* {
            proxy_pass       http://loki-loki-simple-scalable-read.loki.svc.cluster.local:3100$request_uri;
          }

          location ~ /prometheus/api/v1/rules.* {
            proxy_pass       http://loki-loki-simple-scalable-read.loki.svc.cluster.local:3100$request_uri;
          }
        httpSnippet: |
          client_max_body_size 0;
    write:
      replicas: 1
      resources:
        limits:
          memory: "4Gi"
      persistence:
        size: 10Gi
        storageClass: gp3 #this is the default, but calling it out explicitly so it can be overridden for dev
    read:
      replicas: 3
      persistence:
        size: 10Gi
        storageClass: gp3
      extraVolumeMounts:
        - name: loki-rules
          mountPath: /rules/fake
        - name: loki-rules-tmp
          mountPath: /tmp/scratch
        - name: loki-tmp
          mountPath: /tmp/loki-tmp
      extraVolumes:
        - name: loki-rules
          configMap:
            name: loki-alerting-rules
        - name: loki-rules-tmp
          emptyDir: {}
        - name: loki-tmp
          emptyDir: {}    
    loki: 
      structuredConfig:
        memberlist:
          join_members:
            - loki-loki-simple-scalable-memberlist.loki.svc.cluster.local
        auth_enabled: false
        server:
          http_listen_port: 3100
          log_level: info
          grpc_server_max_recv_msg_size: 104857600
          grpc_server_max_send_msg_size: 104857600
        schema_config: 
          configs:
          - from: "2020-11-04"
            store: boltdb-shipper
            object_store: aws
            schema: v11
            index:
              prefix: index_
              period: 24h
        storage_config: 
          boltdb_shipper:
            active_index_directory: /var/loki/index
            cache_location: /var/loki/boltdb-cache
            shared_store: s3
        ruler:
          storage:
            type: local
            local:
              directory: /rules
          rule_path: /tmp/scratch
          alertmanager_url: _http-web._tcp.kube-prometheues-stack-kub-alertmanager.kube-prometheus-stack.svc.cluster.local
          enable_alertmanager_discovery: true 
          enable_api: true
        limits_config:
          enforce_metric_name: false
          reject_old_samples: true
          reject_old_samples_max_age: 168h
          ingestion_rate_mb: 30
          ingestion_burst_size_mb: 16
          retention_period: 336h
          max_query_lookback: 336h
          max_streams_per_user: 0
          max_global_streams_per_user: 0
        compactor:
          working_directory: /var/loki/boltdb-shipper-compactor
          shared_store: filesystem
          retention_enabled: true
        chunk_store_config:
          chunk_cache_config:
            enable_fifocache: true
            fifocache:
              max_size_bytes: 500MB
        query_range:
          results_cache:
            cache:
              enable_fifocache: true
              fifocache:
                max_size_bytes: 500MB
        analytics:
          reporting_enabled: false
        ingester:
          max_chunk_age: 1h
pdf commented 2 years ago

I just ran into this, and the related documentation is poor. The alertmanager_url field must still be a valid URI, so prepending any scheme to the hostname appears to allow at least this validation code and the subsequent DNS lookup to function as expected, e.g.:

alertmanager_url: dns://_http-web._tcp.kube-prometheues-stack-kub-alertmanager.kube-prometheus-stack.svc.cluster.local
enable_alertmanager_discovery: true

I note that the tests use the format http://_http._tcp.alertmanager.default.svc.cluster.local/alertmanager, which suggests that a URL with path/port components might work. However, if you include a port number in the URI, DNS resolution breaks, so I'm not sure whether this is a bug or whether only the host component of the input is used and the scheme/path/etc. are dropped.
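For what it's worth, net/url keeps any port inside Host, so if discovery resolves whatever is in Host, appending a port changes the name being looked up. A quick sketch using the URL format from the tests; this is an assumption about where the breakage comes from, not a confirmed read of the discovery code:

package main

import (
	"fmt"
	"net/url"
)

func main() {
	// The format used in the tests: Host is exactly the SRV record name.
	u, _ := url.Parse("http://_http._tcp.alertmanager.default.svc.cluster.local/alertmanager")
	fmt.Println(u.Host) // _http._tcp.alertmanager.default.svc.cluster.local
	fmt.Println(u.Path) // /alertmanager

	// With a port, Host carries the ":9093" suffix, so a lookup on Host
	// would no longer target a plain SRV record name; Hostname() would
	// still strip the port.
	u2, _ := url.Parse("http://_http._tcp.alertmanager.default.svc.cluster.local:9093/alertmanager")
	fmt.Println(u2.Host)       // _http._tcp.alertmanager.default.svc.cluster.local:9093
	fmt.Println(u2.Hostname()) // _http._tcp.alertmanager.default.svc.cluster.local
}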

stale[bot] commented 2 years ago

Hi! This issue has been automatically marked as stale because it has not had any activity in the past 30 days.

We use a stalebot among other tools to help manage the state of issues in this project. A stalebot can be very useful in closing issues in a number of cases; the most common is closing issues or PRs where the original reporter has not responded.

Stalebots are also emotionless and cruel and can close issues which are still very relevant.

If this issue is important to you, please add a comment to keep it open. More importantly, please add a thumbs-up to the original issue entry.

We regularly sort for closed issues which have a stale label sorted by thumbs up.


We are doing our best to respond, organize, and prioritize all issues, but it can be a challenging task; our sincere apologies if you find yourself at the mercy of the stalebot.

dragoangel commented 1 year ago

Here is a working configuration for discovery:

  rulerConfig:
    alertmanager_url: http://_http-web._tcp.alertmanager-operated.monitoring.svc.cluster.local
    enable_api: true
    enable_alertmanager_discovery: true
    enable_alertmanager_v2: true

The port will be taken from the SRV record. I got alerting working with this setup.
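If you want to confirm what the SRV record returns (and hence where the port comes from), a throwaway check run from inside the cluster, using the record name above, could look like this:

package main

import (
	"fmt"
	"net"
)

func main() {
	// Look up the SRV record the ruler is pointed at. Each returned entry
	// carries its own target and port, which is why no port is needed in
	// alertmanager_url. This must run inside the cluster so
	// *.svc.cluster.local names resolve.
	_, srvs, err := net.LookupSRV("", "", "_http-web._tcp.alertmanager-operated.monitoring.svc.cluster.local")
	if err != nil {
		fmt.Println("lookup failed:", err)
		return
	}
	for _, s := range srvs {
		fmt.Printf("%s:%d priority=%d weight=%d\n", s.Target, s.Port, s.Priority, s.Weight)
	}
}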