grafana / mimir

Grafana Mimir provides horizontally scalable, highly available, multi-tenant, long-term storage for Prometheus.
https://grafana.com/oss/mimir/
GNU Affero General Public License v3.0

Alertmanager fallback configuration doesn't support templates #3171

Open Ferrany1 opened 2 years ago

Ferrany1 commented 2 years ago

Describe the bug

Alertmanager can't parse the definition from a custom template, resulting in an empty Telegram message error when it tries to send.

To Reproduce

Steps to reproduce the behavior:

docker-compose.yaml

  monitoring_host_mimir-1:
    container_name: monitoring_host_mimir-1
    image: grafana/mimir:2.3.1
    user: "0"
    command: [ "-config.file=/etc/mimir/mimir.yml" ]
    restart: unless-stopped
    logging: *default-logging
    volumes:
      - ../backup/mimir-1:/mimir
      - ../config/mimir/mimir.yml:/etc/mimir/mimir.yml:ro
      - ../config/mimir/alertmanager/alerts/:/mimir/fs_rules/anonymous/:ro
      - ../config/mimir/alertmanager/alertmanager.yml:/etc/mimir/alertmanager.yml:ro
    expose:
      - 8080

alertmanager.yml

templates:
- '/etc/alertmanager/templates/telegram.tmpl'

receivers:
  - name: "telegram"
    telegram_configs:
      - bot_token:
        chat_id:
        api_url: https://api.telegram.org
        message: '{{ template "telegram.message" . }}'

alertmanager.yml

receivers:
  - name: "telegram"
    telegram_configs:
      - bot_token:
        chat_id:
        api_url: https://api.telegram.org
        message: '{{ template "telegram.message" . }}'
        parse_mode: HTML

telegram.tmpl

{{ define "__alertmanager" }}Alertmanager{{ end }}
{{ define "__alertmanagerURL" }}{{ .ExternalURL }}/#/alerts?receiver={{ .Receiver | urlquery }}{{ end }}

{{ define "__subject" }}[{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}] {{ .GroupLabels.SortedPairs.Values | join " " }} {{ if gt (len .CommonLabels) (len .GroupLabels) }}({{ with .CommonLabels.Remove .GroupLabels.Names }}{{ .Values | join " " }}{{ end }}){{ end }}{{ end }}
{{ define "__description" }}{{ end }}

{{ define "telegram.message" }}
{{ if gt (len .Alerts.Firing) 0 }}
Alerts Firing:
{{ template "__text_alert_list" .Alerts.Firing }}
{{ end }}
{{ if gt (len .Alerts.Resolved) 0 }}
Alerts Resolved:
{{ template "__text_alert_list" .Alerts.Resolved }}
{{ end }}
{{ end }}

Expected behavior

The alert is sent as a Telegram message.

Environment

Additional Context

I've tested the template via https://github.com/prometheus/alertmanager/blob/main/template/template_test.go with '{{ template "telegram.message" . }}' and everything works correctly. I haven't tried deploying a standalone Prometheus Alertmanager, but I expect it would work fine.

For now, to make everything work, I've put the full message template body directly into the message field of alertmanager.yml, without the template reference, and it works.
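The failure mode can be reproduced with plain Go `text/template`, which Alertmanager's templating is built on. This is a minimal sketch, assuming a trimmed-down `telegram.tmpl` (the real file also needs Alertmanager's default templates and helper functions) and a hypothetical `alertData` type in place of Alertmanager's real template data: when the file defining `telegram.message` is parsed into the same template set as the message field, the message renders; when it isn't, execution fails because the referenced template is not defined.

```go
package main

import (
	"fmt"
	"strings"
	"text/template"
)

// messageField mirrors the `message` value of the telegram receiver in alertmanager.yml.
const messageField = `{{ template "telegram.message" . }}`

// templateFile is a trimmed-down stand-in for telegram.tmpl. The real file also
// relies on Alertmanager's default templates and helper functions.
const templateFile = `{{ define "telegram.message" }}Alerts Firing: {{ len .Firing }}{{ end }}`

// alertData is a hypothetical stand-in for Alertmanager's template data.
type alertData struct {
	Firing []string
}

// render parses the message field, optionally together with the template file
// in the same template set, and executes it against data.
func render(withTemplateFile bool, data alertData) (string, error) {
	t := template.New("message")
	if withTemplateFile {
		if _, err := t.New("telegram.tmpl").Parse(templateFile); err != nil {
			return "", err
		}
	}
	// Parsing succeeds either way: template references are resolved at execution time.
	if _, err := t.Parse(messageField); err != nil {
		return "", err
	}
	var sb strings.Builder
	if err := t.Execute(&sb, data); err != nil {
		return "", err
	}
	return sb.String(), nil
}

func main() {
	data := alertData{Firing: []string{"HighCPU", "DiskFull"}}
	out, err := render(true, data)
	fmt.Printf("with telegram.tmpl: %q, err=%v\n", out, err)
	out, err = render(false, data)
	fmt.Printf("without telegram.tmpl: %q, err=%v\n", out, err)
}
```

Without the template file, the only thing left is an execution error and no message text, which is consistent with the empty-message error reported above.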

dimitarvdimitrov commented 2 years ago

In your example the telegram.tmpl isn't mounted in the container. Could this be the problem?

Ferrany1 commented 2 years ago

No, sorry, I copied the latest version without the mount. However, I was testing with the proper mount, and I checked that the file is actually mounted at that path.

pracucci commented 2 years ago

Are you configuring alertmanager.yml by uploading it with the mimirtool alertmanager load command (along with the template), or are you configuring it as the fallback Alertmanager configuration?

Ferrany1 commented 2 years ago

As a fallback

mimir.yml part

alertmanager:
  data_dir: /mimir/alertmanager
  fallback_config_file: /etc/mimir/alertmanager.yml
  external_url: http://127.0.0.1:8080/alertmanager

pracucci commented 2 years ago

You raised a very good point. The alertmanager fallback configuration currently doesn't support templates. This is something we should fix.

As a workaround, could you upload the Alertmanager YAML config + templates for the specific tenant using mimirtool alertmanager load instead (doc)?
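A minimal Go sketch of scripting that workaround. It only assembles the command line; `loadArgs` and `upload` are hypothetical helpers, the address and tenant values are example values, and the `--address`/`--id` flags and the positional order (config file first, then template files) should be double-checked against `mimirtool alertmanager load --help` for your version.

```go
package main

import (
	"fmt"
	"os/exec"
	"strings"
)

// loadArgs assembles arguments for `mimirtool alertmanager load`: the
// Alertmanager config first, then any template files it references.
func loadArgs(address, tenantID, configFile string, templateFiles ...string) []string {
	args := []string{"alertmanager", "load", "--address=" + address, "--id=" + tenantID, configFile}
	return append(args, templateFiles...)
}

// upload runs mimirtool (assumed to be on PATH) and returns its combined output.
func upload(args []string) (string, error) {
	out, err := exec.Command("mimirtool", args...).CombinedOutput()
	return string(out), err
}

func main() {
	args := loadArgs("http://127.0.0.1:8080", "anonymous", "alertmanager.yml", "telegram.tmpl")
	fmt.Println("mimirtool " + strings.Join(args, " "))
	// To actually upload: out, err := upload(args)
}
```

Uploading the templates alongside the config this way stores them with the tenant's config, so the fallback path is not involved.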

Ferrany1 commented 2 years ago

I've managed it by putting the full message template (without the define) into alertmanager.yml. If you could point me to the loader, I'll have a look and maybe open a PR with fixes, if it's needed and the team isn't already working on it.

pracucci commented 2 years ago

> If you could point me to the loader, I'll have a look and maybe open a PR with fixes, if it's needed and the team isn't already working on it.

No one is working on it and we would love your help! ❤️

The fallback config is loaded from here: https://github.com/grafana/mimir/blob/main/pkg/alertmanager/multitenant.go#L840

alertmanagerFromFallbackConfig() is a bit tricky. It works by creating an empty config definition and storing it in the backend storage: https://github.com/grafana/mimir/blob/3c8fabdbece41f894a49c7024cdd5982fa26924d/pkg/alertmanager/multitenant.go#L864-L868

Then we call setConfig(), which loads the fallback config if the stored config is empty (it was forcibly set to empty in alertmanagerFromFallbackConfig()): https://github.com/grafana/mimir/blob/3c8fabdbece41f894a49c7024cdd5982fa26924d/pkg/alertmanager/multitenant.go#L675-L684

paulojmdias commented 1 year ago

Is someone working on it? Or can we think about contributing to it?

pracucci commented 1 year ago

> Is someone working on it? Or can we think about contributing to it?

No one is working on it. You're welcome to contribute! ❤️

achetronic commented 1 year ago

Definitely this is something I would love to have <3

As an idea, what about simply creating a small watcher for k8s that detects changes to a ConfigMap holding the configs and then uses mimirtool to upload them from time to time?
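One way to sketch the change-detection half of such a watcher in Go, assuming the ConfigMap is mounted as files that the caller reads into a name-to-contents map: hash all files into one fingerprint and upload only when it differs from the last successful upload. `fingerprint` and `watcher` are hypothetical names; the upload itself (for example running mimirtool) is left to the caller.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"
)

// fingerprint hashes a set of files (name -> contents) into one digest.
// Names are sorted so map iteration order does not matter, and each name and
// body is length-prefixed so different splits cannot collide.
func fingerprint(files map[string][]byte) string {
	names := make([]string, 0, len(files))
	for name := range files {
		names = append(names, name)
	}
	sort.Strings(names)
	h := sha256.New()
	for _, name := range names {
		body := files[name]
		fmt.Fprintf(h, "%d:%s%d:", len(name), name, len(body))
		h.Write(body)
	}
	return hex.EncodeToString(h.Sum(nil))
}

// watcher remembers the fingerprint of the last successful upload.
type watcher struct{ lastUploaded string }

// needsUpload returns the current fingerprint and whether it differs from the
// last successfully uploaded one.
func (w *watcher) needsUpload(files map[string][]byte) (string, bool) {
	fp := fingerprint(files)
	return fp, fp != w.lastUploaded
}

// markUploaded records fp only after the upload succeeded, so a failed
// upload is retried on the next check.
func (w *watcher) markUploaded(fp string) { w.lastUploaded = fp }

func main() {
	w := &watcher{}
	files := map[string][]byte{
		"alertmanager.yml": []byte("route:\n  receiver: telegram\n"),
		"telegram.tmpl":    []byte(`{{ define "telegram.message" }}test{{ end }}`),
	}
	if fp, ok := w.needsUpload(files); ok {
		fmt.Println("changed, uploading") // run mimirtool here, then on success:
		w.markUploaded(fp)
	}
	if _, ok := w.needsUpload(files); !ok {
		fmt.Println("unchanged, skipping upload")
	}
}
```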

eric-engberg commented 1 year ago

Running into the same issue. I chose to use the fallback config because there's no way to configure Alertmanager configs without mimirtool (I don't want a manual step in configuring Mimir). Is there any other solution currently? Has anyone ever worked on this? I'm not capable of doing it myself.

achetronic commented 1 year ago

> Running into the same issue. I chose to use the fallback config because there's no way to configure Alertmanager configs without mimirtool (I don't want a manual step in configuring Mimir). Is there any other solution currently? Has anyone ever worked on this? I'm not capable of doing it myself.

As I said previously, if you run mimirtool in a cronjob with your config and templates loaded into it, you can upload your config periodically, say every 5 minutes, and it's automated. We use it that way and it's working well 😊

Ferrany1 commented 1 year ago

@pracucci It seems I've managed to fix it, but I have no idea how to write tests for it, since you're not testing Alertmanager in Mimir, and the only option I see is to push alerts to the notifier, which sends them directly to the receivers.

I've tested it locally with these configs:

mimir.yaml:

target: all,alertmanager,ruler

multitenancy_enabled: false
no_auth_tenant: "anonymous"

blocks_storage:
  backend: filesystem
  bucket_store:
    sync_dir: ./temp/tsdb-sync
  filesystem:
    dir: ./temp/data/tsdb
  tsdb:
    dir: ./temp/tsdb

compactor:
  data_dir: ./temp/compactor
  sharding_ring:
    kvstore:
      store: memberlist

distributor:
  ring:
    instance_addr: 127.0.0.1
    kvstore:
      store: memberlist

ingester:
  ring:
    instance_addr: 127.0.0.1
    kvstore:
      store: memberlist
    replication_factor: 1

ruler:
  alertmanager_url: http://127.0.0.1:8080/alertmanager

ruler_storage:
  backend: local
  local:
    directory: ./temp/fs_rules

alertmanager:
  data_dir: ./temp/alertmanager
  fallback_config_file: ./alertmanager.yaml
  external_url: http://127.0.0.1:8080/alertmanager

alertmanager_storage:
  backend: filesystem
  filesystem:
    dir: ./temp/alerts

limits:
  max_label_names_per_series: 100

server:
  log_level: warn
  http_listen_port: 8080

store_gateway:
  sharding_ring:
    replication_factor: 1

alertmanager.yaml:

route:
  repeat_interval: 30s
  group_interval: 60s
  group_wait: 30s
  receiver: 'telegram'

templates:
  - './telegram.tmpl'

receivers:
  - name: "telegram"
    telegram_configs:
      - bot_token: ''
        chat_id: ''
        api_url: https://api.telegram.org
        message: '{{ template "telegram.message" . }}'

telegram.tmpl

{{ define "telegram.message" }}
test
{{ end }}

Ferrany1 commented 1 year ago

Sorry for taking so long; I totally forgot about this issue for a year.

Ferrany1 commented 1 year ago

@pracucci Can you help me with the PR?

grobinson-grafana commented 1 year ago

@pracucci before I take a look at #6495, is it possible Mimir doesn't support templates in the fallback configuration on purpose to avoid a situation where the fallback configuration fails for the same reason as the main configuration (i.e. a shared, bad template)?

pracucci commented 1 year ago

> before I take a look at https://github.com/grafana/mimir/pull/6495, is it possible Mimir doesn't support templates in the fallback configuration on purpose to avoid a situation where the fallback configuration fails for the same reason as the main configuration (i.e. a shared, bad template)?

We should ask @gotjosh and @stevesg because they know better. I don't remember any discussion where we decided not to do it on purpose. My feeling is rather that this was an oversight on our part.

However, I think we should ideally validate the fallback config and not start the alertmanager if some required templates are missing.
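Part of that validation can be done statically with Go's `text/template/parse` package. The sketch below (hypothetical helpers `referencedTemplates` and `missingTemplates`) parses the template files together with the message field and lists every `{{ template "..." }}` reference that has no matching `define`. Assumptions: Alertmanager's built-in default templates (which define names such as `__text_alert_list`) are not loaded here, and templates that use Alertmanager helper functions such as `toUpper` or `join` would need those registered with `Funcs` before parsing.

```go
package main

import (
	"fmt"
	"sort"
	"text/template"
	"text/template/parse"
)

// referencedTemplates walks a parse tree and records the names used in
// {{ template "name" }} actions, including those nested in if/range/with.
func referencedTemplates(n parse.Node, out map[string]bool) {
	switch n := n.(type) {
	case *parse.ListNode:
		if n == nil {
			return
		}
		for _, child := range n.Nodes {
			referencedTemplates(child, out)
		}
	case *parse.TemplateNode:
		out[n.Name] = true
	case *parse.IfNode:
		referencedTemplates(n.List, out)
		referencedTemplates(n.ElseList, out)
	case *parse.RangeNode:
		referencedTemplates(n.List, out)
		referencedTemplates(n.ElseList, out)
	case *parse.WithNode:
		referencedTemplates(n.List, out)
		referencedTemplates(n.ElseList, out)
	}
}

// missingTemplates parses the template files and the message into one set and
// returns, sorted, every referenced template name that is never defined.
func missingTemplates(message string, templateFiles map[string]string) ([]string, error) {
	set := template.New("message")
	for name, body := range templateFiles {
		if _, err := set.New(name).Parse(body); err != nil {
			return nil, fmt.Errorf("parsing %s: %w", name, err)
		}
	}
	if _, err := set.Parse(message); err != nil {
		return nil, fmt.Errorf("parsing message: %w", err)
	}
	refs := map[string]bool{}
	for _, t := range set.Templates() {
		if t.Tree != nil {
			referencedTemplates(t.Tree.Root, refs)
		}
	}
	var missing []string
	for name := range refs {
		if set.Lookup(name) == nil {
			missing = append(missing, name)
		}
	}
	sort.Strings(missing)
	return missing, nil
}

func main() {
	files := map[string]string{
		"telegram.tmpl": `{{ define "telegram.message" }}{{ template "__text_alert_list" . }}{{ end }}`,
	}
	missing, err := missingTemplates(`{{ template "telegram.message" . }}`, files)
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println("missing templates:", missing)
}
```

A check like this could run when the fallback config is loaded, so a missing template stops startup instead of surfacing later as a failed notification.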

grobinson-grafana commented 1 year ago

> However, I think we should ideally validate the fallback config and not start the alertmanager if some required templates are missing.

The main issue here is that the template in the fallback configuration can fail at runtime. Not because it's absent on disk, but because the template has a syntax error or attempts to access a struct field that does not exist. A lot of this can be mitigated with static analysis, but it's a lot of work.
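A complementary mitigation for these runtime failures is a dry run at startup: parse and execute each template against representative sample data, with `missingkey=error` so absent map keys also fail, and refuse to load the config if that errors. The types `sampleData` and `sampleAlert` and the helper `dryRun` are hypothetical and much smaller than Alertmanager's real template data. A dry run only exercises the branches the sample data reaches, so it reduces rather than eliminates the risk.

```go
package main

import (
	"fmt"
	"io"
	"text/template"
)

// sampleData and sampleAlert are hypothetical, cut-down versions of the data
// Alertmanager passes to notification templates.
type sampleData struct {
	Status string
	Alerts []sampleAlert
}

type sampleAlert struct {
	Labels map[string]string
}

// dryRun parses and executes a template against sample data, so syntax errors,
// unknown struct fields, and missing map keys surface before any alert is sent.
func dryRun(name, body string, data any) error {
	t, err := template.New(name).Option("missingkey=error").Parse(body)
	if err != nil {
		return fmt.Errorf("parse: %w", err)
	}
	if err := t.Execute(io.Discard, data); err != nil {
		return fmt.Errorf("execute: %w", err)
	}
	return nil
}

func main() {
	data := sampleData{
		Status: "firing",
		Alerts: []sampleAlert{{Labels: map[string]string{"alertname": "HighCPU"}}},
	}
	fmt.Println(dryRun("ok", `{{ .Status }}: {{ range .Alerts }}{{ .Labels.alertname }}{{ end }}`, data))
	fmt.Println(dryRun("typo", `{{ .Statuss }}`, data))
	fmt.Println(dryRun("syntax", `{{ if .Status }}unterminated`, data))
}
```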

Ferrany1 commented 11 months ago

Do you still need my PR attached to this issue, or can I abandon it?