prometheus / alertmanager

Prometheus Alertmanager
https://prometheus.io
Apache License 2.0
6.64k stars 2.15k forks source link

Notifications are not resilient to bad templates #3490

Open jmcarp opened 1 year ago

jmcarp commented 1 year ago

What did you do?

Mis-configured a notification such that a template string was invalid and could not be rendered.

What did you expect to see?

What did you see instead? Under which circumstances?

Validation with amtool didn't show any errors, and notifications were dropped.

Environment

We're running alertmanager 0.25, but I believe this issue affects all versions.

Logs look like this:

alertmanager ts=2023-08-25T20:43:03.251Z caller=dispatch.go:352 level=error component=dispatcher msg="Notify for alerts failed" num_alerts=1 err="opsgenie-receiver/opsgenie[0]: notify retry canceled due to unrecoverable error after 1 attempts: templating error: template: :1: bad character U+002D '-'"

I would be happy to write a pr to either cause amtool to check for invalid templates or to enable notifiers to send partial notifications on template errors—for example, dropping fields that fail to render. Or if there's a better approach, let me know!

grobinson-grafana commented 1 year ago

Hi! 👋 Unless I've misread the issue, it looks like the template was valid but that the text output from the template contained unicode that Opsgenie does not support? How would amtool check for this?

jmcarp commented 1 year ago

Sorry I was unclear about this! This is a templating error that happens before we even make a request to the opsgenie api. So aiui it should be possible to detect statically.

grobinson-grafana commented 1 year ago

Apologies, I have indeed misread the issue! What is the output of amtool when testing this template with the same data?

simonpasquier commented 1 year ago

@jmcarp can you share the config that triggered the error?

jmcarp commented 1 year ago

Here's a simplified version:

route:
  receiver: opsgenie
receivers:
- name: opsgenie
  opsgenie_configs:
  - send_resolved: false
    api_key: redacted
    message: '{{ .GroupLabels.alertname }'

There's a missing closing brace in the message template, so all opsgenie notifications will fail during templating, before we send a request to the api. And we don't detect any problems with amtool, because aiui amtool doesn't look at templated fields within receivers (although it does look at the top-level templates field, which I'm not using here):

❯ amtool check-config am.yaml
Checking 'am.yaml'  SUCCESS
Found:
 - global config
 - route
 - 0 inhibit rules
 - 1 receivers
 - 0 templates

So to restate the ask in context: I would like amtool to throw an error on this example configuration file, and/or I would like the opsgenie notifier to optionally log a warning and omit invalid fields from the request, but not block notification on template errors. I'm happy to write a patch for either feature if maintainers are open to it.

grobinson-grafana commented 1 year ago

I think what's happening here is that am.yaml contains an inline template in the message field rather than executing a template in templates, and inline templates are not checked.

Here is an example of how it would look using template files:

route:
  receiver: opsgenie
receivers:
- name: opsgenie
  opsgenie_configs:
  - send_resolved: false
    api_key: redacted
    message: '{{ template "opsgenie.message" . }}'
templates:
  - tmpls/*.tmpl

That said, I think your argument is still valid because you could also have an error in the inline template {{ template "opsgenie.message" . }}.

simonpasquier commented 1 year ago

Validating fields supporting inline templates when loading the Alertmanager configuration would definitely be a good addition.

jmcarp commented 1 year ago

It looks like each notifier handles inline templating differently. What do you think about adding a validator interface that notifiers can optionally implement, and which we can check via amtool check-config?

type Validator interface {
    Validate() error
}

Then in the opsgenie package:

func (n *Notifier) Validate() error {
   ...
}

This method could check for invalid template strings, but also do other per-notifier validation as needed.

simonpasquier commented 1 year ago

I wonder if the validation could be performed in each notifier's New() since it gets the template engine as a parameter? And amtool check-config could replicate/mimic the buildReceiverIntegrations() function.

grobinson-grafana commented 1 year ago

Another option, although I'm not sure how viable, could we annotate the config structs and then use reflect?

For example:

type SlackConfig struct {
    NotifierConfig `yaml:",inline" json:",inline"`
    ...
    Title       string         `yaml:"title,omitempty" json:"title,omitempty" template:"optional"`
    TitleLink   string         `yaml:"title_link,omitempty" json:"title_link,omitempty" template:"optional"`
    Pretext     string         `yaml:"pretext,omitempty" json:"pretext,omitempty" template:"optional"`
    Text        string         `yaml:"text,omitempty" json:"text,omitempty" template:"optional"`
    ...
}
jmcarp commented 1 year ago

I started drafting a simple validation approach in https://github.com/prometheus/alertmanager/compare/main...jmcarp:alertmanager:validate-opsgenie-config?expand=1. The idea is to exercise opsgenie template rendering without sending a request in New(), which is how I interpreted @simonpasquier's suggestion. I think I prefer this to annotating config structs, since the validation method exercises the same logic that's used when we actually send notifications.

What do you think?

grobinson-grafana commented 1 year ago

I see a couple of issues:

  1. You'll need to implement the Validate method, and write tests, for each supported notifier. I think there are about 14 in total. That's quite a lot of work.

  2. It looks like you not only want to check the syntax of inline templates but also execute them to find not just syntax but also runtime errors. However, that's not what check-config does for template files, and this task is instead left to amtool template render. I think if you want to execute inline templates via check-config then you'll also need to do the same for template files too?

jmcarp commented 1 year ago

Following the suggestion from @simonpasquier, I added validation logic to the opsgenie notifier constructor rather than adding a new required Validate method. I agree that all notifiers should validate their configuration statically as thoroughly as possible, but I don't know that we need to block improving validation for one notifier on improving validation for all of them. What do you think?

As for template files, I'm happy to make a change in check-config if useful. Although if I understand correctly, top-level templates are meant to be used in receiver templates, so rendering receiver templates in check-config may address that issue for free.

I also noticed that the opsgenie notifier's config struct does some partial template validation in its UnmarshalYAML method: https://github.com/prometheus/alertmanager/blob/main/config/notifiers.go#L597-L607. We could expand that logic to cover all relevant fields, but that code would overlap a lot with the existing code in the opsgenie notifier, and we would want to keep the two code paths in sync. I think it will be easier to keep validation in sync with notification if we share more code, as in the draft I linked above, but I don't mind implementing either version if you all have a different preference.

grobinson-grafana commented 1 year ago

but I don't know that we need to block improving validation for one notifier on improving validation for all of them. What do you think?

I think @simonpasquier and @gotjosh should answer that. I suspect the issue will be if its not done in this PR, then there is a good chance it will never get done for the other notifiers. This would leave amtool with the feature for just opsgenie, and then someone else would need to finish it for the other 12 or so notifiers.

jmcarp commented 1 year ago

Submitted a draft PR at #3494.

ricochet1k commented 7 months ago

I'm having a similar issue with PagerDuty, where the template is valid but I accidentally tried to do something like {{ .CommonLabels }} inside {{ range .Alerts.Firing }}, which of course ended up with something like this:

level=error component=dispatcher msg="Notify for alerts failed" num_alerts=1 err="team-infra-page/pagerduty[0]: notify retry canceled due to unrecoverable error after 1 attempts: \"firing\": failed to template \"{{ template \\\"infra.pagerduty.firing\\\" . }}\": template: infra_pagerduty.tmpl:16:77: executing \"infra.pagerduty.firing\" at <.CommonLabels.Names>: can't evaluate field CommonLabels in type template.Alert"

It'd be nice if the template rendering could just use the error string instead of failing to alert altogether!