Open jmcarp opened 1 year ago
Hi! 👋 Unless I've misread the issue, it looks like the template was valid but that the text output from the template contained unicode that Opsgenie does not support? How would amtool check for this?
Sorry I was unclear about this! This is a templating error that happens before we even make a request to the opsgenie api. So aiui it should be possible to detect statically.
Apologies, I have indeed misread the issue! What is the output of amtool when testing this template with the same data?
@jmcarp can you share the config that triggered the error?
Here's a simplified version:
route:
receiver: opsgenie
receivers:
- name: opsgenie
opsgenie_configs:
- send_resolved: false
api_key: redacted
message: '{{ .GroupLabels.alertname }'
There's a missing closing brace in the message template, so all opsgenie notifications will fail during templating, before we send a request to the api. And we don't detect any problems with amtool, because aiui amtool doesn't look at templated fields within receivers (although it does look at the top-level templates field, which I'm not using here):
❯ amtool check-config am.yaml
Checking 'am.yaml' SUCCESS
Found:
- global config
- route
- 0 inhibit rules
- 1 receivers
- 0 templates
So to restate the ask in context: I would like amtool to throw an error on this example configuration file, and/or I would like the opsgenie notifier to optionally log a warning and omit invalid fields from the request, but not block notification on template errors. I'm happy to write a patch for either feature if maintainers are open to it.
I think what's happening here is that am.yaml
contains an inline template in the message field rather than executing a template in templates
, and inline templates are not checked.
Here is an example of how it would look using template files:
route:
receiver: opsgenie
receivers:
- name: opsgenie
opsgenie_configs:
- send_resolved: false
api_key: redacted
message: '{{ template "opsgenie.message" . }}'
templates:
- tmpls/*.tmpl
That said, I think your argument is still valid because you could also have an error in the inline template {{ template "opsgenie.message" . }}
.
Validating fields supporting inline templates when loading the Alertmanager configuration would definitely be a good addition.
It looks like each notifier handles inline templating differently. What do you think about adding a validator interface that notifiers can optionally implement, and which we can check via amtool check-config
?
type Validator interface {
Validate() error
}
Then in the opsgenie package:
func (n *Notifier) Validate() error {
...
}
This method could check for invalid template strings, but also do other per-notifier validation as needed.
I wonder if the validation could be performed in each notifier's New() since it gets the template engine as a parameter? And amtool check-config
could replicate/mimic the buildReceiverIntegrations() function.
Another option, although I'm not sure how viable, could we annotate the config structs and then use reflect
?
For example:
type SlackConfig struct {
NotifierConfig `yaml:",inline" json:",inline"`
...
Title string `yaml:"title,omitempty" json:"title,omitempty" template:"optional"`
TitleLink string `yaml:"title_link,omitempty" json:"title_link,omitempty" template:"optional"`
Pretext string `yaml:"pretext,omitempty" json:"pretext,omitempty" template:"optional"`
Text string `yaml:"text,omitempty" json:"text,omitempty" template:"optional"`
...
}
I started drafting a simple validation approach in https://github.com/prometheus/alertmanager/compare/main...jmcarp:alertmanager:validate-opsgenie-config?expand=1. The idea is to exercise opsgenie template rendering without sending a request in New(), which is how I interpreted @simonpasquier's suggestion. I think I prefer this to annotating config structs, since the validation method exercises the same logic that's used when we actually send notifications.
What do you think?
I see a couple of issues:
You'll need to implement the Validate
method, and write tests, for each supported notifier. I think there are about 14 in total. That's quite a lot of work.
It looks like you not only want to check the syntax of inline templates but also execute them to find not just syntax but also runtime errors. However, that's not what check-config
does for template files, and this task is instead left to amtool template render
. I think if you want to execute inline templates via check-config
then you'll also need to do the same for template files too?
Following the suggestion from @simonpasquier, I added validation logic to the opsgenie notifier constructor rather than adding a new required Validate method. I agree that all notifiers should validate their configuration statically as thoroughly as possible, but I don't know that we need to block improving validation for one notifier on improving validation for all of them. What do you think?
As for template files, I'm happy to make a change in check-config
if useful. Although if I understand correctly, top-level templates are meant to be used in receiver templates, so rendering receiver templates in check-config
may address that issue for free.
I also noticed that the opsgenie notifier's config struct does some partial template validation in its UnmarshalYAML method: https://github.com/prometheus/alertmanager/blob/main/config/notifiers.go#L597-L607. We could expand that logic to cover all relevant fields, but that code would overlap a lot with the existing code in the opsgenie notifier, and we would want to keep the two code paths in sync. I think it will be easier to keep validation in sync with notification if we share more code, as in the draft I linked above, but I don't mind implementing either version if you all have a different preference.
but I don't know that we need to block improving validation for one notifier on improving validation for all of them. What do you think?
I think @simonpasquier and @gotjosh should answer that. I suspect the issue will be if its not done in this PR, then there is a good chance it will never get done for the other notifiers. This would leave amtool with the feature for just opsgenie, and then someone else would need to finish it for the other 12 or so notifiers.
Submitted a draft PR at #3494.
I'm having a similar issue with PagerDuty, where the template is valid but I accidentally tried to do something like {{ .CommonLabels }}
inside {{ range .Alerts.Firing }}
, which of course ended up with something like this:
level=error component=dispatcher msg="Notify for alerts failed" num_alerts=1 err="team-infra-page/pagerduty[0]: notify retry canceled due to unrecoverable error after 1 attempts: \"firing\": failed to template \"{{ template \\\"infra.pagerduty.firing\\\" . }}\": template: infra_pagerduty.tmpl:16:77: executing \"infra.pagerduty.firing\" at <.CommonLabels.Names>: can't evaluate field CommonLabels in type template.Alert"
It'd be nice if the template rendering could just use the error string instead of failing to alert altogether!
What did you do?
Mis-configured a notification such that a template string was invalid and could not be rendered.
What did you expect to see?
amtool check-config
to detect template errors in notifier configuration, and/orWhat did you see instead? Under which circumstances?
Validation with
amtool
didn't show any errors, and notifications were dropped.Environment
We're running alertmanager 0.25, but I believe this issue affects all versions.
Logs look like this:
alertmanager ts=2023-08-25T20:43:03.251Z caller=dispatch.go:352 level=error component=dispatcher msg="Notify for alerts failed" num_alerts=1 err="opsgenie-receiver/opsgenie[0]: notify retry canceled due to unrecoverable error after 1 attempts: templating error: template: :1: bad character U+002D '-'"
I would be happy to write a pr to either cause amtool to check for invalid templates or to enable notifiers to send partial notifications on template errors—for example, dropping fields that fail to render. Or if there's a better approach, let me know!