Add ability to generate png graph on the alert for embedding with receivers

garo commented 5 years ago

Both Slack and PagerDuty allow including one or more images with the generated alert. At least with PagerDuty, the image needs to be accessible via HTTPS.

Alertmanager should have a way to generate a publicly accessible rendered image of the alert query, so that the image can be attached to the alert. This way the person receiving the alert could easily see a visual explanation of the alert context.

The generated image could at least show the alert expression, separately for the left and right side of the expression.

After thinking about this a bit, here is one way it could be implemented:

Extend the Alert model to include the expression, perhaps broken down into the left and right parts of the comparison.
Add templating support which generates a cryptographically signed URL for a Prometheus query; the query would be passed in as a parameter from the Alert model mentioned above.
Create an HTTP endpoint in Alertmanager which would render a PNG if provided with a correctly signed query. Signing is required because the Alertmanager endpoint would need to be exposed to the public internet for PagerDuty and Slack to be able to connect to it.

These steps would make it possible to construct appropriate Alertmanager receiver route rules, which could generate the required image URLs with the templating functionality mentioned above.
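As a rough illustration of the signing idea in the steps above, here is a minimal Python sketch using HMAC-SHA256. Everything in it is an assumption made for illustration: the `SECRET`, the `BASE_URL`, the parameter names, and the `sign_url`/`verify_url` helpers are hypothetical, and nothing like this exists in Alertmanager.

```python
import hashlib
import hmac
from urllib.parse import parse_qs, urlencode, urlsplit

# Hypothetical values: a shared secret and a public render endpoint.
SECRET = b"change-me"
BASE_URL = "https://alerts.example.com/render.png"


def _signature(query: str, start: int, end: int) -> str:
    """HMAC-SHA256 over the query and the time range, hex encoded."""
    message = f"{query}\n{start}\n{end}".encode()
    return hmac.new(SECRET, message, hashlib.sha256).hexdigest()


def sign_url(query: str, start: int, end: int) -> str:
    """Build a URL whose query, start and end are covered by a signature."""
    params = {
        "query": query,
        "start": start,
        "end": end,
        "sig": _signature(query, start, end),
    }
    return f"{BASE_URL}?{urlencode(params)}"


def verify_url(url: str) -> bool:
    """Return True only if the signature matches the other parameters."""
    params = parse_qs(urlsplit(url).query)
    try:
        query = params["query"][0]
        start = int(params["start"][0])
        end = int(params["end"][0])
        sig = params["sig"][0]
    except (KeyError, ValueError):
        return False
    expected = _signature(query, start, end)
    # Constant-time comparison on bytes to avoid timing side channels.
    return hmac.compare_digest(sig.encode(), expected.encode())
```

With this shape, the public endpoint would render only queries that were signed by whoever holds the secret, so exposing it would not allow arbitrary queries to be run.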

roidelapluie commented 5 years ago

We have that feature in a custom webhook receiver, but not based on the original alert; instead, we add custom PromQL expressions as annotations.

mxinden commented 5 years ago

I will break this down into two feature requests:

Rendering images in Alertmanager: Alertmanager is notified by Prometheus that a given alert is firing. Prometheus does not send along the corresponding time series or an image of the graph. Therefore Alertmanager by itself cannot render the image.

One could expose an endpoint on Prometheus that would return an image given a Prometheus query, which could then be combined with the alert from Alertmanager. While this is technically possible, I think it would raise several problems:
- Separation of concerns: Prometheus is a monitoring tool and time series database. All graphic-features have been kept as minimal as possible so far.
- Rendering images of graphs is difficult to achieve in a generic way as there are so many options (file format, size, ...).
Hence I think using a custom webhook as @roidelapluie suggested is a great option. It would receive the alert from Alertmanager, retrieve the time series from Prometheus, render the graphic and send everything to your notification-system.
HTTP endpoint accepting signed queries: The Prometheus project does not provide authn, authz, or encryption features as of today. This might change but is unlikely to happen any time soon.
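To make the webhook route more concrete, here is a minimal Python sketch of its first half: taking the JSON payload that Alertmanager posts to a webhook receiver and building Prometheus `query_range` API URLs from it. The `PROMETHEUS_URL`, the `graph_query` annotation name, and the one-hour lookback are assumptions chosen for illustration. Fetching the data and rendering the image are left out.

```python
import re
from datetime import datetime, timedelta
from urllib.parse import urlencode

# Hypothetical address of the Prometheus server.
PROMETHEUS_URL = "http://prometheus.example.com:9090"


def parse_rfc3339(value: str) -> datetime:
    """Parse an RFC 3339 timestamp, ignoring fractional seconds."""
    # Timestamps may carry nanoseconds, which older Python versions
    # cannot parse, so the fraction is dropped and "Z" becomes an offset.
    value = re.sub(r"\.\d+", "", value).replace("Z", "+00:00")
    return datetime.fromisoformat(value)


def range_query_urls(payload, query_annotation="graph_query",
                     lookback=timedelta(hours=1), step="60s"):
    """Return one query_range URL per alert that carries a query annotation."""
    urls = []
    for alert in payload.get("alerts", []):
        query = alert.get("annotations", {}).get(query_annotation)
        if not query:
            continue  # nothing to graph for this alert
        # Graph the window that led up to the moment the alert started.
        end = parse_rfc3339(alert["startsAt"])
        start = end - lookback
        params = {
            "query": query,
            "start": start.timestamp(),
            "end": end.timestamp(),
            "step": step,
        }
        urls.append(f"{PROMETHEUS_URL}/api/v1/query_range?{urlencode(params)}")
    return urls
```

A webhook built this way would then fetch each URL, plot the returned series with whatever library it prefers, and forward the image together with the alert to the notification system.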

@garo what do you think?

roidelapluie commented 5 years ago

The main reason for not accepting this feature request is that the alerting rule is not relevant for understanding the context.

Some alerts have >, on() joins, and so on, and when an alert is resolved you cannot know whether that was because of a missing metric or the threshold. The alerting query on its own is meaningless.

garo commented 5 years ago

@roidelapluie I understand your reasoning that in many cases the alerting rule is not easily usable for this.

Could the alerting rule have a field (label / annotation) where the user could describe a relevant query, which could then be rendered out and sent as an attachment? This approach would require the ability to specify multiple queries which would then be embedded in the same image.
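One way to carry several queries per alert would be a naming convention on annotations. The Python sketch below assumes a hypothetical `graph_query_<name>` prefix, which is not an existing Prometheus or Alertmanager convention, and collects every matching annotation into a name-to-query mapping that a renderer could draw into one image.

```python
def graph_queries(annotations, prefix="graph_query_"):
    """Map series name -> PromQL for each annotation that uses the prefix.

    Annotations are visited in sorted key order so the resulting
    graph layout is stable between notifications.
    """
    return {
        key[len(prefix):]: value
        for key, value in sorted(annotations.items())
        if key.startswith(prefix) and value.strip()
    }
```

For example, an alert annotated with `graph_query_errors` and `graph_query_traffic` would yield two entries, `errors` and `traffic`, while unrelated annotations such as `summary` are ignored.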

Remember that the goal is not to replicate a full-featured visualisation backend such as Grafana, but just to provide a first look that gives the on-call engineer some context.

@mxinden How about Alertmanager being the component that does the actual image rendering, rather than Prometheus? It is true that adding image rendering components isn't trivial. (I'm not sure how the current Prometheus UI handles graph rendering. Is it done fully client-side with JavaScript?)

Another option would be to hand the image rendering off to an external service (such as Grafana, which does have this kind of API) and then just let Alertmanager combine everything. In this case Alertmanager would call the external rendering service at a user-configurable URL (using the templating language to construct the appropriate URL to send to the rendering service), download the created image as an attachment, and then somehow expose it to the public internet, for example at a randomly generated static URL.
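The user-configurable URL could be as simple as a template with placeholders filled in from the alert. Here is a minimal Python sketch using `string.Template`; the render service address and the placeholder names are hypothetical, and it does not reproduce Grafana's actual render API.

```python
from string import Template
from urllib.parse import quote


def render_url(template: str, values: dict) -> str:
    """Fill ${name} placeholders in a URL template with escaped values.

    Every value is percent-encoded so that a PromQL expression cannot
    break out of its query parameter. A missing placeholder value
    raises KeyError instead of silently producing a broken URL.
    """
    escaped = {key: quote(str(value), safe="") for key, value in values.items()}
    return Template(template).substitute(escaped)
```

A configured template such as `https://render.example.com/graph.png?query=${query}&from=${start}&to=${end}` would then be expanded once per alert, and the resulting image downloaded and attached to the notification.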

roidelapluie commented 5 years ago

https://github.com/qvl/promplot

stuartnelson3 commented 5 years ago

> Could the alerting rule have a field (label / annotation) where the user could describe a relevant query, which could then be rendered out and sent as an attachment? This approach would require the ability to specify multiple queries which would then be embedded in the same image.

This is something you could create yourself using a webhook. The notification goes to the webhook, which creates an image and a URL that references it.

> Another option would be to handle the image rendering to an external service (such as grafana, which does have this kind of API) and then just let Alertmanager to combine all this together. In this case Alertmanager would know to call external rendering service as an user configurable URL (use templating language to construct appropriate url to be sent to the rendering service), download the created image as an attachment and then somehow expose that to public internet, for example as a random generated static url.

If your webhook creates a url that you can "know beforehand", you could set it as an annotation on the notification.

In general, I'm not in favor of adding this specific behavior natively to alertmanager.

garo commented 5 years ago

Thanks for your input.

After looking at qvl/promplot and webhooks, I see that I could build the feature by adding some custom code and an external service or two, but not without adding more moving parts and making the alerting more error-prone.

I understand that adding this kind of feature is a bit awkward given how Prometheus and Alertmanager are currently built, so I'm closing this issue now.

I still believe that the user story "As a person receiving an alert I want to see an embedded graph image showing the history of the metric before it triggered the alert" is valid. If somebody can think of a better way to implement this story, please open a new ticket.

cameronkerrnz commented 5 years ago

For the likes of Slack, I would suggest a chatbot that grabs predefined dashboards. Say you're monitoring your website and you get a lot of 502 responses: I could imagine a chatbot that responds with images. To illustrate very loosely...

bot: dude, there's lots of 502/503 responses from the website, and it's freaking me out
me: spiders?
bot: here's a graph showing breakdown of web crawlers and top user-agents
/me doesn't see anything obvious there...
me: ips?
bot: here's a graph showing the top ips over time
/me sees unusually high activity from a certain IP
me: ban ip x.x.x.x
bot: initiating Ansible playbook to ban IP x.x.x.x for the default of 6 hours and recording that @me requested this action at <timestamp>
/me goes back to watch movie in peace

That also shows that you could then roll in data from logs as well as metrics, and also initiate routine responses.

If that doesn't satisfy your needs (it is a bit of a stretch goal), you could consider moving the alerting functionality to Grafana, which could have the advantage of one alerting system that covers Prometheus, Elasticsearch, and others. (I haven't investigated that, but I was pretty interested by a screenshot of Slack showing a Grafana graph.)

aantn commented 2 years ago

@garo @cameronkerrnz we built that chatbot (well, something similar anyway) and it can show:

- a graph for the currently firing alert
- an export of any Grafana dashboard
- logs for the pod on which the alert is firing
- the output of any CLI command

...we essentially built a library of 50+ prebuilt webhooks that you can configure with YAML.

It's open source and I would love to hear some ideas for what we're missing or how we can make it easier. (Leave a comment here, message natan at robusta dot dev, or message me on our Slack.)

prometheus / alertmanager

Add ability to generate png graph on the alert for embedding with receivers #1636