matrix-org / go-neb

Extensible matrix bot written in Go
Apache License 2.0
283 stars 91 forks source link

Prometheus GeneratorURL has very poor user experience.. what can we do about this? #343

Open ForestJohnson opened 3 years ago

ForestJohnson commented 3 years ago

Is your feature request related to a problem? Please describe.

One day I got an alert:

image

This alert has a link on it which comes from the prometheus GeneratorURL; basically, this URL takes the promql query for the alert and slams it into the Prometheus metric graphing UI.

I was used to working in an environment where, when there is an alert, the link to the metric graph on that alert will show the metric that triggered the alert. Seems reasonable right?

However, this is not the case with prometheus GeneratorURL. Prometheus' UI interprets the query used for the alert quite literally while graphing: It only displays the line for the metric where the condition was true. Especially because the 1st metric is yellow on a white background by default, this is practically designed to give users brain cancer.

image

In fact, when I saw this, my 1st assumption was that the server was stuck at 100% CPU and therefore it was only able to produce metrics 10% of the time.

Long story short, this is what I was expecting to see when I clicked that link:

image

The metric which was causing the alert, in isolation, in focus, at the time the alert happened, and displayed in its entirety. That's what we want to see. We Have The Technology.

Describe the solution you'd like

So, I actually went ahead and implemented this because it really rustled my jimmies how terribad the prometheus UX design is. No wonder everyone uses Grafana. Prometheus works, but its usefulness in this case is being hobbled by tech elitism/widespread acceptance of medically-dangerously-bad ux.

I had two ideas for how to fix this:

  1. Modify go-neb so that it fixes the prometheus GeneratorURL before sending it out to the users.
    • use the Prometheus open source code to parse the query string into a query object
    • modify the query object to include extra filter parameters based on the tags that come in on the alert object
    • drop the comparison operator
    • serialize the query back

Or the other option:

  1. Modify the configuration files (prometheus alert definition yml and go-neb html_template configuration value)
    • add an annotation called clickthroughQuery to the alert definition, this annotation will be a modified version of the alert query which is optimized for display (it doesn't have the conditional operator on the outside, and it has the label matchers to only display the series that triggered the alert)
    • change the template in the go-neb config to hard-code a prometheus URL + the value from the clickthroughQuery annotation

I went with my solution number 1 for a few reasons:

Describe alternatives you've considered

Realistically this should be fixed in prometheus :shrug: but something tells me that would be a hard sell. This fix works for us and I hope maybe it can be useful to someone else.

Additional context

I just finished haphazardly deploying the first working version, check it out!!:

https://giit.cyberia.club/~forest/go-neb/log/forest-feature-rebase-2

This is the commit showcasing the meat of the solution: https://giit.cyberia.club/~forest/go-neb/commit/1f7ecc4ac400e2ac66334b395ee3eb8b0377911a

I also "created" this "library" (if you can call it that) to get around go module package woes related to multiple versions of Prometheus being imported at once: https://giit.cyberia.club/~forest/promql-parser

Unfortunately my solution is sort of "hard-coded" to match our setup. While I assume it would probably work for most people, I very much doubt it would work for everyone, and could even accidentally make the situation worse. For example, if it somehow gets too aggressive with label matchers and filters out all metrics. Or if it trips on a complicated query and doesn't adjust it quite right, displaying confusing results. But I believe there is a lot of room for improvement and at very least the part where it sets the graph to be visible by default with the time frame already set correctly could be safely applied to all go-neb alertmanager GeneratorURLs.

biox commented 2 years ago

the git URL is out of date :3 - here's the updated version: https://git.cyberia.club/cyberia/go-neb/src/branch/forest-feature-rebase-2