Is your feature request related to a problem? Please describe.
One day I got an alert:
This alert has a link on it which comes from the prometheus GeneratorURL; basically, this URL takes the promql query for the alert and slams it into the Prometheus metric graphing UI.
I was used to working in an environment where, when there is an alert, the link to the metric graph on that alert will show the metric that triggered the alert. Seems reasonable right?
However, this is not the case with prometheus GeneratorURL. Prometheus' UI interprets the query used for the alert quite literally while graphing: It only displays the line for the metric where the condition was true. Especially because the 1st metric is yellow on a white background by default, this is practically designed to give users brain cancer.
In fact, when I saw this, my 1st assumption was that the server was stuck at 100% CPU and therefore it was only able to produce metrics 10% of the time.
Long story short, this is what I was expecting to see when I clicked that link:
The metric which was causing the alert, in isolation, in focus, at the time the alert happened, and displayed in its entirety. That's what we want to see. We Have The Technology.
Describe the solution you'd like
So, I actually went ahead and implemented this because it really rustled my jimmies how terribad the prometheus UX design is. No wonder everyone uses Grafana. Prometheus works, but its usefulness in this case is being hobbled by tech elitism/widespread acceptance of medically-dangerously-bad ux.
I had two ideas for how to fix this:
Modify go-neb so that it fixes the prometheus GeneratorURL before sending it out to the users.
use the Prometheus open source code to parse the query string into a query object
modify the query object to include extra filter parameters based on the tags that come in on the alert object
drop the comparison operator
serialize the query back
Or the other option:
Modify the configuration files (prometheus alert definition yml and go-neb html_template configuration value)
add an annotation called clickthroughQuery to the alert definition, this annotation will be a modified version of the alert query which is optimized for display (it doesn't have the conditional operator on the outside, and it has the label matchers to only display the series that triggered the alert)
change the template in the go-neb config to hard-code a prometheus URL + the value from the clickthroughQuery annotation
I went with my solution number 1 for a few reasons:
We still have to add the correct timeframe into the URL -- there was no way to do this with pure config so i had to make code changes anyways
Having to maintain 2 copies of the same query on your alerts also falls under the category of "ux that can end your marriage"
parsing promql and modifying it in middleware, its very "swag" :sunglasses:
Describe alternatives you've considered
Realistically this should be fixed in prometheus :shrug: but something tells me that would be a hard sell. This fix works for us and I hope maybe it can be useful to someone else.
Additional context
I just finished haphazardly deploying the first working version, check it out!!:
I also "created" this "library" (if you can call it that) to get around go module package woes related to multiple versions of Prometheus being imported at once: https://giit.cyberia.club/~forest/promql-parser
Unfortunately my solution is sort of "hard-coded" to match our setup. While I assume it would probably work for most people, I very much doubt it would work for everyone, and could even accidentally make the situation worse. For example, if it somehow gets too aggressive with label matchers and filters out all metrics. Or if it trips on a complicated query and doesn't adjust it quite right, displaying confusing results. But I believe there is a lot of room for improvement and at very least the part where it sets the graph to be visible by default with the time frame already set correctly could be safely applied to all go-neb alertmanager GeneratorURLs.
Is your feature request related to a problem? Please describe.
One day I got an alert:
This alert has a link on it which comes from the prometheus GeneratorURL; basically, this URL takes the promql query for the alert and slams it into the Prometheus metric graphing UI.
I was used to working in an environment where, when there is an alert, the link to the metric graph on that alert will show the metric that triggered the alert. Seems reasonable right?
However, this is not the case with prometheus GeneratorURL. Prometheus' UI interprets the query used for the alert quite literally while graphing: It only displays the line for the metric where the condition was true. Especially because the 1st metric is yellow on a white background by default, this is practically designed to give users brain cancer.
In fact, when I saw this, my 1st assumption was that the server was stuck at 100% CPU and therefore it was only able to produce metrics 10% of the time.
Long story short, this is what I was expecting to see when I clicked that link:
The metric which was causing the alert, in isolation, in focus, at the time the alert happened, and displayed in its entirety. That's what we want to see. We Have The Technology.
Describe the solution you'd like
So, I actually went ahead and implemented this because it really rustled my jimmies how terribad the prometheus UX design is. No wonder everyone uses Grafana. Prometheus works, but its usefulness in this case is being hobbled by tech elitism/widespread acceptance of medically-dangerously-bad ux.
I had two ideas for how to fix this:
Or the other option:
html_template
configuration value)I went with my solution number 1 for a few reasons:
Describe alternatives you've considered
Realistically this should be fixed in prometheus :shrug: but something tells me that would be a hard sell. This fix works for us and I hope maybe it can be useful to someone else.
Additional context
I just finished haphazardly deploying the first working version, check it out!!:
https://giit.cyberia.club/~forest/go-neb/log/forest-feature-rebase-2
This is the commit showcasing the meat of the solution: https://giit.cyberia.club/~forest/go-neb/commit/1f7ecc4ac400e2ac66334b395ee3eb8b0377911a
I also "created" this "library" (if you can call it that) to get around go module package woes related to multiple versions of Prometheus being imported at once: https://giit.cyberia.club/~forest/promql-parser
Unfortunately my solution is sort of "hard-coded" to match our setup. While I assume it would probably work for most people, I very much doubt it would work for everyone, and could even accidentally make the situation worse. For example, if it somehow gets too aggressive with label matchers and filters out all metrics. Or if it trips on a complicated query and doesn't adjust it quite right, displaying confusing results. But I believe there is a lot of room for improvement and at very least the part where it sets the graph to be visible by default with the time frame already set correctly could be safely applied to all go-neb alertmanager GeneratorURLs.