Closed marcindulak closed 5 years ago
We use the sentry_open_issue_events
to generate alarms if a certain error happens a certain amount of times. So for instance if the same error has more than 30 events, we send an alarm via slack/mail/phone. This alarm is just for one issue.
Another alarm is triggered when the sum of the events per project increases above a certain threshold. This is done via sum(sentry_open_issue_events) by (project_name)
. So this covers the case when the before mentioned threshold is not reached, but the sum of different issues is bad.
What do you mean with "rates of errors"? What do you want to visualize or be alarmed for?
Rate of errors - I mean rate of events sentry_open_issue_events
.
Correct me if I'm wrong, but to me sentry_open_issue_events
looks like a counter (I haven't look though into the code). See the graph below of the query sentry_open_issue_events{project_slug="my-service","issue_link"="something"}
from the last week.
Therefore I'm trying to calculate the rate (https://prometheus.io/docs/prometheus/latest/querying/functions/#rate) of that counter topk(3, rate(sentry_open_issue_events{project_slug="my-service"}[5m]))
, and present this on grafana dashboard. Almost all the time the rates will be zero, but when there is a new event, the issue_last_seen
label changes creating a new time series,
and prometheus won't calculate rate between such two different time series. As a consequence all the rates are zero.Therefore I want to remove issue_last_seen
from the labels, so the time series for the given issue_link
continues.
We use it to display different graphs in Grafana. One is sum(sentry_open_issue_events) by (project_name)
which is good for alarming the project team. Another one lists all open issues in a table with the newest ones on top. Hence, we need the last seen label for sorting here.
Do you just need to visualize something or to generate alarms with it? Im still not really understanding why you need a rate when you can draw the graph with the actual numbers. What shall this rate tell you about your application? (just not getting the use case)
Assuming it is a counter, the problem with an ever increasing counter is that I cannot setup a common alerting rule or visualize on a common, summary grafana graph. The counters for different project_slug
may differ by orders of magnitude in the number of sentry_open_issue_events
. With rate I can setup an alert when e.g. 60 * 5 * rate(sentry_open_issue_events{project_slug="my-service"}[5m]) > 3
meaning there have been about 3 events during the last 5 minutes.
Maybe we could have separate metrics that does not include the issue_last_seen
label?
Or maybe two different metrics: one with only issue_first_seen
and another with only issue_last_seen
?
I remember that prometheus discourages using the metrics for "logging", and one could argue issue_last_seen
goes into the direction of logging: apart from the value of the metrics (the number of events) it provides an additional, variable information. Below are some references to consider, not sure how much applicable to this case:
https://github.com/prometheus/prometheus/issues/3780#issuecomment-362223310 https://www.robustperception.io/existential-issues-with-metrics https://prometheus.io/docs/practices/instrumentation/#do-not-overuse-labels
Ok, I see. The anti-loggern pattern is a problem, that's true. As "first seen" does not change in the error lifetime, leaving out just last seen would be enough. As we still need this for a table view, not the graph, I thought about making it configurable via the GET call to the metrics endpoint. Like &include_labels=last_seen So if this flag is set, the label is included, default would be false/no other labels than the default ones.
So the "default" case then is working for alarming, rates and whatever and if this flag is set, the given labels are provided. You think this is a good idea?
Yes, I think that will work. I guess you will make the extra labels configurable through an env variable setting like SENTRY_INCLUDE_EXTRA_LABELS=issue_last_seen,...
?
Other related questions:
environment
available)?issue_last_seen
is not continuous https://github.com/ujamii/prometheus-sentry-exporter/issues/5#issuecomment-495326507?@mgrundkoetter - have you had time to look into this?
It seems also that I see why the metrics is non-continuous. I'm getting often "Context deadline exceeded" from the prometheus-sentry-exporter target meaning my global scrape_timeout
https://github.com/prometheus/prometheus/issues/4561 of 10s is too low. I can see that 10s is a typical time to fetch the events time curl -L localhost/metrics
. It returns about 400 sentry_open_issue_events
every time.
No, I had no time to change the code, yet. But you can adjust the PR if you like. I'm on vacation now, so maybe I can look at it by the end of June. My scrape interval and timeout is set to 30s, which seems sufficient.
You can drop this labels during scrape with config like this:
metric_relabel_configs:
- action: labeldrop
regex: issue_first_seen|issue_last_seen
I guess if it is that easy to drop the label during scrape, I can close the issue.
I'm trying to calculate rates of errors based on
sentry_open_issue_events
, but the metrics is non-continuous. I guess due to this I'm unable to calculate rates, the closest thing to a rate I got istopk(3, sentry_open_issue_events - ignoring(issue_last_seen, instance, pod) sentry_open_issue_events offset 10m)
, but this is still non-continuous. Do you have an opinion about this? What is you use case forsentry_open_issue_events
?It looks like the problem be due to the
issue_last_seen
label. Every time a new series is created for a newissue_last_seen
. After removingissue_last_seen
https://github.com/ujamii/prometheus-sentry-exporter/blob/872f7f7834bf87c74b77c432a59cbd22595a40f8/src/SentryExporter.php#L64 I can calculate rates, e.g.topk(3, rate(sentry_open_issue_events{project_slug="my-service"}[5m])) > 0
, but when I click a couple of time on prometheus Console sometimes I'm getting "no data" for even a simpletopk(3, sentry_open_issue_events{project_slug="my-service"})
query. Could you verify this, and possibly removeissue_last_seen
?