ujamii / prometheus-sentry-exporter

Exports sentry project metrics for prometheus.
MIT License

question about non-continuous metrics #5

Closed marcindulak closed 5 years ago

marcindulak commented 5 years ago

I'm trying to calculate rates of errors based on sentry_open_issue_events, but the metric is non-continuous. I guess that is why I'm unable to calculate rates; the closest thing to a rate I got is topk(3, sentry_open_issue_events - ignoring(issue_last_seen, instance, pod) sentry_open_issue_events offset 10m), but this is still non-continuous. Do you have an opinion about this? What is your use case for sentry_open_issue_events?

[Screenshot from 2019-05-20 16-38-23, redacted]

It looks like the problem may be due to the issue_last_seen label: every time issue_last_seen changes, a new time series is created. After removing issue_last_seen https://github.com/ujamii/prometheus-sentry-exporter/blob/872f7f7834bf87c74b77c432a59cbd22595a40f8/src/SentryExporter.php#L64 I can calculate rates, e.g. topk(3, rate(sentry_open_issue_events{project_slug="my-service"}[5m])) > 0, but when I click a couple of times in the Prometheus Console I sometimes get "no data" even for a simple topk(3, sentry_open_issue_events{project_slug="my-service"}) query. Could you verify this, and possibly remove issue_last_seen?

mgrundkoetter commented 5 years ago

We use sentry_open_issue_events to generate alarms if a certain error happens a certain number of times. For instance, if the same error has more than 30 events, we send an alarm via slack/mail/phone. This alarm is for a single issue. Another alarm is triggered when the sum of the events per project rises above a certain threshold; this is done via sum(sentry_open_issue_events) by (project_name). It covers the case where the per-issue threshold is not reached, but the sum across different issues is bad.
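
As a sketch, the two alarms described above could be written as Prometheus alerting rules along these lines (the alert names, thresholds and severity labels are placeholders, not taken from an actual setup):

groups:
  - name: sentry-issues
    rules:
      # Placeholder alert: a single issue has accumulated more than 30 events.
      - alert: SentryIssueEventsHigh
        expr: sentry_open_issue_events > 30
        labels:
          severity: warning
        annotations:
          summary: "Sentry issue {{ $labels.issue_link }} has more than 30 events"
      # Placeholder alert: the total events across all open issues of a project
      # exceed a threshold (100 is an arbitrary example value).
      - alert: SentryProjectEventsHigh
        expr: sum(sentry_open_issue_events) by (project_name) > 100
        labels:
          severity: warning
        annotations:
          summary: "Project {{ $labels.project_name }} has many open issue events"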

What do you mean by "rates of errors"? What do you want to visualize or be alarmed about?

marcindulak commented 5 years ago

By "rate of errors" I mean the rate of events in sentry_open_issue_events.

Correct me if I'm wrong, but to me sentry_open_issue_events looks like a counter (I haven't looked through the code, though). See the graph below of the query sentry_open_issue_events{project_slug="my-service","issue_link"="something"} from the last week.

[Screenshot from 2019-05-23 19-48-58]

Therefore I'm trying to calculate the rate (https://prometheus.io/docs/prometheus/latest/querying/functions/#rate) of that counter, topk(3, rate(sentry_open_issue_events{project_slug="my-service"}[5m])), and present it on a Grafana dashboard. Almost all the time the rates will be zero, but when there is a new event, the issue_last_seen label changes, creating a new time series, and Prometheus won't calculate a rate across two different time series. As a consequence all the rates are zero. Therefore I want to remove issue_last_seen from the labels, so that the time series for a given issue_link continues.

mgrundkoetter commented 5 years ago

We use it to display different graphs in Grafana. One is sum(sentry_open_issue_events) by (project_name), which is good for alarming the project team. Another one lists all open issues in a table with the newest ones on top; hence, we need the last seen label for sorting there. Do you just need to visualize something, or do you want to generate alarms with it? I'm still not really understanding why you need a rate when you can draw the graph with the actual numbers. What would this rate tell you about your application? (I'm just not getting the use case.)

marcindulak commented 5 years ago

Assuming it is a counter, the problem with an ever-increasing counter is that I cannot set up a common alerting rule or visualize it on a common, summary Grafana graph: the counters for different project_slug values may differ by orders of magnitude in the number of sentry_open_issue_events. With rate I can set up an alert such as 60 * 5 * rate(sentry_open_issue_events{project_slug="my-service"}[5m]) > 3, meaning there have been roughly more than 3 events during the last 5 minutes.
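
As a sketch, such a rate-based rule might look like this (the alert name and threshold are placeholders, and it assumes the issue_last_seen label has already been removed so the series stay continuous):

groups:
  - name: sentry-rates
    rules:
      # Placeholder alert: roughly more than 3 events in the last 5 minutes.
      # 60 * 5 * rate(...) scales the per-second rate back to events per 5 minutes.
      - alert: SentryEventRateHigh
        expr: 60 * 5 * rate(sentry_open_issue_events{project_slug="my-service"}[5m]) > 3
        labels:
          severity: warning
        annotations:
          summary: "More than ~3 events in 5 minutes for {{ $labels.issue_link }}"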

Maybe we could have a separate metric that does not include the issue_last_seen label? Or maybe two different metrics: one with only issue_first_seen and another with only issue_last_seen?

I remember that Prometheus discourages using metrics for "logging", and one could argue issue_last_seen goes in the direction of logging: apart from the value of the metric (the number of events) it provides additional, variable information. Below are some references to consider; I'm not sure how applicable they are to this case:

https://github.com/prometheus/prometheus/issues/3780#issuecomment-362223310
https://www.robustperception.io/existential-issues-with-metrics
https://prometheus.io/docs/practices/instrumentation/#do-not-overuse-labels

mgrundkoetter commented 5 years ago

Ok, I see. The anti-logging pattern is a problem, that's true. As "first seen" does not change over the error's lifetime, leaving out just "last seen" would be enough. As we still need it for a table view (not the graph), I thought about making it configurable via the GET call to the metrics endpoint, like &include_labels=last_seen. If this flag is set, the label is included; the default would be false, i.e. no labels other than the default ones.

So the "default" case then is working for alarming, rates and whatever and if this flag is set, the given labels are provided. You think this is a good idea?

marcindulak commented 5 years ago

Yes, I think that will work. I guess you will make the extra labels configurable through an environment variable, something like SENTRY_INCLUDE_EXTRA_LABELS=issue_last_seen,...?

Other related questions:

marcindulak commented 5 years ago

@mgrundkoetter - have you had time to look into this?

marcindulak commented 5 years ago

I think I also see why the metric is non-continuous. I'm often getting "context deadline exceeded" from the prometheus-sentry-exporter target, meaning my global scrape_timeout https://github.com/prometheus/prometheus/issues/4561 of 10s is too low. I can see that 10s is a typical time to fetch the events with time curl -L localhost/metrics. It returns about 400 sentry_open_issue_events series every time.
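
A sketch of how the timeout could be raised just for this job in prometheus.yml (the job name and target are assumptions, and 30s is only an example value):

scrape_configs:
  - job_name: sentry-exporter            # example job name
    scrape_interval: 30s                  # example: give the exporter more time
    scrape_timeout: 30s                   # must not exceed scrape_interval
    static_configs:
      - targets: ['sentry-exporter:80']   # example target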

mgrundkoetter commented 5 years ago

No, I haven't had time to change the code yet, but you can adjust the PR if you like. I'm on vacation now, so maybe I can look at it by the end of June. My scrape interval and timeout are set to 30s, which seems sufficient.

gordonbondon commented 5 years ago

You can drop these labels during scrape with a config like this:

metric_relabel_configs:
  - action: labeldrop
    regex: issue_first_seen|issue_last_seen
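
For context, a sketch of where this sits in prometheus.yml, nested under the exporter's scrape job (the job name and target are assumptions); the labels are then removed before ingestion, so the stored series stay continuous:

scrape_configs:
  - job_name: sentry-exporter             # example job name
    static_configs:
      - targets: ['sentry-exporter:80']   # example target
    metric_relabel_configs:
      - action: labeldrop
        regex: issue_first_seen|issue_last_seen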

mgrundkoetter commented 5 years ago

I guess if it is that easy to drop the label during scrape, I can close the issue.