mozilla / opmon

Operational Monitoring (OpMon) 📈
Mozilla Public License 2.0

Alerts table appears to not include correct `parameter` value for `avg_diff` #142

Open ncalexan opened 1 year ago

ncalexan commented 1 year ago

In the documentation, there's an avg_diff example that includes percentiles:

[alerts.historical_diff]
# Deviation from historical data:
# an alert is triggered if the average of the specified window deviates
# from the average of the previous window
type = "avg_diff"
metrics = [  # metrics to monitor
    "memory_total",
]
window_size = 7 # window size in days
max_relative_change = 0.5   # relative change that when exceeded triggers an alert
percentiles = [50, 90]  # percentiles to monitor
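
To make the comments in that config concrete, here is a minimal sketch of the comparison they describe: the average over the latest window against the average over the window before it. This is not OpMon's actual SQL. The function name `avg_diff_alert`, the handling of short or zero-valued history, and the idea that it runs once per monitored percentile series are all assumptions made for illustration.

```python
from statistics import mean


def avg_diff_alert(values, window_size=7, max_relative_change=0.5):
    """Hypothetical helper, not part of OpMon.

    `values` is a list of daily values for one series, oldest first
    (assumed here to be e.g. the daily 50th percentile of a metric).
    Returns True when the mean of the latest window differs from the
    mean of the preceding window by more than `max_relative_change`.
    """
    if len(values) < 2 * window_size:
        # Assumption: without two full windows there is nothing to compare.
        return False

    current = mean(values[-window_size:])
    previous = mean(values[-2 * window_size:-window_size])

    if previous == 0:
        # Assumption: relative change is undefined for a zero baseline,
        # so treat any non-zero current average as a deviation.
        return current != 0

    relative_change = abs(current - previous) / abs(previous)
    return relative_change > max_relative_change
```

With `window_size = 7` and `max_relative_change = 0.5`, a series that doubles from one week to the next would trigger, while a 40% increase would not.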

It's my understanding that each such percentile becomes the parameter value in the alerts table. But when I run this locally, my temporary table does contain alerts, yet every parameter value is NULL.

Or maybe what is happening is that there is a data-modeling mismatch. My data is a sum of binary events: https://github.com/mozilla/metric-hub/blob/90ffefb18f3dfd9ecdf2f6bb9a9bd68e30f3de4e/opmon/firefox-uninstalls.toml#L62-L73. This statistic has a NULL parameter.

But I'm forced to specify percentiles in my avg_diff alert. Or maybe I can drop the percentiles?

This is all very confusing. Alerts are based directly on the underlying metrics rather than on statistics, and they perform subtle calculations without providing intermediate tables to examine (and visualize!). Together, that makes this difficult to understand and use.


ncalexan commented 1 year ago

OK, I've dug deeper into this using #144, and if I'm reading the SQL correctly, I think:

That seems ripe for improvement.