Open dvyukov opened 1 month ago
GCP offers alerting functionality: https://cloud.google.com/monitoring/alerts (alerts can also be based on logs: https://cloud.google.com/logging/docs/alerting/log-based-alerts).
It would be nice to figure out how to keep these settings in the git repository and be able to (re-)deploy them without having to go through the Cloud web UI.
If we want to rely on grepping logs, I think for reliable parsing we will need a new interface along the lines of:
package log
func Metricf(typ Metric, description string, args ...any)
type Metric string
Otherwise matching logs is unreliable.
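To make the proposal above concrete, here is a minimal sketch of what such an interface could look like. It is only an illustration under assumptions: the `METRIC` prefix, the metric names, the `formatMetric` helper and the `output` variable are all hypothetical, and it is written as `package main` (rather than `package log`) purely so it can be run standalone. The idea is that every metric record is a single line with a fixed, machine-matchable prefix and a type drawn from a closed set of constants, so a log-based filter does not depend on free-form message text.

```go
package main

import (
	"fmt"
	"io"
	"os"
	"strings"
)

// Metric identifies a kind of event. Using a fixed set of values
// (instead of free-form text) is what makes matching reliable.
type Metric string

// Hypothetical metric names, for illustration only.
const (
	MetricDashboardError Metric = "dashboard_error"
	MetricReproFailed    Metric = "repro_failed"
)

// metricPrefix is an assumed marker that a log filter could match on.
const metricPrefix = "METRIC"

// output is where metric records are written (stderr in this sketch).
var output io.Writer = os.Stderr

// formatMetric renders one single-line record:
// "METRIC <type>: <formatted description>".
func formatMetric(typ Metric, description string, args ...any) string {
	msg := fmt.Sprintf(description, args...)
	// Keep the record on one line so a line-oriented matcher sees all of it.
	msg = strings.ReplaceAll(msg, "\n", " ")
	return fmt.Sprintf("%s %s: %s", metricPrefix, typ, msg)
}

// Metricf logs a metric event of the given type.
func Metricf(typ Metric, description string, args ...any) {
	fmt.Fprintln(output, formatMetric(typ, description, args...))
}

func main() {
	Metricf(MetricDashboardError, "request %v failed: %v", "/api", "timeout")
}
```

With something like this, an alert rule would match on the prefix and the metric type only, and the description stays free for humans to read.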
And, yes, it would be good to persist rules somewhere.
We have lots of health indicators that can be evaluated only over a time period, for example:
None of these can be diagnosed from a single instant (a single dashboard error or a single repro failure may be ignored), and currently we don't monitor any of them (besides randomly wandering around).
We should collect data for these and at least visualize it (e.g. rate of successful/failed bug reproductions, dashboard errors per day), and ideally alert on sudden changes. Some alerts may be based on a threshold (easier, e.g. >100 dashboard errors/day).