Open srueg opened 3 years ago
Personal opinion incoming:
Not sure if it's something that k8up should handle. Alerting rules are very use-case specific, and the use case you have may not apply to all users. The rules themselves have been out of scope; we only provided a set of default rules, which are now integrated in the component.
The component handles the alerting rules already, so maybe invert the requirement. The component could be enhanced in such a way that you define the schedules there, and it creates both the schedules and the correct rule for each schedule.
The idea behind this is that the current default rules are more or less unusable: they produce too many false positives and, as you mentioned, they don't fit different use cases (e.g. hourly vs. daily schedules). Since the operator knows about the schedules, it would have the necessary information to create useful rules. I understand that it might be a stretch to make it the operator's problem, but then again I think it would be a very common use case, since a lot of users want alerts for failing/skipped backups.
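To illustrate what "useful rules" derived from the schedule could look like, here is a minimal sketch. It assumes the operator can work out the expected interval between runs for each schedule. The metric name, the `Schedule` type, the `schedule` label and the grace factor are all hypothetical and not taken from K8up.

```python
from dataclasses import dataclass

# Hypothetical metric name, used only for illustration; it is not a K8up metric.
LAST_SUCCESS_METRIC = "example_backup_last_success_timestamp_seconds"


@dataclass
class Schedule:
    """Hypothetical, simplified view of a backup schedule."""
    name: str
    namespace: str
    interval_seconds: int  # expected time between two backup runs


def build_alert_rule(schedule: Schedule, grace_factor: float = 1.5) -> dict:
    """Return one alerting rule whose threshold scales with the schedule interval."""
    threshold = int(schedule.interval_seconds * grace_factor)
    selector = f'namespace="{schedule.namespace}",schedule="{schedule.name}"'
    return {
        "alert": "BackupOverdue",
        # Fires when the last successful backup is older than the threshold.
        "expr": f"time() - {LAST_SUCCESS_METRIC}{{{selector}}} > {threshold}",
        "labels": {"severity": "warning"},
        "annotations": {
            "summary": f"No successful backup for schedule {schedule.name} within {threshold}s",
        },
    }


hourly = build_alert_rule(Schedule("hourly-backup", "demo", 3600))
daily = build_alert_rule(Schedule("daily-backup", "demo", 86400))
print(hourly["expr"])
print(daily["expr"])
```

With a threshold tied to the interval, an hourly schedule alerts after 90 minutes without a successful run and a daily one after 36 hours, instead of both sharing a single fixed window.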
Implementing it in the component would indeed be an option as well. It would then only work for schedules created via the component or its helper Jsonnet lib, and not for others. It might also be quite challenging to implement such logic in Jsonnet.
I'd also suggest we stop calling them defaults and call them examples instead, since, as you've found out, they only work for schedules that run once a day.
We also have to consider that if we add such rules to the schedule API, we'll add a hard dependency on Prometheus Operator and become subject to their API changes.
What if, instead of adding the Prometheus rule spec directly to the schedule spec, we add a reference to a ConfigMap that contains a YAML document for each object to be generated? Similar to Espejo, K8up would then search for special variables and replace them with properties from the schedule spec. The difference is that such a ConfigMap gives you the flexibility you seek, while K8up itself remains unopinionated about which objects are created.
How about a monitoring mixin? It's a unified approach that fits any Prometheus configuration. See node-mixin as an example: https://github.com/prometheus/node_exporter/tree/master/docs/node-mixin
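A rough sketch of that search-and-replace idea, using Python's `string.Template`. The placeholder names (`SCHEDULE_NAME`, `SCHEDULE_NAMESPACE`), the metric name and the shape of the schedule object are assumptions made up for this example; neither K8up nor Espejo defines them this way.

```python
from string import Template

# Hypothetical content of one ConfigMap entry: a YAML document with placeholders.
# The metric name below is illustrative and not a K8up metric.
RULE_TEMPLATE = """\
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: ${SCHEDULE_NAME}-alerts
  namespace: ${SCHEDULE_NAMESPACE}
spec:
  groups:
    - name: ${SCHEDULE_NAME}
      rules:
        - alert: BackupOverdue
          expr: time() - example_backup_last_success_timestamp_seconds{schedule="${SCHEDULE_NAME}"} > 5400
"""


def render(template: str, schedule: dict) -> str:
    """Replace placeholders with properties taken from the schedule object."""
    variables = {
        "SCHEDULE_NAME": schedule["metadata"]["name"],
        "SCHEDULE_NAMESPACE": schedule["metadata"]["namespace"],
    }
    # safe_substitute leaves unknown placeholders untouched instead of raising.
    return Template(template).safe_substitute(variables)


schedule = {"metadata": {"name": "hourly-backup", "namespace": "demo"}}
rendered = render(RULE_TEMPLATE, schedule)
print(rendered)
```

Because the operator only substitutes variables, it never needs to import or validate the Prometheus Operator types; the template could just as well describe a different kind of object.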
Summary
As a K8up user, I want to have Prometheus alerting rules, so that I get notified if backups are failing.
Context
Currently K8up only provides Prometheus metrics but no alerting rules. We have a set of default rules in the Commodore component: https://github.com/projectsyn/component-backup-k8up/blob/master/class/defaults.yml#L99
These rules are too generic and might lead to many false positives or might miss failed backups altogether. Since K8up has all the necessary context to know what could be a good threshold for alerting, it could also generate said alerting rules.
Out of Scope
Further links
Acceptance criteria
Implementation Ideas