giantswarm / roadmap

Giant Swarm Product Roadmap
https://github.com/orgs/giantswarm/projects/273
Apache License 2.0

Alerts bound to apps not teams #1544

Open teemow opened 1 year ago

teemow commented 1 year ago

Instead of writing alert rules per team, use alert routing so that alerts stay bound to apps, not teams.

Problem: Our alerts are based on team names. Whenever the ownership of a component or a team name changes, we have to change the team label in the app and also the alert rule.

It would be much better if we could route the alert based on the team and keep the rules more generic, or rather specific to an app.

Example:

    - alert: ManagementClusterContainerIsRestartingTooFrequentlyAWS
      expr: increase(kube_pod_container_status_restarts_total{label_application_giantswarm_io_team="phoenix"}[1h]) > 6
      for: 5m
      labels:
        area: kaas
        severity: page
        team: phoenix

The above alert rule can imo be provider independent if we define it like this:

    - alert: ManagementClusterContainerIsRestartingTooFrequently
      expr: increase(kube_pod_container_status_restarts_total[1h]) > 6
      for: 5m
      labels:
        area: kaas
        severity: page
        team: "{{ $labels.label_application_giantswarm_io_team }}"

I haven't tested this, but imo a solution along these lines should be possible.
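One way to check the idea before rolling it out might be a `promtool test rules` file. The following is an untested sketch: it assumes the generic rule is saved in a hypothetical `rules.yml` (with the template value quoted so it parses as YAML), and the `pod` and `container` labels and the sample values are made up for illustration.

```yaml
# Hypothetical test file, run with: promtool test rules alerts_test.yml
rule_files:
  - rules.yml            # assumed file containing the generic rule

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      # One restart per minute for two hours, owned by team "phoenix".
      - series: 'kube_pod_container_status_restarts_total{pod="example", container="app", label_application_giantswarm_io_team="phoenix"}'
        values: '0+1x120'
    alert_rule_test:
      - eval_time: 2h
        alertname: ManagementClusterContainerIsRestartingTooFrequently
        exp_alerts:
          - exp_labels:
              area: kaas
              severity: page
              # Expected to be filled in from the series label by the template.
              team: phoenix
              pod: example
              container: app
              label_application_giantswarm_io_team: phoenix
```

If the templated `team` label behaves as hoped, the expected alert should match. A series without the team label would end up with an empty `team`, so a fallback route would still be needed for those.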

We can still build alert rules that are more specific to certain aspects (like container names or other labels), but those should be based on apps and not on teams. The lifecycle of an alert rule is bound to an application or component, not to the lifecycle of our teams.
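On the routing side, a minimal Alertmanager sketch could look like the following. It assumes the `team` label on the alert is what drives routing. The receiver names and the second team (`example-team`) are hypothetical, and the real receivers would need their own notification settings.

```yaml
# Sketch only: receiver names and "example-team" are assumptions.
route:
  receiver: default              # fallback when no team matches
  group_by: [alertname]
  routes:
    - matchers:
        - team = "phoenix"
      receiver: team-phoenix
    - matchers:
        - team = "example-team"
      receiver: team-example

receivers:
  - name: default
  - name: team-phoenix
  - name: team-example
```

With this split, an ownership change only touches the team label on the app, and a team rename only touches the routing tree. The alert rules themselves stay as they are.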

Slack: https://gigantic.slack.com/archives/C04TGHDEF/p1666272777347089

teemow commented 1 year ago

@TheoBrigitte I had this discussion before with Ross, and we never managed to change the alert rules towards apps and away from teams. I think the approach described above should work and would save us a lot of work in the future. It isn't super important, but it's a good thing to try out, and if it works we should explain it to the other teams. Our rules would be much cleaner and easier to maintain.