Giant Swarm Product Roadmap
https://github.com/orgs/giantswarm/projects/273

Turtles general alerting review and migration to mimir #3315

Open QuentinBisson opened 4 months ago

QuentinBisson commented 4 months ago

Towards https://github.com/giantswarm/roadmap/issues/3312

Atlas is planning to migrate our monitoring setup to Mimir, targeting CAPI only. This will result in all data being in a single database instead of the current one-Prometheus-per-cluster setup. Current alerts have to be updated because queries will see data for all clusters, MC and WC alike, instead of data for one specific cluster at a time.

We already did a lot of work towards this on the current alerts (removed a lot of deprecated alerts and providers, fixed alerts that clearly were not working, and so on).

By doing so, we discovered a few things about Mimir itself, but also that a chunk of our alerts currently do not work on CAPI (e.g. alerts based on vintage-only components, deprecated and missing metrics, and so on).

To ensure proper monitoring in CAPI and with Mimir, Atlas needs your help!

We would kindly ask all teams to help us out with the following use cases, ordered by priority in case they can't all be performed at once.

0. Create kickoff meetings for each team

1. Test and fix your team's alerts and dashboards on CAPI clusters.

A lot of the alerts we have do not work on CAPI (e.g. cluster-autoscaler, ebs-csi and external-dns), simply because they are flagged behind the "aws" provider only, or because they rely on metrics from vintage components (the cluster_created|upgraded inhibitions). The specific alert issues that were identified will be added to the team issues.
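
As a purely illustrative sketch (the alert name, app label and matcher below are made up, not taken from the real prometheus-rules), this is the kind of pattern that breaks on CAPI: the selector is pinned to the vintage "aws" provider, so on a CAPA cluster it matches nothing and the alert can never fire.

```yaml
# Hypothetical example only; names and label values are assumptions.
groups:
  - name: example.cluster-autoscaler
    rules:
      - alert: ClusterAutoscalerDownExample
        # Pinned to the vintage "aws" provider; on CAPI the provider label
        # carries a different value (e.g. "capa"), so this selector matches
        # nothing and the alert never evaluates to true.
        expr: up{app="cluster-autoscaler", provider="aws"} == 0
        for: 15m
        labels:
          severity: page
        annotations:
          description: cluster-autoscaler has been down for 15 minutes.
```

Fixing it usually means widening the matcher (e.g. `provider=~"aws|capa"`) or dropping the provider gate entirely and relying on metrics that exist on both vintage and CAPI.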

2. Test and fix your team's alerts and dashboards on Mimir.

We currently have Mimir deployed on Golem for alert testing; it is accessible as a datasource in Grafana.

Current knowns and unknowns with Mimir are being written down here by @giantswarm/team-atlas, but feel free to add what you found.

We request a second round of testing for Mimir because Mimir is inherently different from our vintage monitoring setup. First, all metrics will be stored in one central place (we are not enabling multi-tenancy yet). This means that queries will see data for all clusters at once, so alerts and recording rules must be scoped or aggregated per cluster explicitly.
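
As a minimal sketch of that consequence (the metric, app and label names are assumptions), an expression that used to be evaluated by a dedicated per-cluster Prometheus now sees every cluster at once and has to group by the cluster explicitly:

```yaml
# Hypothetical rule; names are placeholders.
- alert: SomeOperatorDownExample
  # On the per-cluster Prometheus this was simply:
  #   count(up{app="some-operator"} == 0) > 0
  # With all clusters in one Mimir database the count must be grouped per
  # cluster, otherwise one outage anywhere fires a single "global" alert
  # without telling you which cluster is affected.
  expr: count(up{app="some-operator"} == 0) by (cluster_id) > 0
  for: 10m
  labels:
    severity: notify
```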

Second, for Grafana Cloud we rely a lot on external labels (labels added by Prometheus when metrics leave the cluster, like installation, provider and so on). Data sent from Mimir to Grafana Cloud will not have those external labels anymore, so recording rule aggregations and joins must contain all external labels in the `on` and `by` clauses (that was mostly done by Atlas, but please review).
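
A hedged sketch of what that means in practice (the record name, metrics and the exact external label set are assumptions, please check them against the real rules):

```yaml
# Hypothetical recording rule; the external label set (cluster_id,
# installation, provider, pipeline) is an assumption, verify the real one.
- record: example:node_cpu_utilisation:avg
  expr: |
    avg by (cluster_id, installation, provider, pipeline) (
      1 - rate(node_cpu_seconds_total{mode="idle"}[5m])
    )
# Joins need the same care: spell the labels out in the matching clause, e.g.
#   ... * on (cluster_id, namespace, pod) group_left (node) kube_pod_info
```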

Third, we know that the alerting link (Prometheus query) in Opsgenie and Slack will not work directly because Mimir does not have a UI per se (hint: it's Grafana). The only way to get this source link back is to migrate to Mimir's Alertmanager, but that's a whole other beast that we cannot tackle right now, so we advise you, for each alert, to try to find a dashboard that can be linked to the alert to help with on-call.
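
One pragmatic pattern for that (a sketch; whether the Opsgenie/Slack templates actually render this annotation is an assumption to verify) is to attach the relevant dashboard to the alert itself as an annotation:

```yaml
# Hypothetical alert; names and the dashboard URL are placeholders.
- alert: SomeComponentDownExample
  expr: up{app="some-component"} == 0
  for: 15m
  labels:
    severity: page
  annotations:
    description: some-component has been down for 15 minutes.
    # The dashboard the on-call engineer should open instead of the (now
    # missing) Prometheus query link.
    dashboard: https://grafana.example.io/d/abc123/some-component-overview
```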

4. Test Grafana Cloud dashboards with golem data

As Mimir data will be sent to Grafana Cloud by a single Prometheus with no external labels, we would like you to ensure that the Grafana Cloud dashboards your team owns work on Golem.
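
A concrete (hypothetical) thing to check, written as YAML for readability even though the real dashboard model is JSON; the variable and label names are assumptions:

```yaml
# Sketch of a dashboard template variable that silently breaks without
# external labels.
templating:
  list:
    - name: installation
      type: query
      # Relies on the "installation" external label being present on every
      # series; once the data arrives without external labels this returns
      # nothing and every panel scoped by $installation goes blank.
      query: label_values(up, installation)
```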

This is currently blocked by https://github.com/giantswarm/roadmap/issues/3159

Further info:

To help you, you can always add alert tests in prometheus-rules; those are great :)
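
For reference, a minimal sketch of such a test in the promtool unit-test format (the rule file path, metric and alert names are placeholders):

```yaml
# Hypothetical test file; adjust paths and names to the real rules.
rule_files:
  - some-component.rules.yml

tests:
  - interval: 1m
    input_series:
      # some-component is up at t=0 and down from minute 1 onwards.
      - series: 'up{app="some-component", cluster_id="abc12"}'
        values: "1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0"
    alert_rule_test:
      - alertname: SomeComponentDownExample
        eval_time: 16m
        exp_alerts:
          - exp_labels:
              severity: page
              app: some-component
              cluster_id: abc12
```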

QuentinBisson commented 4 months ago

@yulianedyalkova and @Rotfuks coming from https://github.com/giantswarm/giantswarm/issues/29551, here are a few things I found that you might want to address in subtasks:

* Is it expected that our inhibitions (cluster_upgrading, cluster_creating) only work in vintage, as they use metrics coming from vintage components? I'm fine if that is the case, but other teams need to be aware.
* Do we want and need a replacement for WorkloadClusterControlPlaneNodeMissingAWS and WorkloadClusterHAControlPlaneDownForTooLong? Maybe we have some for CAPI but I'm not sure I really found equivalents (they could be in Sloth as well, did not check).
* NodeStateFlappingUnderLoad is not working due to some missing labels (ip at least).

weseven commented 4 months ago

> @yulianedyalkova and @Rotfuks coming from giantswarm/giantswarm#29551 here are a few things I found that you might want to address in subtasks:

> * Is it expected that our inhibitions (cluster_upgrading, cluster_creating) only work in vintage as they use metrics coming from vintage components? I'm fine if that is the case but other teams need to be aware

We'll definitely need to find an alternative for CAPI for those inhibitions, otherwise there will be a lot of avoidable alerts. We'll need to put some thought into it. Thanks for pointing it out!

> * Do we want and need a replacement for WorkloadClusterControlPlaneNodeMissingAWS and WorkloadClusterHAControlPlaneDownForTooLong? Maybe we have some for CAPI but I'm not sure I really found equivalents (they could be in Sloth as well, did not check)

We might want to re-evaluate these two. One paged last week but there was no issue in sight: https://gigantic.slack.com/archives/C02HLSDH3DZ/p1710412366462349. I will bring them up in the next kaas alert session.

> * NodeStateFlappingUnderLoad is not working due to some missing labels (ip at least)

Thanks, we will need to fix this :)

weseven commented 3 months ago

> We might want to re-evaluate these two. One paged last week but there was no issue in sight: https://gigantic.slack.com/archives/C02HLSDH3DZ/p1710412366462349. I will bring them up in the next kaas alert session.

Brought this up in today's kaas alert call: we won't update these two alerts for CAPA, and the alert seems a bit unreliable on vintage (the last occurrences were false positives, e.g. #1, #2). cc @yulianedyalkova

Rotfuks commented 1 month ago

Slightly adapted the scope of this ticket to be more focused on the alert and dashboard review, which is more important for the migration to Mimir. So only the review of the Grafana Cloud dashboards is relevant here now.