Mimir Migration Readiness for CAPA: Alerting and Dashboard Reviews

QuentinBisson commented 6 months ago

Atlas is planning to migrate our monitoring setup to mimir targetting CAPI only. This will result in all data being in a single database, instead of the current one-prometheus-per-cluster setup. Current alerts have to be updated as queries will see all data for all clusters, MC and WC alike, instead of data for one specific cluster at a time.

We already did a lot of work towards this on the current alerts (removed a lot of deprecated alerts and providers, fixed alerts that clearly were not working an so on).

By doing so, we discovered a few things about Mimir itself but also that a chunk of our alert currently do not work on CAPI (e.g. based on vintage only components, deprecated and missing metrics an so on).

To ensure proper monitoring in CAPI and with Mimir, Atlas needs your help!

We would kindly ask all teams to help us out for the following use-cases, ordered in terms of priorities if they can't be performed all at once.

0. Create kickoff meetings for each teams

[x] Phoenix kick off meeting done - To be scheduled
[x] Turtles kick off meeting done - To be scheduled
[ ] Shield kick off meeting done - To be scheduled
[ ] Honeybadger kick off meeting done - To be scheduled
[ ] BigMac kick off meeting done - To be scheduled
[ ] Cabbage kick off meeting done - To be scheduled

1. Test and fix your teams alerts and dashboards on CAPI clusters.

A lot of the alerts we have do not work on CAPI (e.g. cluster-autoscaler, ebs-csi and external-dns) simply because they are flagged behind the "aws" provider only, or because they rely on metrics of vintage components (cluster_created|upgraded inhibitions). The specific alerts issue that were identified will be added to the team issues.

[x] Phoenix finished reviewing their alerts in prometheus-rules and sloth-rules on CAPI
[x] Turtles finished reviewing their alerts in prometheus-rules and sloth-rules on CAPI
[x] Shield finished reviewing their alerts in prometheus-rules and sloth-rules on CAPI
[x] Atlas finished reviewing their alerts in prometheus-rules and sloth-rules and dashboards on CAPI
[x] Honeybadger finished reviewing their alerts in prometheus-rules and sloth-rules and dashboards on CAPI
[x] BigMac finished reviewing their alerts in prometheus-rules and sloth-rules and dashboards on CAPI
[x] Cabbage finished reviewing their alerts in prometheus-rules and sloth-rules and dashboards on CAPI

2. Test and fix your teams alerts and dashboards on Mimir.

We currently have Mimir deployed on Golem for testing of alerts accessible as a datasource in grafana.

Current known/unknown with Mimir are behing written here by @giantswarm/team-atlas but feel free to add what you found.

We request a second round of testing for Mimir because Mimir in inherently different from our vintage monitoring setup. First,all metrics will be stored in one central place (we are not enabling multi-tenancy yet). This means that:

No two alerts should have the same name
All aggregations (sum, count, ...) must at the very least have cluster_id in the by clause
All joins must at the very least have cluster_id in the on clause
the promql absent function should be use carefully because this function renders an empty vector so having it empty for all clusters in a MC seems relatively impossible. If you target 1 cluster in particular, this could work (cluster_type="management_cluster" for example but we think it's best to rely on other mechanisms)

Second, for grafana cloud, we rely a lot on external labels (labels added by prometheus when metrics leave the cluster like installation, provider and so on) but data sent from mimir to grafana cloud will not have those external labels anymore so recording rules aggregations and join must contain all eternal labels in the on and by clauses (that was mostly done by atlas but please review)

Third, we know that the alerting link (prometheus query) in opsgenie and slack will not work directly because Mimir does not have a UI per se (hint: it's grafana). The only way to have this source link back is to migrate to mimir's alertmanager but that's a whole over beast that we cannot tacke right now so we advise you, for each alert, to try to find a dashboard can be linked to the alert to help with oncall.

[x] Phoenix finished reviewing their alerts in prometheus-rules and sloth-rules and dashboards on Mimir
[x] Turtles finished reviewing their alerts in prometheus-rules and sloth-rules and dashboards on Mimir
[x] Shield finished reviewing their alerts in prometheus-rules and sloth-rules and dashboards on Mimir
[x] Atlas finished reviewing their alerts in prometheus-rules and sloth-rules and dashboardson Mimir
[x] Honeybadger finished reviewing their alerts in prometheus-rules and sloth-rules on Mimir
[x] BigMac finished reviewing their alerts in prometheus-rules and sloth-rules and dashboards on Mimir
[x] Cabbage finished reviewing their alerts in prometheus-rules and sloth-rules and dashboards on Mimir

3. Move away from the old slo framework towards sloth

We deprecated the old slo dashboard a while ago in favor of sloth but teams are not really using it. We would love if you could replace the old slo alerts with sloth-based ones.

[ ] Phoenix migrated away from the old slo framework using sloth (tested in both CAPI and Mimir)
[ ] Turtles migrated away from the old slo framework (tested in both CAPI and Mimir)
[x] Shield migrated away from the old slo framework (tested in both CAPI and Mimir)
[x] Atlas migrated away from the old slo framework (tested in both CAPI and Mimir)
[ ] Honeybadger migrated away from the old slo framework (tested in both CAPI and Mimir)
[x] BigMac migrated away from the old slo framework (tested in both CAPI and Mimir)
[x] Cabbage migrated away from the old slo framework (tested in both CAPI and Mimir)

4. Test Grafana Cloud dashboards with golem data

As mimir data will be sent to grafana cloud by a single prometheus with no external labels, we would like you to ensure the grafana cloud dashboard that your team owns work on golem.

This is currently blocked by https://github.com/giantswarm/roadmap/issues/3159

[ ] Phoenix fixed their grafana cloud dashboards
[ ] Turtles fixed their grafana cloud dashboards
[ ] Shield fixed their grafana cloud dashboards
[x] Atlas fixed their grafana cloud dashboards
[ ] Honeybadger fixed their grafana cloud dashboards
[ ] BigMac fixed their grafana cloud dashboards
[ ] Cabbage fixed their grafana cloud dashboards

5. Move all apps (latest versions) to service monitors

Towards closing this https://github.com/giantswarm/giantswarm/issues/27145

There are still some leftovers (although not a lot) that still need to use a service monitor. Without this, we will not be able to tear down our Prometheus stack.

This is not that much of a priority but the effort should be rather small and easy to finish so feel free to pick this up

To easily find out what is not monitored via service monitors, you can connect to a MC and WC prometheus using opsctl open -i -a prometheus --workload-cluster= and check out the targets page. If they are there (be careful to also check out the workload section), they need a servicemonitor :)

[x] Phoenix added their missing service monitors
[x] Turtles added their missing service monitors
[ ] Shield added their missing service monitors
[x] Atlas fixed their missing service monitors
[x] Honeybadger added their missing service monitors
[x] BigMac added their missing service monitors
[x] Cabbage added their missing service monitors

We will of course be here to help you for the migration :)

Further info:

To help you, you can always add alert tests in prometheus-rules, those are great :)

### Tasks
- [ ] https://github.com/giantswarm/roadmap/issues/3314
- [ ] https://github.com/giantswarm/roadmap/issues/3315
- [ ] https://github.com/giantswarm/roadmap/issues/3316
- [ ] https://github.com/giantswarm/roadmap/issues/3317
- [ ] https://github.com/giantswarm/roadmap/issues/3318
- [ ] https://github.com/giantswarm/roadmap/issues/3319
- [ ] https://github.com/giantswarm/roadmap/issues/3320

QuentinBisson commented 3 months ago

Atlas is done @Rotfuks

QuentinBisson commented 3 months ago

Step 1 and 2 were done for all teams. Do we want to create séparate issues about the other topics and close hère?

Rotfuks commented 3 months ago

Yeah, makes sense. I'll create some follow ups and then clean these up.

Rotfuks commented 3 months ago

This is mostly being done, every open point is being moved to individual topics like

the slo framework: giantswarm/giantswarm#31095
service monitors: giantswarm/giantswarm#31094

Rotfuks commented 3 months ago

So this can be closed now as being done with some additional scope being cut out.

giantswarm / roadmap