Closed QuentinBisson closed 7 months ago
Logs from mimir-ruler on golem:
ts=2024-02-27T14:41:47.305278335Z caller=ruler.go:564 level=info msg="syncing rules" reason=periodic
The logs look fine, but as mentioned here, Grafana queries that use Mimir recording rules (with either the Prometheus or the Mimir datasource) are not working: they return an empty result ("no data").
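To reproduce the "no data" symptom outside Grafana, one option is to query the recording rule directly through the Prometheus-compatible HTTP API and check whether the result set is empty. The sketch below is only an illustration: the base URL and the rule name are placeholders, and the helper names `build_query_url` and `has_data` are hypothetical, not part of any Mimir tooling.

```python
import json
from urllib.parse import urlencode


def build_query_url(base_url: str, expr: str) -> str:
    """Build an instant-query URL for a Prometheus-compatible HTTP API.

    base_url is an assumption: it must already include whatever prefix
    the installation serves the Prometheus API under.
    """
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': expr})}"


def has_data(response_body: str) -> bool:
    """Return True if a Prometheus-style query response holds at least one result."""
    payload = json.loads(response_body)
    if payload.get("status") != "success":
        return False
    # An empty "result" list is what Grafana renders as "no data".
    return len(payload.get("data", {}).get("result", [])) > 0


# Example with canned response bodies (no network access needed):
empty = '{"status": "success", "data": {"resultType": "vector", "result": []}}'
filled = (
    '{"status": "success", "data": {"resultType": "vector", '
    '"result": [{"metric": {"cluster_id": "golem"}, "value": [1709044907, "1"]}]}}'
)
print(has_data(empty), has_data(filled))
```

Fetching the URL (for example with `urllib.request` or `curl`) and passing the body to `has_data` separates a rule that produces no series from a datasource problem in Grafana.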
Reviewed alerts from vpa to operator-kit
Here are a few comments so we don't forget:
Atlas:
- SlothDown: Why is it so different from CadvisorDown and PrometheusOperatorDown?
- PrometheusCantCommunicateWithKubernetesAPI: this alert can also trigger for promtail and grafana agent...
- Needs fixing: PrometheusMissingGrafanaCloud, PrometheusFailsToCommunicateWithRemoteStorageAPI and PrometheusRuleFailures work with new setup
- Plus Sloth rules
@giantswarm/team-atlas I think I will close this issue as the main things have been fixed.
What is left to fix once the remaining PRs are merged:
- aggregations (sum, avg, and so on) and join metrics on. Those need at least the cluster_id label
- absent, but I'm not sure how to fix that. @hervenicol maybe you have an idea?

My idea is to create a migration issue that would reference one issue per team (us included) to review their alerts, test that the alert expressions work on golem with an MC and a WC deployed, and make sure all their apps are using service monitors.
What do you think about this?
I will write a draft of the issue description on Thursday and ask for your feedback.
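For the per-team review of alert expressions, a rough first pass could flag `by (...)` and `on (...)` clauses that leave out the cluster_id label. The sketch below is a naive, regex-based check and not a PromQL parser; the function name `clauses_missing_label` is hypothetical. It does not catch aggregations without any `by` clause (which drop every label), nor `absent(...)` expressions.

```python
import re

# Matches the label list of `by (...)` and `on (...)` clauses.
# This is a text heuristic, not a real PromQL parser.
CLAUSE_RE = re.compile(r"\b(by|on)\s*\(([^)]*)\)")


def clauses_missing_label(expr: str, label: str = "cluster_id") -> list[str]:
    """Return the by/on clauses in expr whose label list lacks the given label."""
    missing = []
    for match in CLAUSE_RE.finditer(expr):
        keyword, raw_labels = match.group(1), match.group(2)
        labels = [name.strip() for name in raw_labels.split(",") if name.strip()]
        if label not in labels:
            missing.append(f"{keyword} ({', '.join(labels)})")
    return missing


# Illustrative expressions (metric names chosen only as examples):
print(clauses_missing_label("sum by (namespace) (rate(foo_total[5m]))"))
print(clauses_missing_label(
    "sum by (cluster_id, namespace) (up) * on (cluster_id, pod) group_left kube_pod_info"
))
```

Any expression that reports a clause here would still need to be run against golem to confirm how it actually behaves.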
Migration tracking issue https://github.com/giantswarm/roadmap/issues/3312
The rule fixes are getting released in https://github.com/giantswarm/prometheus-rules/pull/1063. I consider this issue closed.
Towards https://github.com/giantswarm/roadmap/issues/3039 Let's validate our current recording and alerting rules on mimir.
Depends on https://github.com/giantswarm/roadmap/issues/3127