Ensure alerting and recording rules are correctly evaluated by the mimir ruler

QuentinBisson commented 9 months ago

Towards https://github.com/giantswarm/roadmap/issues/3039

Let's validate our current recording and alerting rules on Mimir.

Depends on https://github.com/giantswarm/roadmap/issues/3127

QuantumEnigmaa commented 8 months ago

Logs from mimir-ruler on golem:

ts=2024-02-27T14:41:47.305278335Z caller=ruler.go:564 level=info msg="syncing rules" reason=periodic

Rule syncing seems to be OK according to the logs, but as mentioned here, queries in Grafana that use Mimir recording rules (with either the Prometheus or the Mimir datasource) are not working: the query result is empty ("no data").
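To rule out Grafana as the cause, the same recording rule can be queried directly against Mimir's Prometheus-compatible API. Below is a minimal sketch that only builds the request. It assumes the default `/prometheus` HTTP prefix and a multi-tenant setup using the `X-Scope-OrgID` header. The base URL, tenant name, and recording rule name are hypothetical placeholders, not values from this installation.

```python
from urllib.parse import urlencode


def build_instant_query(base_url, tenant, expr):
    """Return (url, headers) for an instant query against Mimir.

    Assumes the default Prometheus HTTP prefix (/prometheus); adjust it
    if the installation overrides that prefix.
    """
    query_string = urlencode({"query": expr})
    url = f"{base_url.rstrip('/')}/prometheus/api/v1/query?{query_string}"
    # Mimir selects the tenant from this header when multi-tenancy is on.
    headers = {"X-Scope-OrgID": tenant}
    return url, headers


# Hypothetical values, for illustration only.
url, headers = build_instant_query(
    "http://mimir.example/", "example-tenant", "job:up:sum"
)
print(url)
print(headers)
```

If this request returns series while Grafana shows "no data", the datasource configuration (for example the tenant header) is the more likely culprit. If it returns nothing, the ruler is probably not writing the recorded series for that tenant.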

QuentinBisson commented 8 months ago

Reviewed alerts from vpa to operator-kit

Here are a few comments so we don't forget:

Atlas:

SlothDown: Why is it so different from CadvisorDown and PrometheusOperatorDown?
PrometheusCantCommunicateWithKubernetesAPI: this alert can also trigger for promtail and grafana agent...

Needs fixing:

ServiceLevelBurnRateTooHigh: I suspect the recording rules
Ensure PrometheusMissingGrafanaCloud, PrometheusFailsToCommunicateWithRemoteStorageAPI and PrometheusRuleFailures work with the new setup
All aggregation metrics (they need at least the cluster_id and installation labels)
absent metrics

QuentinBisson commented 8 months ago

Alerting rules:

[x] apiserver.management-cluster.rules.yml
[x] apiserver.workload-cluster.rules.yml
[x] app.rules.yml
[x] aws-load-balancer-controller.rules.yml
[x] aws.management-cluster.rules.yml
[x] aws.workload-cluster.rules.yml
[x] calico.rules.yml
[x] capa.management-cluster.rules.yml
[x] capi-cluster.rules.yml
[x] capi-kubeadmcontrolplane.rules.yml
[x] capi-machine.rules.yml
[x] capi-machinedeployment.rules.yml
[x] capi-machinepool.rules.yml
[x] capi-machineset.rules.yml
[x] capi.management-cluster.rules.yml
[x] cert-manager.rules.yml
[x] certificate.all.rules.yml
[x] certificate.management-cluster.rules.yml
[x] certificate.workload-cluster.rules.yml
[x] chart.rules.yml
[x] cilium.rules.yml
[x] cluster-autoscaler.rules.yml
[x] cluster-service.rules.yml
[x] configmap.management-cluster.rules.yml
[x] configmap.workload-cluster.rules.yml
[x] coredns.rules.yml
[x] credentiald.rules.yml
[x] crossplane.rules.yml
[x] crsync.rules.yml
[x] daemonset.management-cluster.rules.yml
[x] deployment.management-cluster.rules.yml
[x] deployment.workload-cluster.rules.yml
[x] dex.rules.yml
[x] disk.management-cluster.rules.yml
[x] disk.workload-cluster.rules.yml
[x] dns-operator-azure.rules.yml
[x] docker.rules.yml
[x] elasticsearch.rules.yml
[x] etcd.management-cluster.rules.yml
[x] etcd.workload-cluster.rules.yml
[x] etcdbackup.rules.yml
[x] external-dns.rules.yml
[x] external-secrets.rules.yml
[x] fairness.rules.yml
[x] falco.rules.yml
[x] fluentbit.rules.yml
[x] flux.rules.yml
[x] grafana.management-cluster.rules.yml
[x] helm.rules.yml
[x] ingress-controller.rules.yml
[x] inhibit.all.rules.yml
[x] inhibit.management-cluster.rules.yml
[x] inhibit.prometheus-agent.rules.yml
[x] job.rules.yml
[x] keda.rules.yml
[x] kiam.rules.yml
[x] kong.rules.yml
[x] kube-state-metrics.rules.yml
[x] kubelet.management-cluster.rules.yml
[x] kubelet.workload-cluster.rules.yml
[x] kyverno.all.rules.yml
[x] linkerd.deployment.rules.yml
[x] loki.all.rules.yml
[x] managed-logging.rules.yml
[x] management-cluster.rules.yml
[x] microendpoint.rules.yml
[x] mimir.rules.yml
[x] net-exporter.rules.yml
[x] network.all.rules.yml
[x] node-exporter.all.rules.yml
[x] node.management_cluster.rules.yml
[x] node.workload_cluster.rules.yml

Recording rules:

[x] grafana-cloud.rules.yml
[x] gs-managed-app-deployment-status.rules.yml
[x] helm-operations.rules.yml
[x] kube-prometheus-mixins.rules.yml
[x] kubernetes-mixins.rules.yml
[x] loki-mixins.rules.yml
[x] mimir-mixins.rules.yml
[x] service-level.rules.yml
[x] tempo-mixins.rules.yml

Plus Sloth rules

QuentinBisson commented 7 months ago

@giantswarm/team-atlas I think I will close this issue as the main things have been fixed.

What is left to fix once the remaining PRs are merged:

All aggregations (sum, avg, and so on) and joins (on). Those need at least the cluster_id label.
Alerts that use absent, but I'm not sure how to fix those. @hervenicol maybe you have an idea?
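As a starting point for reviewing the two leftover items above, here is a rough sketch of a helper that flags rule expressions whose aggregations drop the cluster_id label or that use absent. It is a regex heuristic, not a PromQL parser: it only looks at a handful of aggregation operators, does not inspect `on(...)` joins, and can be fooled by parentheses inside label values. The function name and the metric names in the usage lines are made up for illustration.

```python
import re

REQUIRED_LABEL = "cluster_id"

# Prefix form: sum by (a, b) (...)   Plain form: sum(...)
_AGG = re.compile(
    r"\b(sum|avg|min|max|count)\b\s*(?:(by|without)\s*\(([^)]*)\)\s*)?\("
)
# Trailing form: sum(...) by (a, b)
_TRAILING = re.compile(r"\s*(by|without)\s*\(([^)]*)\)")
_ABSENT = re.compile(r"\babsent(?:_over_time)?\s*\(")


def _matching_paren(expr, open_idx):
    """Index of the ')' closing the '(' at open_idx, or -1 if unbalanced."""
    depth = 0
    for i in range(open_idx, len(expr)):
        if expr[i] == "(":
            depth += 1
        elif expr[i] == ")":
            depth -= 1
            if depth == 0:
                return i
    return -1


def review_expr(expr):
    """Return a list of findings for one rule expression."""
    findings = []
    if _ABSENT.search(expr):
        findings.append("uses absent: needs manual review")
    for m in _AGG.finditer(expr):
        op, mode, labels = m.group(1), m.group(2), m.group(3)
        if mode is None:
            # No prefix clause: look for a trailing by/without clause.
            close = _matching_paren(expr, m.end() - 1)
            trailing = _TRAILING.match(expr, close + 1) if close != -1 else None
            if trailing:
                mode, labels = trailing.group(1), trailing.group(2)
        names = {name.strip() for name in (labels or "").split(",") if name.strip()}
        keeps_label = (mode == "by" and REQUIRED_LABEL in names) or (
            mode == "without" and REQUIRED_LABEL not in names
        )
        if not keeps_label:
            findings.append(f"{op}: result drops {REQUIRED_LABEL}")
    return findings


print(review_expr("sum(rate(http_requests_total[5m]))"))
print(review_expr("sum by (cluster_id, installation) (up)"))
print(review_expr('absent(up{job="example"})'))
```

Running it over every `expr` field in the rule files would give each team a first list of expressions to look at. Anything it flags still needs a human decision, especially the absent alerts.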

My idea is to create a migration issue that references one issue per team (us included), asking each team to review their alerts, test that the alert expressions work on golem with an MC and a WC deployed, and make sure all their apps are using service monitors.

What do you think about this?

I will write a draft for the issue description on Thursday and ask for your feedback.

QuentinBisson commented 7 months ago

Extra fixes https://github.com/giantswarm/prometheus-rules/pull/1060

QuentinBisson commented 7 months ago

Migration tracking issue https://github.com/giantswarm/roadmap/issues/3312

QuentinBisson commented 7 months ago

The rule fixes are getting released: https://github.com/giantswarm/prometheus-rules/pull/1063. I consider this issue closed.

giantswarm / roadmap

Ensure alerting and recording rules are correctly evaluated by the mimir ruler #3157