kubernetes-monitoring / kubernetes-mixin

A set of Grafana dashboards and Prometheus alerts for Kubernetes.
Apache License 2.0

KubeAPIDown not working as intended if Prometheus targets a set of clusters #825

Closed: thunko closed this issue 3 weeks ago

thunko commented 1 year ago

hi,

I get the following rule when generating Prometheus alerts for the kube-apiserver:

- "alert": "KubeAPIDown"
    "annotations":
      "description": "KubeAPI has disappeared from Prometheus target discovery."
      "runbook_url": "https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubeapidown"
      "summary": "Target disappeared from Prometheus target discovery."
    "expr": |
      absent(up{job="kube-apiserver"} == 1)
    "for": "15m"
    "labels":
      "severity": "critical"

The issue I'm running into is that my Prometheus instance reads data from several clusters, so if I add this rule it doesn't work as intended: the alert will not trigger as long as any kube-apiserver in any cluster is up.
I could create a rule for each cluster, but I'd like to avoid hard-coding cluster names.
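
For example, a hard-coded per-cluster variant would look something like this (assuming each target carries a cluster label; "cluster-a" is just a placeholder), and I would need one copy per cluster:

- "alert": "KubeAPIDown"
    # "cluster-a" is a placeholder; assumes every target carries a cluster label
    "expr": |
      absent(up{job="kube-apiserver", cluster="cluster-a"} == 1)
    "for": "15m"
    "labels":
      "severity": "critical"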

Have you run into a similar situation, and what would you suggest for such a use case? Thank you.

zoftdev commented 2 months ago

+1. The best approach, I think, is to compare against history: if an apiserver that was present before disappears, raise the alert.

Another technique is to move only the "up" rule out into a separate group and deploy that group per cluster. That way we have a set of common rules plus a per-cluster rule.
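
A rough PromQL sketch of the first idea (comparing with history), assuming every target carries a cluster label and treating the state one hour ago as the baseline:

# Clusters that had kube-apiserver targets an hour ago but have none now.
# The 1h offset is an arbitrary baseline; tune it to your environment.
group by (cluster) (up{job="kube-apiserver", cluster!=""} offset 1h)
unless on (cluster)
group by (cluster) (up{job="kube-apiserver", cluster!=""})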

skl commented 2 months ago

This is tough when considering auto-scaling node groups. For example, if a node is scaled down and removed intentionally, that shouldn't trigger an alert. So taking every single instance into account seems difficult.

However, you could try to assert that at least one instance of the API server job is present in each cluster with a query like:

# This query lists all clusters found by kube_node_info and marks each one with
# either 1 or 0, depending on whether it has any up{job="kube-apiserver"} series.
#
# Branch 1: clusters without the kube-apiserver job, marked with value 0, e.g.:
# {cluster="my-cluster-without-apiserver-job"} 0
1 - group by (cluster) (max by (cluster, node) (kube_node_info{cluster!=""}))
unless on (cluster) (
  # except those clusters with kube-apiserver
  group by (cluster) (up{job="kube-apiserver", cluster!=""})
)
# Branch 2: clusters that do have the kube-apiserver job, marked with value 1, e.g.:
or on (cluster) (
  # {cluster="my-cluster-with-apiserver-job"} 1
  group by (cluster) (max by (cluster, node) (kube_node_info{cluster!=""}))
)

But this is use-case dependent.

Some users would want ALL clusters to have the apiserver job, which is fairly easy to alert on (look for anything with a value of zero).
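
For that case, a minimal rule sketch (the alert name is illustrative; it assumes kube_node_info and up both carry a non-empty cluster label) could be:

- "alert": "KubeAPIDownInCluster"
    # Sketch only: fires for any cluster known to kube_node_info that has no
    # kube-apiserver targets in Prometheus service discovery.
    "expr": |
      group by (cluster) (max by (cluster, node) (kube_node_info{cluster!=""}))
      unless on (cluster)
      group by (cluster) (up{job="kube-apiserver", cluster!=""})
    "for": "15m"
    "labels":
      "severity": "critical"
    "annotations":
      "summary": "No kube-apiserver targets discovered for cluster {{ $labels.cluster }}."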

However, some users would want the apiserver job on only certain clusters, which likely requires modifying the query to match only the subset of clusters that are intended to run the apiserver job.
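
For instance, the subset could be selected with a label matcher on both sides of the query (the "prod-.*" regex below is purely illustrative):

# Only clusters matching the regex are expected to run the kube-apiserver job.
group by (cluster) (max by (cluster, node) (kube_node_info{cluster=~"prod-.*"}))
unless on (cluster)
group by (cluster) (up{job="kube-apiserver", cluster=~"prod-.*"})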

github-actions[bot] commented 1 month ago

This issue has not had any activity in the past 30 days, so the stale label has been added to it.

Thank you for your contributions!