thunko closed this 3 weeks ago
+1. I think the best approach is to compare against history: if the apiserver series disappears, raise an alert.
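A minimal sketch of the "compare with history" idea, assuming the same `cluster` label and `job="kube-apiserver"` selector used elsewhere in this thread. The 1h offset is an arbitrary choice for illustration, not something taken from the generated rule:

```promql
# Clusters that had a kube-apiserver target one hour ago ...
group by (cluster) (up{job="kube-apiserver", cluster!=""} offset 1h)
# ... minus the clusters that still have one now.
unless on (cluster)
group by (cluster) (up{job="kube-apiserver", cluster!=""})
```

Anything this returns is a cluster whose apiserver series existed an hour ago and is missing now. The result goes empty again once the offset no longer reaches back to a time when the target existed, so an alert built on it would resolve by itself after roughly an hour.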
Another technique is to move only the "up" rule into a separate group and deploy it per cluster. This way we have common rules plus per-cluster rules.
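For the per-cluster group, the expression could look roughly like this. `my-cluster` is a placeholder, and each cluster's copy of the rule would carry its own value:

```promql
# Returns a series when no kube-apiserver target in this one cluster reports up == 1.
absent(up{job="kube-apiserver", cluster="my-cluster"} == 1)
```

Because both matchers are equality matchers, the series returned by `absent()` carries the `job` and `cluster` labels, so the alert still says which cluster is affected.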
This is tough when considering auto-scaling node groups. For example, if a node is scaled down and removed intentionally, that shouldn't trigger an alert. So taking every single instance into account seems difficult.
However, you could try and assert that at least one instance of the API server job is present in each cluster with a query like:
# This query lists all clusters found by kube_node_info, and marks them as either
# 1 or 0 depending on if they have up{job="kube-apiserver"}, or not (respectively).
#
# List all clusters and mark them value: 0
# {cluster="my-cluster-without-apiserver-job"} 0
1 - group by (cluster) (max by (cluster, node) (kube_node_info{cluster!=""}))
unless on (cluster) (
  # except those clusters with kube-apiserver
  group by (cluster) (up{job="kube-apiserver", cluster!=""})
)
# List all clusters with kube-apiserver and mark them with value: 1
or on (cluster) (
  # {cluster="my-cluster-with-apiserver-job"} 1
  group by (cluster) (max by (cluster, node) (kube_node_info{cluster!=""}))
)
But this is use-case dependent.
Some users would want ALL clusters to have the apiserver job, which is fairly easy to alert on (look for anything with a value of zero).
However, some users would want the apiserver job on only certain clusters, in which case the query likely needs to be modified to match only the subset of clusters that are intended to have it.
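As a sketch of both cases, the query can be reduced to return only the clusters that would have had a value of zero, which makes it usable directly as an alert expression. The `prod-.*` pattern is a hypothetical naming scheme for the subset case; use `cluster!=""` instead to cover all clusters:

```promql
# Clusters known from kube_node_info (restricted to the intended subset) ...
group by (cluster) (kube_node_info{cluster=~"prod-.*"})
# ... that have no kube-apiserver target at all.
unless on (cluster)
group by (cluster) (up{job="kube-apiserver", cluster=~"prod-.*"})
```

This only detects clusters that still report `kube_node_info`. A cluster that stops sending any metrics at all would drop out of both sides and not show up here.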
This issue has not had any activity in the past 30 days, so the stale label has been added to it.
The stale label will be removed if there is new activity.
Add the keepalive label to exempt this issue from the stale check action.
Thank you for your contributions!
hi,
I get the following rule when generating prometheus alerts for kubeapi:
The issue I'm running into is that my Prometheus instance reads data from several clusters, so this rule doesn't work as intended: the alert will not trigger as long as any KubeAPI is up.
I could create a rule for each cluster, but I'd like to avoid hard-coding.
Have you run into a similar situation, and what would you suggest for such a use case? Thank you,