Open tobiscr opened 9 months ago
To begin with, we will measure only the number of non-healthy Gardener clusters:
KPI | Description | Threshold which triggers an alert |
---|---|---|
Number of Gardener Clusters in non-healthy state | Counting all RuntimeCRs which are in state failed | >0 |
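
As a rough illustration of how this KPI could be computed, here is a minimal Go sketch. It is not the Infrastructure Manager's implementation: the `Runtime` type, the state names `ready` and `pending`, and the function names are assumptions made for the example. Only the `failed` state and the `>0` threshold come from the table above.

```go
package main

import "fmt"

// Runtime is a hypothetical, simplified stand-in for a RuntimeCR.
// Only the fields needed for the KPI are modelled here.
type Runtime struct {
	Name  string
	State string // assumed state names, e.g. "ready", "pending", "failed"
}

// CountByState returns how many runtimes are in each state.
func CountByState(runtimes []Runtime) map[string]int {
	counts := make(map[string]int)
	for _, r := range runtimes {
		counts[r.State]++
	}
	return counts
}

// UnhealthyClusters is the KPI from the table:
// the number of runtimes that are in state "failed".
func UnhealthyClusters(runtimes []Runtime) int {
	return CountByState(runtimes)["failed"]
}

func main() {
	runtimes := []Runtime{
		{Name: "rt-a", State: "ready"},
		{Name: "rt-b", State: "failed"},
		{Name: "rt-c", State: "pending"},
	}
	kpi := UnhealthyClusters(runtimes)
	// The table's threshold is ">0": any failed runtime triggers an alert.
	fmt.Printf("unhealthy clusters: %d, alert: %t\n", kpi, kpi > 0)
}
```

Counting per state first keeps the door open for further per-state KPIs later without changing how runtimes are scanned.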
Mockup of the dashboard idea:
The following metrics are collected:
sFnUpdateStatus()
updateStatusAndStop()
stop()
updateStatusAndStopWithError()
Additionally, after some discussions, I will also include in the dashboard some metrics from kubebuilder that we can use for our performance tests.
Description
With #11 we are able to make the Infrastructure Manager transparent and also simplify our operational life by establishing smart metrics and alerting rules.
The goal of this task is to identify which metrics / KPIs are business relevant and what their critical thresholds are. We also have to define an action plan for when such a threshold is reached, which triggers the action required to bring our business back on track. Finally, alerting rules have to be configured which inform us as soon as one of the thresholds is reached.
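
The relationship between KPI, critical threshold, and action plan can be sketched as a small rule table. The following Go example is only an illustration under assumptions: the `AlertRule` and `Alert` types, the `Evaluate` function, the KPI name `unhealthy_gardener_clusters`, and the action text are all hypothetical placeholders. In practice the rules would most likely be configured in the monitoring stack rather than in application code.

```go
package main

import "fmt"

// AlertRule ties a KPI to its critical threshold and the action to take.
// All names and the action text are hypothetical placeholders.
type AlertRule struct {
	KPI       string
	Threshold float64 // the rule fires when the KPI value is strictly greater
	Action    string
}

// Alert is a fired rule together with the value that triggered it.
type Alert struct {
	Rule  AlertRule
	Value float64
}

// Evaluate returns one Alert per rule whose KPI value exceeds its threshold.
// KPIs without a measured value are skipped.
func Evaluate(rules []AlertRule, values map[string]float64) []Alert {
	var fired []Alert
	for _, rule := range rules {
		v, ok := values[rule.KPI]
		if ok && v > rule.Threshold {
			fired = append(fired, Alert{Rule: rule, Value: v})
		}
	}
	return fired
}

func main() {
	rules := []AlertRule{
		{
			KPI:       "unhealthy_gardener_clusters", // assumed KPI name
			Threshold: 0,
			Action:    "placeholder: follow the agreed action plan",
		},
	}
	values := map[string]float64{"unhealthy_gardener_clusters": 2}

	for _, a := range Evaluate(rules, values) {
		fmt.Printf("ALERT %s=%v (threshold >%v): %s\n",
			a.Rule.KPI, a.Value, a.Rule.Threshold, a.Rule.Action)
	}
}
```

Keeping the action next to the threshold means every alert that fires already carries the step the on-call person is expected to take.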
AC:
Reasons
Improve operational quality and simplify on-call shifts by establishing proper metric/KPI measurement and alerting.
Extends #11
Attachments