kyma-project / infrastructure-manager

Apache License 2.0

[QG] Metrics: Identify and implement business critical metrics / KPIs, define an action plan and configure alerting rules #113

Open tobiscr opened 9 months ago

tobiscr commented 9 months ago

Description

With #11 we are able to make the Infrastructure Manager transparent and also simplify our operational life by establishing smart metrics and alerting rules.

The goal of this task is to identify which metrics / KPIs are business relevant and what their critical thresholds are. We also have to define an action plan for when such a threshold is reached, which triggers the action required to bring our business back on track. Finally, alerting rules have to be configured that inform us as soon as one of the thresholds is reached.

AC:

Reasons

Improve operational quality and simplify on-call shifts by establishing proper metrics/KPI measurement and alerting.

Extends #11

Attachments

tobiscr commented 3 weeks ago

To begin with, we will measure only the number of non-healthy Gardener clusters:

| KPI | Description | Threshold which triggers an alert |
| --- | --- | --- |
| Number of Gardener Clusters in non-healthy state | Counting all RuntimeCRs which are in state failed | >0 |

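The KPI above could be evaluated roughly as in the following Go sketch. It is only an illustration under stated assumptions: the `Runtime` type, the `RuntimeState` values, and the `CountFailed` / `ShouldAlert` helpers are hypothetical names and not the Infrastructure Manager API. In a real setup the count would more likely be exposed as a metric and the `>0` check expressed as an alerting rule.

```go
package main

import "fmt"

// RuntimeState is a hypothetical, simplified view of a Runtime CR's state.
// The actual state values used by the Infrastructure Manager may differ.
type RuntimeState string

const (
	StateReady   RuntimeState = "Ready"
	StatePending RuntimeState = "Pending"
	StateFailed  RuntimeState = "Failed"
)

// Runtime is a hypothetical stand-in for a Runtime CR.
type Runtime struct {
	Name  string
	State RuntimeState
}

// failedThreshold mirrors the ">0" threshold from the KPI table:
// an alert fires as soon as the count exceeds this value.
const failedThreshold = 0

// CountFailed returns the number of runtimes in the failed state.
func CountFailed(runtimes []Runtime) int {
	failed := 0
	for _, r := range runtimes {
		if r.State == StateFailed {
			failed++
		}
	}
	return failed
}

// ShouldAlert reports whether the failed count exceeds the threshold.
func ShouldAlert(failed int) bool {
	return failed > failedThreshold
}

func main() {
	// Sample data for illustration only.
	runtimes := []Runtime{
		{Name: "runtime-a", State: StateReady},
		{Name: "runtime-b", State: StateFailed},
		{Name: "runtime-c", State: StatePending},
	}
	failed := CountFailed(runtimes)
	fmt.Printf("failed runtimes: %d, alert: %t\n", failed, ShouldAlert(failed))
}
```

Keeping the count and the threshold check separate means the threshold can later be changed, or moved into the alerting configuration, without touching the counting logic.
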
koala7659 commented 2 weeks ago

Mockup of the dashboard idea:

Image

koala7659 commented 2 weeks ago

The following metrics are collected:

Additionally, after some discussions, I will also include in the dashboard some metrics from kubebuilder that we can use for our performance tests.