[QG] Metrics: Identify and implement business critical metrics / KPIs, define an action plan and configure alerting rules

tobiscr commented 9 months ago

Description

With #11 we are able to make the Infrastructure Manager transparent and also simplify our operational life by establishing smart metrics and alerting rules.

The goal of this task is to identify which metrics / KPIs are business relevant and what their critical thresholds are. We also have to define an action plan for when such a threshold is reached, describing the action required to bring our business back on track. Finally, alerting rules have to be configured that inform us as soon as one of the thresholds is reached.

AC:

[ ] Investigation: Verify how metrics are supported by Kubebuilder and how other teams are implementing them, so that we can reuse known patterns
[x] Think about technical and business critical metrics / KPIs which give a clear indication of the quality and health of the Infrastructure Manager (see comment below)
- [x] Define the reason why this metric is relevant and what it represents.
  - [x] Mandatory: metrics of REST client (especially egress traffic and their error rates etc.)
- [x] Define the thresholds (min <> max etc.) which indicate a service degradation or health issue of the Infrastructure Manager. If a metric has no threshold, verify whether measuring this value is still helpful for us.
- [x] Specify the required action that has to be applied if a threshold is reached to recover the Infrastructure Manager into a productive and healthy state
- [x] Present the results in the team to collect the feedback of the colleagues.
[ ] Implement the identified business metrics in the Infrastructure Manager
- [ ] Requirement from SRE: expose metrics of REST client (e.g. egress-traffic to Gardener or K8s in-cluster API) to be able to detect server-side / client-side errors.
[ ] Configure alerting rules which inform the team as soon as one of the thresholds is reached
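
A minimal sketch of how the REST client requirement from SRE could be approached: wrap the client's `http.RoundTripper` and count every egress request per target host and result class, so that server-side errors (5xx) and client-side failures (transport errors) can be told apart. Everything here is an assumption for illustration: `egressStats` is a hypothetical in-memory stand-in for a real metrics counter, the host `gardener.example` is made up, and the fake transport replaces real network calls.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
	"sync"
)

// egressStats is a hypothetical in-memory stand-in for a metrics counter
// keyed by target host and result class ("2xx", "4xx", "5xx", "error").
type egressStats struct {
	mu     sync.Mutex
	counts map[string]int
}

func newEgressStats() *egressStats {
	return &egressStats{counts: make(map[string]int)}
}

func (s *egressStats) inc(host, class string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.counts[host+"|"+class]++
}

func (s *egressStats) get(host, class string) int {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.counts[host+"|"+class]
}

// instrumentedTransport wraps another RoundTripper and records one count per request.
type instrumentedTransport struct {
	next  http.RoundTripper
	stats *egressStats
}

func (t *instrumentedTransport) RoundTrip(req *http.Request) (*http.Response, error) {
	resp, err := t.next.RoundTrip(req)
	if err != nil {
		// No response at all: a transport-level (client-side) failure.
		t.stats.inc(req.URL.Host, "error")
		return resp, err
	}
	t.stats.inc(req.URL.Host, fmt.Sprintf("%dxx", resp.StatusCode/100))
	return resp, nil
}

// roundTripFunc lets a plain function act as a RoundTripper (used as a fake backend).
type roundTripFunc func(*http.Request) (*http.Response, error)

func (f roundTripFunc) RoundTrip(r *http.Request) (*http.Response, error) { return f(r) }

// demo sends four requests through the instrumented client against a fake backend.
func demo() *egressStats {
	stats := newEgressStats()
	fake := roundTripFunc(func(r *http.Request) (*http.Response, error) {
		switch r.URL.Path {
		case "/ok":
			return &http.Response{StatusCode: 200, Body: http.NoBody, Request: r}, nil
		case "/unavailable":
			return &http.Response{StatusCode: 503, Body: http.NoBody, Request: r}, nil
		default:
			return nil, errors.New("connection refused")
		}
	})
	client := &http.Client{Transport: &instrumentedTransport{next: fake, stats: stats}}
	for _, path := range []string{"/ok", "/ok", "/unavailable", "/unreachable"} {
		resp, err := client.Get("https://gardener.example" + path)
		if err == nil {
			resp.Body.Close()
		}
	}
	return stats
}

func main() {
	s := demo()
	for _, class := range []string{"2xx", "5xx", "error"} {
		fmt.Printf("gardener.example %s: %d\n", class, s.get("gardener.example", class))
	}
}
```

In a real setup the counter would presumably be a Prometheus counter with `host` and `code` labels instead of a map, and the wrapper would be installed on whatever transport the Gardener and in-cluster clients actually use.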

Reasons

Improve operational quality and simplify on-call shifts by establishing proper metrics/KPI measuring and alerting.

Extends #11

Attachments

tobiscr commented 3 weeks ago

To begin with, we will measure only the number of non-healthy Gardener clusters:

| KPI | Description | Threshold which triggers an alert |
| --- | --- | --- |
| Number of Gardener Clusters in non-healthy state | Counting all RuntimeCRs which are in state `failed` | >0 |
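
To make the KPI and its threshold concrete, here is a rough sketch of the evaluation logic. The `runtime` struct is a hypothetical, trimmed-down view of a RuntimeCR (only a name and a state), and the runtime names and the `ready` state are invented for the example; only the `failed` state and the `>0` threshold come from the KPI definition.

```go
package main

import "fmt"

// runtime is a hypothetical, reduced view of a RuntimeCR.
type runtime struct {
	Name  string
	State string
}

// stateFailed is the state named in the KPI definition.
const stateFailed = "failed"

// countFailed returns the KPI value: the number of runtimes in state "failed".
func countFailed(runtimes []runtime) int {
	n := 0
	for _, r := range runtimes {
		if r.State == stateFailed {
			n++
		}
	}
	return n
}

// alertFires applies the threshold from the KPI definition: any value > 0 alerts.
func alertFires(failed int) bool {
	return failed > 0
}

func main() {
	runtimes := []runtime{
		{Name: "rt-a", State: "ready"}, // "ready" is an assumed state name
		{Name: "rt-b", State: stateFailed},
	}
	failed := countFailed(runtimes)
	fmt.Printf("failed=%d alert=%v\n", failed, alertFires(failed))
}
```

In practice the count would most likely be exposed as a gauge and the `>0` comparison would live in the alerting rule rather than in Go code.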

koala7659 commented 2 weeks ago

Mockup of dashboard idea:

koala7659 commented 2 weeks ago

The following metrics are collected:

Runtime states, as they are updated in the sFnUpdateStatus() function
Unexpected stops of the FSM, i.e. when the machine stops before finishing processing via one of the following functions:
- updateStatusAndStop()
- stop()
- updateStatusAndStopWithError()
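
A rough sketch of how these two metrics could be recorded, assuming a small collector that the FSM calls from its status update and stop paths. The type `fsmMetrics` and its methods are hypothetical names, not taken from the codebase, and the states `pending` and `ready` are invented; the map-based storage stands in for a real gauge and counter.

```go
package main

import (
	"fmt"
	"sync"
)

// fsmMetrics is a hypothetical collector for the two metrics listed above.
type fsmMetrics struct {
	mu              sync.Mutex
	stateByRuntime  map[string]string // current state per runtime
	unexpectedStops map[string]int    // stop count per stop function name
}

func newFSMMetrics() *fsmMetrics {
	return &fsmMetrics{
		stateByRuntime:  make(map[string]string),
		unexpectedStops: make(map[string]int),
	}
}

// setRuntimeState would be called wherever the status is updated. It overwrites
// the previous state so that each runtime is counted in exactly one state.
func (m *fsmMetrics) setRuntimeState(runtimeName, state string) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.stateByRuntime[runtimeName] = state
}

// incUnexpectedStop would be called from a stop path, labelled with the
// name of the function that stopped the machine.
func (m *fsmMetrics) incUnexpectedStop(stopFn string) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.unexpectedStops[stopFn]++
}

// runtimesInState returns how many runtimes are currently in the given state.
func (m *fsmMetrics) runtimesInState(state string) int {
	m.mu.Lock()
	defer m.mu.Unlock()
	n := 0
	for _, s := range m.stateByRuntime {
		if s == state {
			n++
		}
	}
	return n
}

// stops returns how often the given stop function ended the machine early.
func (m *fsmMetrics) stops(stopFn string) int {
	m.mu.Lock()
	defer m.mu.Unlock()
	return m.unexpectedStops[stopFn]
}

func main() {
	m := newFSMMetrics()
	m.setRuntimeState("rt-a", "pending") // assumed state name
	m.setRuntimeState("rt-a", "failed")
	m.setRuntimeState("rt-b", "ready") // assumed state name
	m.incUnexpectedStop("updateStatusAndStopWithError")

	fmt.Println("failed runtimes:", m.runtimesInState("failed"))
	fmt.Println("stops via updateStatusAndStopWithError:", m.stops("updateStatusAndStopWithError"))
}
```

Overwriting the previous state matters here: if the old state value were left in place, a runtime that moved from one state to another would be counted in both.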

Additionally, after some discussions, I will also add some Kubebuilder metrics to the dashboard that we can use for our performance tests.

kyma-project / infrastructure-manager

[QG] Metrics: Identify and implement business critical metrics / KPIs, define an action plan and configure alerting rules #113