Open tobiscr opened 2 months ago
Define a concept which ensures that any state change in the FSM which causes an error will trigger an retry with exponential backoff
The FSM is just a convenient way how we organise the code. The kyma-infrastructure-manager
controller is using the Kubebuilder SDK and we handle retries:
Any non-recoverable situation (e.g. when max amount of retries is reached) has to trigger an alert which notifies the on-call and development team about a illegal/defective state in FSM
The Kubebuilder SDK exposes basic metrics (provided by controller-runtime
) that can be used to trigger the alerts. If we will find the basic metrics insufficient, we can extend them.
Depends on https://github.com/kyma-project/infrastructure-manager/issues/113 for establishing KPIs and alerting
Description
As result of the Threat Modelling workshop from Aug 2024, we identified that a reliably applied retry- and also alerting mechanism as crucial to prevent KIM from inconsistent states.
KIM's logic as to ensure:
AC:
Depends on https://github.com/kyma-project/infrastructure-manager/issues/113