kyma-project / infrastructure-manager

Apache License 2.0

[Threat Modelling] Ensure retry-logic applied for exceptional situations #349

Open tobiscr opened 2 months ago

tobiscr commented 2 months ago

Description

As a result of the Threat Modelling workshop in Aug 2024, we identified a reliably applied retry and alerting mechanism as crucial to preventing KIM from ending up in inconsistent states.

KIM's logic has to ensure:

AC:

Depends on https://github.com/kyma-project/infrastructure-manager/issues/113

m00g3n commented 2 months ago

Define a concept which ensures that any state change in the FSM that causes an error triggers a retry with exponential backoff

The FSM is just a convenient way to organise the code. The kyma-infrastructure-manager controller uses the Kubebuilder SDK, and we handle retries:

Any non-recoverable situation (e.g. when the maximum number of retries is reached) has to trigger an alert which notifies the on-call and development teams about an illegal/defective state in the FSM

The Kubebuilder SDK exposes basic metrics (provided by controller-runtime) that can be used to trigger the alerts. If we find the basic metrics insufficient, we can extend them.
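For context, the basic controller-runtime metrics include counters such as `controller_runtime_reconcile_errors_total`, which an alert rule could watch. If those prove insufficient, one possible extension is to track consecutive failures per object and flag when a retry budget is exhausted. The sketch below shows only that bookkeeping, using the standard library. The type `retryBudget`, its methods, and the idea of a per-object budget are assumptions for illustration, not existing KIM code; wiring the result into a custom metric is left out.

```go
package main

import "sync"

// retryBudget is a hypothetical helper that counts consecutive failures per
// object key and reports when a configured maximum has been exceeded.
type retryBudget struct {
	mu         sync.Mutex
	maxRetries int
	failures   map[string]int
}

func newRetryBudget(maxRetries int) *retryBudget {
	return &retryBudget{maxRetries: maxRetries, failures: make(map[string]int)}
}

// recordFailure increments the failure count for key and returns true once
// the count exceeds maxRetries, i.e. when an alert should be raised.
func (b *retryBudget) recordFailure(key string) bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.failures[key]++
	return b.failures[key] > b.maxRetries
}

// recordSuccess resets the failure count for key after a successful attempt.
func (b *retryBudget) recordSuccess(key string) {
	b.mu.Lock()
	defer b.mu.Unlock()
	delete(b.failures, key)
}
```

A reconciler could call `recordFailure` whenever a state transition fails and, when it returns true, increment a custom counter that the alerting rules evaluate.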

tobiscr commented 1 month ago

Depends on https://github.com/kyma-project/infrastructure-manager/issues/113 for establishing KPIs and alerting