Suppress event creation during SKR secret rotation

Tomasz-Smelcerz-SAP commented 9 months ago

Description

After recent changes in the infrastructure configuration, SKR secrets in KCP are rotated roughly every hour. When the secrets are rotated, the Lifecycle-Manager (LM) experiences a burst of Unauthorized errors. Two reconciliation loops are mostly affected: the Kyma reconciliation loop and the Manifest reconciliation loop. The LM produces an event on every error, and eventually there's an error for every Kyma / Manifest object. This means the LM produces a burst of Unauthorized events at a rate reaching 90 events / minute. We reduced the event size in https://github.com/kyma-project/lifecycle-manager/pull/1222 and ensured the clients are refreshed as soon as possible in https://github.com/kyma-project/lifecycle-manager/pull/1234, to shorten the period during which Unauthorized errors occur. This has already helped: we no longer observe LM restarts due to long ETCD time-outs (over 2 minutes). Unfortunately, we still occasionally observe ETCD outages. The next step towards a "well-behaved controller" is to reduce the number of events.

Note: This may be harder than it sounds. Normally we emit an event for a specific Kyma object. If we "compact" events for many Kymas into one, which Kyma object would be the target of the event? Maybe we should just suppress such errors? But then what about occasional Unauthorized errors that are not related to secret rotation? How do we distinguish these two conditions?

Reasons

There's no point in creating hundreds of otherwise identical events.
The event bursts have a negative impact on ETCD performance.

Acceptance Criteria

Reduce the number of events created during secret rotation.

Feature Testing

No response

Testing approach

No response

Attachments

No response

Tomasz-Smelcerz-SAP commented 9 months ago

We decided to proceed as follows:

Suppress event creation for "Unauthorized" errors: no event is emitted at all
Introduce a new metric that counts the number of "Unauthorized" errors over a certain period of time (like 15 minutes)
Introduce an alert rule that is triggered when the number of "Unauthorized" errors exceeds some threshold
Consider two different metrics, because we have two independent loops, both affected by Secret rotation: Kyma reconciliation and Manifest reconciliation

c-pius commented 8 months ago

Re-opening to address the proposal from Xin: https://github.com/kyma-project/lifecycle-manager/pull/1275#discussion_r1481305436
