After recent changes in infrastructure configuration, SKR secrets in KCP are rotated every hour or so.
When the secrets are rotated, the Lifecycle-Manager experiences a burst of Unauthorized errors.
Two reconciliation loops are mostly affected: the Kyma reconciliation loop and the Manifest reconciliation loop.
The LM produces an event on every error, and eventually there's an error for every Kyma / Manifest object.
This means LM produces a burst of Unauthorized events with a rate reaching 90 events / minute.
We reduced the event size in https://github.com/kyma-project/lifecycle-manager/pull/1222, and ensured the clients are refreshed as soon as possible in https://github.com/kyma-project/lifecycle-manager/pull/1234, to shorten the period during which Unauthorized errors occur.
This has already helped: we no longer observe LM restarts due to long ETCD time-outs (over 2 minutes).
Unfortunately we still observe ETCD outages occasionally.
The next step towards a "well behaving controller" is to reduce the number of events.
Note: It may be harder than it sounds. Normally we emit an event for a specific Kyma object. If we "compact" events for many Kymas into one, which Kyma object would be the target of the event? Maybe we should just suppress such errors? But then what about occasional Unauthorized errors that are not related to secret rotation? How do we distinguish these two conditions?
Reasons
There's no point in creating hundreds of otherwise identical events.
The event bursts have negative impact on ETCD performance.
Acceptance Criteria
Reduce the number of events created during secret rotation.
Suppress event creation for "Unauthorized" errors: no event is emitted at all.
Introduce a new metric that counts the number of "Unauthorized" errors over a certain period of time (for example, 15 minutes).
Introduce an alert rule that is triggered when the number of "Unauthorized" errors exceeds some threshold.
Consider two different metrics, because we have two independent loops, both affected by Secret rotation: Kyma reconciliation and Manifest reconciliation