feat: [Zero-Downtime] - safe migration

Tomasz-Smelcerz-SAP commented 2 months ago

Description

To introduce the zero-downtime procedure, we need a safe migration path. By "safe migration path" I mean a setup where a "revert" to the old behavior is as simple as possible, in case there is a bug in the new solution or it does not work as expected for whatever reason. In particular, the revert should be as simple as switching some Lifecycle-Manager flags (runtime arguments): the fewer, the better.

An idea for how this could work:

The current solution uses the root certificate secret directly as the Istio-Gateway secret. Once the root cert is rotated, the LM code deletes the client secrets (that are based on the "previous" root), causing cert-manager to renew them.

The new ("zero-downtime") solution uses a dedicated secret for the Istio-Gateway, managed entirely by Lifecycle-Manager. This new secret of course has a different name from the "root" secret, which is still managed and rotated by cert-manager. The dedicated secret decouples Istio-Gateway from changes to the root certificate: it is Lifecycle-Manager that decides when to propagate the changes.

How can we revert from the new solution to the old one? If the secret name used for Istio-Gateway is different in the old (current) and new (future) solution, then in order to switch back we would have to change the Helm Charts, or at least the entry in the values.yaml. In addition, we would probably need to deploy a different LM version, the one with the "old" logic. But then we're reverting the Lifecycle-Manager version, along with all the other features, security fixes etc.

To improve the situation, we should:

have an LM version that is capable of running either the old or the new code, depending on some feature flag.
keep the Istio Gateway secret name the same in both scenarios, to avoid manual tweaks in the Helm charts.

The first requirement is relatively easy - we just need to extract the relevant cert-management logic to a component and then provide two different implementations: the old one (current) and the new one. And we need a flag to decide which component should be actually used at runtime.
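A minimal Go sketch of the first requirement, assuming the cert-management logic can be hidden behind one interface. All names here (CertRotationHandler, legacyHandler, zeroDowntimeHandler, and the flag spelling cert-management-legacy) are hypothetical and not taken from the Lifecycle-Manager code base; the handlers only report their name, where real ones would carry the rotation logic.

```go
package main

import (
	"flag"
	"fmt"
)

// CertRotationHandler is a hypothetical name for the extracted
// cert-management component; the real interface would expose the
// rotation logic instead of just a name.
type CertRotationHandler interface {
	Name() string
}

// legacyHandler stands in for the old (current) logic.
type legacyHandler struct{}

func (legacyHandler) Name() string { return "legacy" }

// zeroDowntimeHandler stands in for the new logic.
type zeroDowntimeHandler struct{}

func (zeroDowntimeHandler) Name() string { return "zero-downtime" }

// NewCertRotationHandler picks the implementation based on a single flag.
func NewCertRotationHandler(legacy bool) CertRotationHandler {
	if legacy {
		return legacyHandler{}
	}
	return zeroDowntimeHandler{}
}

func main() {
	// Assumed flag name; the default selects the new implementation.
	legacy := flag.Bool("cert-management-legacy", false,
		"use the old cert-management logic")
	flag.Parse()

	handler := NewCertRotationHandler(*legacy)
	fmt.Println("using cert-management implementation:", handler.Name())
}
```

With this shape, a revert only changes one runtime argument; the rest of the manager depends on the interface and does not need to know which implementation is active.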

The second requirement is more tricky. In order to make it work, we should change the Lifecycle-Manager in the following way (it's just an idea, maybe it can be done in a simpler way):

The root certificate is no longer directly used to configure Istio Gateway
We introduce our own managed secret that is a copy of the root secret (the current Istio-Gateway secret).
There is a new "agent" that actively syncs the root secret to the copy.
Minimal changes in the code are required - The reconciliation logic no longer inspects the timestamps on the root secret, it inspects the created copy instead
Kustomize configuration of the Istio Gateway (and Helm charts) are changed so that the new secret is used instead of the "root" one.

By introducing this solution ("safe-migration"), we achieve the following:

Both the "safe-migration" and "final" solutions use the same secret name for the Istio-Gateway. Hence, when reverting, there's no need to change anything in the Helm charts.
The revert can be accomplished by switching a single boolean flag on the Lifecycle-Manager, like: --cert-management-legacy=true, assuming no additional configuration flags are required.
Don't worry about the "copy" or the new "agent" - in the final solution these also exist. And this issue introduces just a temporary solution. Once the final solution works, we'll remove the support for the "old" logic entirely.

Implementation notes for the syncing "agent" - for lack of a better name, let's call it: "Istio Gateway Secret Manager"

What we really need is something that watches the root secret and acts upon changes on this object
So it can be a single goroutine, a mini-controller, etc. Maybe even something else - a little research is required.
Note that although the secret name is the same, the data in the secret differs. The "safe-migration" solution makes IstioGatewaySecret just a copy of the root secret, while the final solution uses a modified (newRoot+oldRoot) value for the caBundle in the IstioGatewaySecret.
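The "safe-migration" variant of the agent could be sketched roughly as below. This is only an illustration under several assumptions: it uses an in-memory map instead of a real Kubernetes client, the secret names root-secret and gateway-secret are placeholders, and the events channel stands in for whatever watch or informer mechanism is eventually chosen. The periodic tick covers the case where a change notification is missed.

```go
package main

import (
	"bytes"
	"context"
	"fmt"
	"time"
)

// Secret stands in for the data of a Kubernetes Secret (key -> bytes).
type Secret map[string][]byte

// memStore is an in-memory stand-in for the cluster; it is not safe for
// concurrent use.
type memStore map[string]Secret

// equal reports whether two secrets hold the same keys and values.
func equal(a, b Secret) bool {
	if len(a) != len(b) {
		return false
	}
	for k, v := range a {
		w, ok := b[k]
		if !ok || !bytes.Equal(v, w) {
			return false
		}
	}
	return true
}

// syncOnce copies the root secret to the gateway secret if they differ.
// It returns true when the gateway secret was written.
func syncOnce(store memStore, rootName, gatewayName string) bool {
	root, ok := store[rootName]
	if !ok {
		return false // nothing to copy yet
	}
	if gw, exists := store[gatewayName]; exists && equal(root, gw) {
		return false // already up to date
	}
	cp := make(Secret, len(root))
	for k, v := range root {
		cp[k] = append([]byte(nil), v...) // deep copy of the bytes
	}
	store[gatewayName] = cp
	return true
}

// run syncs on every change notification and on every resync tick,
// until the context is cancelled.
func run(ctx context.Context, store memStore, events <-chan struct{}, resync time.Duration) {
	ticker := time.NewTicker(resync)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-events:
		case <-ticker.C:
		}
		syncOnce(store, "root-secret", "gateway-secret")
	}
}

func main() {
	store := memStore{"root-secret": {"tls.crt": []byte("cert-v1")}}
	events := make(chan struct{}, 1)
	events <- struct{}{} // simulate one change notification

	ctx, cancel := context.WithTimeout(context.Background(), 50*time.Millisecond)
	defer cancel()
	run(ctx, store, events, 10*time.Millisecond)

	fmt.Println("in sync:", equal(store["root-secret"], store["gateway-secret"]))
}
```

In the final solution, only the body of syncOnce would need to change, since the gateway secret then carries a combined (newRoot+oldRoot) caBundle rather than a plain copy.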

Reasons

Safe migration path - ability to revert fast and with minimal risk of doing it wrong - in case of trouble.

Implementation notes:

introduce a new mechanism for copying the secret. Can be a single goroutine with a watcher/informer. It should also poll at a low frequency in case a k8s event is lost (AFAIK k8s events are not 100% reliable)
modify the condition upon which the skr watcher client secrets are removed: https://github.com/kyma-project/lifecycle-manager/blob/f0ac9bb8aa0c8770409b1123e2c4fbf4d75980f4/pkg/watcher/certificate_manager.go#L299
if necessary, consider adding some annotation on the Istio Gateway Secret (IGS) object to make "rotation detection" in the code easy.
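One way the annotation-based "rotation detection" could look is sketched below. It does not reproduce the existing condition in certificate_manager.go; it assumes that the secret manager stamps the IGS with a timestamp annotation on every update, and that a client secret created before that timestamp counts as stale. The annotation key and the ObjectMeta type are hypothetical stand-ins.

```go
package main

import (
	"fmt"
	"time"
)

// lastModifiedAnnotation is a hypothetical annotation key that the
// secret manager would set on the Istio Gateway Secret on each update.
const lastModifiedAnnotation = "example.invalid/last-modified-at"

// ObjectMeta is a reduced stand-in for Kubernetes object metadata.
type ObjectMeta struct {
	Annotations       map[string]string
	CreationTimestamp time.Time
}

// mustParse parses an RFC 3339 timestamp and panics on malformed input.
func mustParse(s string) time.Time {
	t, err := time.Parse(time.RFC3339, s)
	if err != nil {
		panic(err)
	}
	return t
}

// clientSecretIsStale reports whether the client secret was created
// before the last recorded update of the gateway secret.
func clientSecretIsStale(gateway, client ObjectMeta) (bool, error) {
	raw, ok := gateway.Annotations[lastModifiedAnnotation]
	if !ok {
		return false, nil // no update recorded, nothing to remove
	}
	rotatedAt, err := time.Parse(time.RFC3339, raw)
	if err != nil {
		return false, fmt.Errorf("invalid %s annotation: %w", lastModifiedAnnotation, err)
	}
	return client.CreationTimestamp.Before(rotatedAt), nil
}

func main() {
	gateway := ObjectMeta{Annotations: map[string]string{
		lastModifiedAnnotation: "2024-05-02T00:00:00Z",
	}}
	client := ObjectMeta{CreationTimestamp: mustParse("2024-05-01T00:00:00Z")}

	stale, err := clientSecretIsStale(gateway, client)
	fmt.Println("stale:", stale, "err:", err)
}
```

Keeping the timestamp in an annotation means the check does not depend on which fields the API server or cert-manager happen to update on the secret.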

Acceptance Criteria

[x] Double-check that the proposed solution is not missing anything important
[ ] Verify with security team if the LM-managed secret is OK (we need this anyway for the "new solution")
[x] introduce a secret manager component that does a copy
[x] change the condition for watcher client secret removal so that it involves updates to the Istio Gateway Secret (which should be an up-to-date copy of the Root Secret)

Feature Testing

No response