Closed Tomasz-Smelcerz-SAP closed 2 days ago
Diagram: Istio-Gateway Secret Manager logic: https://www.mermaidchart.com/raw/675dbc64-4621-4d39-b660-add383087dc3?theme=light&version=v0.1&format=svg
Diagram: Kyma Reconciliation Logic after the changes proposed in this issue: https://www.mermaidchart.com/raw/f68fb76a-79db-4dac-a646-c99541407eed?theme=light&version=v0.1&format=svg
An API-less controller for one secret does not make sense and the Informer mechanism is also not suitable for our purposes since we can only react on events on the secret (so it will be too late already, the CA certificate would have been already rotated)
AFAIK, a go routine started in the main.go is the most sensible approach. It can either sleep a fixed amount of time and check if the CA ceritifcate is close to rotation or it can look at the renewal period in the Certifcate resource and calculate a timedelta and sleep for that period (this is more efficient but it should be coded in a way that if KLM pod dies, the timedelta is always calculated anew to avoid missing the CA rotation)
@LeelaChacha I don't think I got your point. We don't don't need to know exactly when the Cert is rotated. We just need to know that it has been recently rotated. This is because we don't need the secret's "previous" value - but a new one, after the rotation. Then we copy this value to our "IstioGatewaySecret". And if we, for any reason, need the "previous" value, then we have it - in the existing IstioGatewaySecret - it was copied there on previous rotation or on the system bootstrap. So a single goroutine with a watch on the secret should be enough.
Now the question remains: "how quickly should we make the copy of the secret?". The answer is: we can be quite "slow" or "lazy", if you prefer :) I'd say that probably everything works fine even if we detect the change after few hours from the actual rotation. Why? Because nothing would really "break" - the Istio Gateway has a valid certificate and the clients have valid certificate. Remember that rotation doesn't invalidate the "old" cert. It is still valid until it's expiration time. And this expiration time is the only "hard" time boundary for our solution. I assume rotation is configured to happen at least 24h before expiration so that gives us plenty of time. But to provide some reasonable value for the first iteration: We should detect the rotation (and make a copy) in less than 5 minutes. This triggers the client secret migration procedure, and they have to be migrated before the "old" cert expires, otherwise they'll lose connectivity.
The current code with a condition for secret removal: https://github.com/kyma-project/lifecycle-manager/blob/f0ac9bb8aa0c8770409b1123e2c4fbf4d75980f4/pkg/watcher/certificate_manager.go#L299
@LeelaChacha We just need to know that it has been recently rotated. This is because we don't need the secret's "previous" value - but a new one, after the rotation. Then we copy this value to our "IstioGatewaySecret".
I get your point. But an informer is still not a good way to watch for the secret since an informer can execute a function by attaching an event handler to events of a kind only. That means all events on all secrets would trigger the event handler and waste resources until we can actually check the namespace and name of the secret to check if it is even relevant.
Furthermore, using Informers creates duplicated caches as it does not share the source.Kind caches that the controllers use. So we will double the memory usuage.
@LeelaChacha I think there are ways to limit the number of objects you watch, in particular you can watch for events on just a single k8s resource. But I don't know how to do it using controller-runtime library ATM. Do you think we need some additional, short PoC for the "how to watch for the secret" problem?
Edit: I have found this - may be useful.
Description
In order to introduce the zero-downtime procedure, we need a safe migration path. By "safe migration path" I understand a setup, where a "revert" to the old behavior is as simple as possible - in case there is a bug in the new solution or it is not working as expected for whatever reason. In particular, the revert should be as simple as switching some Lifecycle-Manager flags (runtime arguments) - the less, the better.
An Idea for how this could work:
The current solution uses root certificate secret as the Istio-Gateway secret directly. Once the root cert is rotated, the LM code deletes the client secrets (that are based on the "previous" root), causing cert-manager to renew them. The new ("zero-downtime") solution uses a dedicated secret for the Istio-Gateway, managed entirely by Lifecycle-Manager. This new secret of course has a different name from the "root" secret, that is still managed and rotated by the cert-manager. The dedicated secret decouples Istio-Gateway from changes to the root certificate - it is Lifecycle-Manager that decides when to propagate the changes. How can we revert from the new solution to the old one? If the secret name used for Istio-Gateway is different in the old (current) and new (future) solution, then in order to switch back we would have to change the Helm Charts, or at least the entry in the
values.yaml
. In addition we would probably need to deploy a different LM version - the one with the "old" logic. But then we're reverting the Lifecycle-Manager version, along with all the other features, security fixes etc.To improve the situation, we should:
The first requirement is relatively easy - we just need to extract the relevant cert-management logic to a component and then provide two different implementations: the old one (current) and the new one. And we need a flag to decide which component should be actually used at runtime.
The second requirement is more tricky. In order to make it work, we should change the Lifecycle-Manager in the following way (it's just an idea, maybe it can be done in a simpler way):
By introducing this solution ("safe-migration"), we achieve the following:
--cert-mangament-legacy=true
, assuming no additional configuration flags are required.Implementation notes for the syncing "agent" - for lack of a better name, let's call it: "Istio Gateway Secret Manager"
See also: https://github.com/kyma-project/lifecycle-manager/issues/1506
Reasons
Safe migration path - ability to revert fast and with minimal risk of doing it wrong - in case of troubles.
Implementation notes:
Acceptance Criteria
Feature Testing
No response
Testing approach
No response
Attachments
No response
Related Issues:
https://github.com/kyma-project/lifecycle-manager/issues/1430