redhat-cop / vault-config-operator

An operator to support Haschicorp Vault configuration workflows from within Kubernetes
Apache License 2.0
110 stars 50 forks source link

KVv2 secrets are always entirely removed from Vault when RandomSecret is deleted from Kubernetes since v0.8.17 #196

Open davoustp opened 1 year ago

davoustp commented 1 year ago

Release v0.8.17 changes the logic to delete the KVv2 secret entirely from Vault whenever the RandomSecret Kubernetes resource is deleted, instead of removing only the last version of the Vault secret (releases v0.8.16 and below).

As mentioned in https://github.com/redhat-cop/vault-config-operator/issues/134#issuecomment-1587520678 , there are situations where deleting only the last version of a KVv2 secret is still desirable.

A very common example is to use a RandomSecret to generate a root password or database admin password.

In such a case, you would like to protect this KVv2 secret at any cost - loosing it for any reason (think about an erroneous gitops operation removing the RandomSecret resource, or even a straight kubectl deletion) would be catastrophic as the secret can only be recovered from a Vault backup.

So I'd suggest to make this KVv2 deletion logic configurable.

The deletion logic (either delete-last-version or delete-all-versions) could be set using:

  1. an environment variable
  2. a new RandomSecret spec field (requires changing the CRD here)

Consider .2 first, then if not set, then use 1. , and if not set either, use the current behavior.

This looks to maximize the use case coverage and flexibility.

@trevorbox @raffaelespazzoli does this make sense? I can provide a PR along this lines.

smnbbrv commented 1 year ago

I find this very disturbing as well. As for me, I'd completely block the secret removal instead of even the version removal. If the secret has only one version and then this version gets removed, the entire secret is removed as well, if I get it right?

davoustp commented 1 year ago

I find this very disturbing as well. As for me, I'd completely block the secret removal instead of even the version removal. If the secret has only one version and then this version gets removed, the entire secret is removed as well, if I get it right?

With v0.8.17+, the entire secret is removed from Vault, irrespective of the number of versions. Previous behavior (for KVv2 secrets) was to mark the last version as deleted, so when attempting to access this secret, you would get an error stating that this secret does not exist. However, you could go to Vault and undelete this latest version, making it available again (no data loss here).

In some cases, the entire removal could be useful as well, such as a testing system with a lot of automation around it: not removing these secrets from Vault would leave a lot of useless mess behind, even potentially creating unexpected behaviour if reusing the same secrets across testing sessions.

So what I suggest is that the deletion logic should be made configurable to suit the use case at hand. Some users may decide that no secret should be deleted for production systems, because too sensitive if secret is lost. Some others could choose to delete the last version only, so that there is an error but without data loss (requiring manual undelete if this was a mistake). Some could decide that the full removal better suits their need because their deployment process is stable and robust enough to avoid the situation entirely...

My point is that there is no one-suits-all deletion logic and should be made configurable...

trevorbox commented 1 year ago

I find this very disturbing as well. As for me, I'd completely block the secret removal instead of even the version removal. If the secret has only one version and then this version gets removed, the entire secret is removed as well, if I get it right?

With v0.8.17+, the entire secret is removed from Vault, irrespective of the number of versions. Previous behavior (for KVv2 secrets) was to mark the last version as deleted, so when attempting to access this secret, you would get an error stating that this secret does not exist. However, you could go to Vault and undelete this latest version, making it available again (no data loss here).

In some cases, the entire removal could be useful as well, such as a testing system with a lot of automation around it: not removing these secrets from Vault would leave a lot of useless mess behind, even potentially creating unexpected behaviour if reusing the same secrets across testing sessions.

So what I suggest is that the deletion logic should be made configurable to suit the use case at hand. Some users may decide that no secret should be deleted for production systems, because too sensitive if secret is lost. Some others could choose to delete the last version only, so that there is an error but without data loss (requiring manual undelete if this was a mistake). Some could decide that the full removal better suits their need because their deployment process is stable and robust enough to avoid the situation entirely...

My point is that there is no one-suits-all deletion logic and should be made configurable...

I agree it would be safer to keep it configurable. The conditional should probably apply to both KVv1 and KVv2 RandomSecret types - completely remove the secret in vault or not upon RandomSecret deletion.

I don't see the point in worrying about deleting specific versions in a KVv2 secret, are there use cases anyone cares to share? I think it would be good to keep the deletion logic consistent/simple... completely delete or do nothing to the secret in vault.

davoustp commented 1 year ago

@trevorbox thanks for the feedback.

I agree it would be safer to keep it configurable. The conditional should probably apply to both KVv1 and KVv2 RandomSecret types - completely remove the secret in vault or not upon RandomSecret deletion.

Ok, that makes sense. Before I get into the coding phase, let's agree on details. The global environment variable to drive this behavior could be named KEEP_VAULT_KV_ON_REMOVAL (when true, then KV secrets are NOT removed from Vault when the resource is deleted from k8s). Thoughts, any other proposal?

Second, is there a need to override this settings at RandomSecret resource level? If yes, what could be the (optional) field's name?

I don't see the point in worrying about deleting specific versions in a KVv2 secret, are these use cases anyone cares to share? I think it would be good to keep the deletion logic consistent/simple... completely delete or do nothing to the secret in vault.

I can imagine some as written above (an additional one was to maintain previous versions behavior), but I personally have no such need.

smnbbrv commented 1 year ago

Probably it makes sense to reuse the k8s naming convention that they use for PersistentVolume that covers same issues.

https://kubernetes.io/docs/concepts/storage/persistent-volumes/#reclaiming

davoustp commented 1 year ago

@smnbbrv you suggest to use the word "retain" instead of "keep" so that the environment variable is named something like RETAIN_VAULT_KV_ON_REMOVAL? I like that idea. 👍

trevorbox commented 1 year ago

@davoustp might be better to add another option to the the RandomSecretSpec for more granular control of deletion behavior? The retainVaulltKVSecretOnRemoval option should default to false to match existing behavior. Perhaps a RETAIN_VAULT_KV_SECRET_ON_REMOVAL_OVERRIDE env variable could override any individual option to make things simpler.

davoustp commented 1 year ago

@trevorbox agreed, more granular control is better. Last one: what's the best precedence? Use the envvar settings (defaulting to current behaviour) that can be overriden at resource level? Or the other way around?

smnbbrv commented 1 year ago

@smnbbrv you suggest to use the word "retain" instead of "keep" so that the environment variable is named something like RETAIN_VAULT_KV_ON_REMOVAL?

Not really, I would just call it policy. KV_SECRET_RETAIN_POLICY or similar, that can further accept values like retain, delete or even deleteLastVersion in case it is implemented

trevorbox commented 1 year ago

@trevorbox agreed, more granular control is better. Last one: what's the best precedence? Use the envvar settings (defaulting to current behaviour) that can be overriden at resource level? Or the other way around?

We need to decide if it would be useful to override any and all settings, if not then I don't see the point in an env var. I could see a use case to be able to override for testing purposes. Thoughts @davoustp @smnbbrv ?

@smnbbrv you suggest to use the word "retain" instead of "keep" so that the environment variable is named something like RETAIN_VAULT_KV_ON_REMOVAL?

Not really, I would just call it policy. KV_SECRET_RETAIN_POLICY or similar, that can further accept values like retain, delete or even deleteLastVersion in case it is implemented

I think that's a good idea, the field in the spec could be kvSecretRetainPolicy to default to delete with support for retain for now.

davoustp commented 1 year ago

We need to decide if it would be useful to override any and all settings, if not then I don't see the point in an env var. I could see a use case to be able to override for testing purposes. Thoughts @davoustp @smnbbrv ?

@trevorbox when running in production, I would shoot for the safest policy - that is, retaining the Vault KV secret on deletion. If there is no way to set this globally as the default value, then it would mean that we need to explicitly set the flag on each and every RandomSecret resource, which is a pain. Hence the proposal to use of an env var that would allow setting this globally (but any other global mechanism to set this flag would fit as well).

I think that's a good idea, the field in the spec could be kvSecretRetainPolicy to default to delete with support for retain for now.

Agreed.

raffaelespazzoli commented 1 year ago

guys I have been following the discussion. I have two questions:

  1. if we control this with an environment variable the behavior will be applied to all the random secrets. Should we (also) control this behavior via a field in the CR?
  2. if I chose to retain a secret and the RandomSecret CR is deleted and then recreated, what behavior do you expect?
davoustp commented 1 year ago

@raffaelespazzoli thanks for the (very relevant) questions! :-)

if we control this with an environment variable the behavior will be applied to all the random secrets. Should we (also) control this behavior via a field in the CR?

I think so - the envvar would be here to control the default policy, and the field to override this policy at resource level.

if I chose to retain a secret and the RandomSecret CR is deleted and then recreated, what behavior do you expect?

Unless the RandomSecret have periodic rotation defined, if it exists, leave it alone, as the sole goal of the rotation-less RandomSecret is to generate a Vault secret, which already exists in this case.

davoustp commented 1 year ago

To be totally honest, the downside that I see with the environment variable defining the default policy is that the behavior of the RandomSecret resources is not self-defined anymore: you need to know the value of the envvar (if set) in addition to the resource spec to understand how it behaves globally.

And I dislike this situation - a lot.

It could prove useful, but it will also create headaches: each time a new issue is created with RandomSecrets will raise the question about the envvar... or deploying the same resources in two similar environments may lead to different behavior just because this envvar is not set to the same value.

So now, I'm backtracking on this envvar proposal of mine - I think that we should drop it and keep only the resource field. This is more consistent in the long run.

Suggestions?

trevorbox commented 1 year ago

To be totally honest, the downside that I see with the environment variable defining the default policy is that the behavior of the RandomSecret resources is not self-defined anymore: you need to know the value of the envvar (if set) in addition to the resource spec to understand how it behaves globally.

And I dislike this situation - a lot.

It could prove useful, but it will also create headaches: each time a new issue is created with RandomSecrets will raise the question about the envvar... or deploying the same resources in two similar environments may lead to different behavior just because this envvar is not set to the same value.

So now, I'm backtracking on this envvar proposal of mine - I think that we should drop it and keep only the resource field. This is more consistent in the long run.

Suggestions?

I think it most reasonable to simply add the retention policy to the CR and default to delete with an option to retain. The status section in the RandomSecret CR could explain an override env variable but its probably not intuitive so lets exclude that idea for now.

davoustp commented 1 year ago

Ok. Let me summarize, this means:

@trevorbox @raffaelespazzoli @smnbbrv agreed?

If yes, I'll work along those lines and propose a PR.

smnbbrv commented 1 year ago

@davoustp LGTM, thank you! The default delete would also be aligned with https://kubernetes.io/docs/concepts/storage/storage-classes/#reclaim-policy

raffaelespazzoli commented 1 year ago

@davoustp I like it.

davoustp commented 3 days ago

Hi @trevorbox , @smnbbrv and @raffaelespazzoli Sorry, been busy :-( - and now I finally got a few hours to get back to this.

I just implemented what was discussed above a while ago (you can can have a look at fork above, code changes are quite small):

However, and before I create the PR, I failed to find any existing RandomSecret integration tests that I could enrich with these use cases. Are there any? If not, I'm not that confident to scaffold everything from scratch, to be honest... What would be the best course of actions?