kubernetes / kubernetes

Production-Grade Container Scheduling and Management
https://kubernetes.io
Apache License 2.0

Enable deleting API objects even when storage-level decryption is not working properly #86489

Open immutableT opened 4 years ago

immutableT commented 4 years ago

What happened: Users are unable to delete secrets when the kms provider that originally encrypted them can no longer decrypt them. The kms provider may fail to decrypt secrets for several reasons; the most common is that users deleted or disabled the version of the key that was originally used to encrypt the secrets.

What you expected to happen: Secrets to be deleted.

How to reproduce it (as minimally and precisely as possible):

  1. Set up a cluster with a kms provider of your choice.
  2. Create a secret and validate that the secret is encrypted.
  3. Reboot the cluster (this is required to clear the cache of Key Encryption Keys).
  4. Disable the key or key version that was used by the provider to encrypt the secret in step 2.
  5. Attempt to delete the secret. You should get an internal error that wraps the kms-plugin's specific error (the error will vary based on the plugin).

Note: the issue is probably not unique to the kms provider; it will manifest in any provider when the key that was used to encrypt the secret is no longer available.

Anything else we need to know?: I believe the cause of this behaviour is the fact that an object's metadata needs to be updated prior to deletion, which implies the need to transform the object from storage. However, that transformation is not possible because the KEK is unavailable. To address this issue, we would need to read the metadata of the object (while processing a delete) even if the KEK is not available; after all, during a delete, we don't care about the payload. Therefore, to enable this scenario we would need to move away from encrypting the whole object. Concretely, parts of the metadata should remain in cleartext. I realize that this opens up a lot of questions, and I could follow this issue up with a KEP.
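To make the idea of leaving parts of the metadata in cleartext concrete, here is a minimal sketch in Python. It is a toy model and not how Kubernetes stores objects: the record layout, the function names (`seal`, `read_payload`, `delete_record`), and the XOR "cipher" are all assumptions made for illustration, and XOR is not real encryption.

```python
import json
from typing import Optional


class KeyUnavailable(Exception):
    """Raised when the (hypothetical) key needed to read the payload is gone."""


def _xor(data: bytes, key: int) -> bytes:
    # Toy reversible transform standing in for a real cipher. NOT secure.
    return bytes(b ^ key for b in data)


def seal(obj: dict, key: int) -> dict:
    """Build a stored record: metadata stays in cleartext, payload is transformed."""
    payload = _xor(json.dumps(obj["data"]).encode(), key)
    return {"metadata": obj["metadata"], "payload": payload.hex()}


def read_payload(record: dict, key: Optional[int]) -> dict:
    """Recover the payload; fails if the key is unavailable (key is None)."""
    if key is None:
        raise KeyUnavailable("key encryption key is not available")
    return json.loads(_xor(bytes.fromhex(record["payload"]), key))


def delete_record(store: dict, name: str) -> bool:
    """Delete using only cleartext metadata; the payload is never decoded."""
    metadata = store[name]["metadata"]
    if metadata.get("finalizers"):
        # Finalizers are still pending, so the record is left in place.
        return False
    del store[name]
    return True
```

In this sketch, `delete_record` succeeds even when `read_payload` would raise `KeyUnavailable`, because it only touches the cleartext metadata. The cost is that whatever stays in cleartext is no longer protected at rest, which is one of the questions a proposal would have to answer.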

Environment:

/cc @liggitt @mikedanese @enj

/sig auth

liggitt commented 4 years ago

It is expected that inability to read from storage (including any required decryption from storage) will block access to the object (including read, write, and delete requests).

The deletion path requires checking things like finalizers, and finalizing controllers can take arbitrary actions via the API (reading and updating the object).
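As a rough illustration of why the deletion path depends on reading from storage, the following Python sketch models a store that must decode an object to inspect its finalizers before it can delete it. Everything here is hypothetical: `ToyTransformer`, `ToyStore`, and the XOR transform are stand-ins invented for this example, not Kubernetes code, and XOR is not real encryption.

```python
import json


class DecryptError(Exception):
    """Raised by the toy transformer when its key is unavailable."""


class ToyTransformer:
    """Stand-in for a storage transformer. XOR is NOT real encryption."""

    def __init__(self, key: int = 0x5A):
        self.key = key
        self.key_enabled = True

    def to_storage(self, data: bytes) -> bytes:
        return bytes(b ^ self.key for b in data)

    def from_storage(self, data: bytes) -> bytes:
        if not self.key_enabled:
            raise DecryptError("key version is disabled")
        return bytes(b ^ self.key for b in data)


class ToyStore:
    """In-memory store that encrypts whole objects, metadata included."""

    def __init__(self, transformer: ToyTransformer):
        self.transformer = transformer
        self.data = {}  # name -> encrypted bytes

    def create(self, name: str, obj: dict) -> None:
        self.data[name] = self.transformer.to_storage(json.dumps(obj).encode())

    def delete(self, name: str) -> str:
        # The whole object must be decoded just to look at its finalizers.
        obj = json.loads(self.transformer.from_storage(self.data[name]))
        if obj["metadata"].get("finalizers"):
            # Pending finalizers: mark for deletion and write the object back.
            obj["metadata"]["deletionTimestamp"] = "placeholder-timestamp"
            self.data[name] = self.transformer.to_storage(json.dumps(obj).encode())
            return "pending-finalizers"
        del self.data[name]
        return "deleted"
```

With the key enabled, `delete` either removes the object or marks it for deletion. Once `key_enabled` is set to `False`, `delete` raises `DecryptError` before it can inspect the finalizers, and the object stays in the store.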

Adding support for partial object encryption would be an enhancement, not really a bug fix, and would definitely need a proposal.

immutableT commented 4 years ago

I will work on a KEP for this.

fejta-bot commented 4 years ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.

/lifecycle stale

fejta-bot commented 4 years ago

Stale issues rot after 30d of inactivity. Mark the issue as fresh with /remove-lifecycle rotten. Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.

/lifecycle rotten

fejta-bot commented 4 years ago

Rotten issues close after 30d of inactivity. Reopen the issue with /reopen. Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.

/close

k8s-ci-robot commented 4 years ago

@fejta-bot: Closing this issue.

In response to [this](https://github.com/kubernetes/kubernetes/issues/86489#issuecomment-630469297):

>Rotten issues close after 30d of inactivity.
>Reopen the issue with `/reopen`.
>Mark the issue as fresh with `/remove-lifecycle rotten`.
>
>Send feedback to sig-testing, kubernetes/test-infra and/or [fejta](https://github.com/fejta).
>/close

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.

k8s-ci-robot commented 1 year ago

@enj: Reopened this issue.

In response to [this](https://github.com/kubernetes/kubernetes/issues/86489#issuecomment-1469942065):

>/reopen

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.

enj commented 1 year ago

/triage accepted

/assign @stlaz @ibihim

This was discussed at a high level in one of the recent SIG Auth meetings. The general idea is to propose a KEP (Standa and Krzysztof have been volunteered 😛) that would add two capabilities:

  1. Have the API server return which resource is failing to decrypt/decode in a structured way as part of the returned API status error
  2. Create a new field in delete options that allows the expression of "I want to delete this IFF there is a decrypt/decode error" (maybe with some form of dry run support?)

This would enable an external tool (kubectl plugin?) to be created (which could also be a SIG Auth sub-project) that allows an end user with sufficient access to the Kubernetes API (likely a cluster admin) to recover a cluster in which some subset of items cannot be decrypted/decoded (though maybe the tool could attempt partial recovery by asking for data from the watch cache?). An important aspect is that the determination of "is the bad state permanent/terminal" would be up to the end user (instead of the API server making decisions on behalf of the user).
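The flow such a tool might follow can be sketched in Python against an in-memory stand-in for the API server. Nothing here is a real Kubernetes API: `FakeAPI`, `StatusError`, the `StorageDecryptError` reason string, and the `only_if_undecryptable` and `dry_run` parameters are all hypothetical names chosen to mirror the two capabilities described above.

```python
class StatusError(Exception):
    """Hypothetical structured error that names the failing object."""

    def __init__(self, reason: str, namespace: str = "", name: str = ""):
        super().__init__(f"{reason}: {namespace}/{name}")
        self.reason = reason
        self.namespace = namespace
        self.name = name


class FakeAPI:
    """In-memory stand-in for an API server; not a real client."""

    def __init__(self, objects: dict, undecryptable: set):
        self.objects = dict(objects)             # (namespace, name) -> value
        self.undecryptable = set(undecryptable)  # keys that fail to decrypt

    def get(self, key):
        if key in self.undecryptable:
            raise StatusError("StorageDecryptError", *key)
        return self.objects[key]

    def delete(self, key, only_if_undecryptable=False, dry_run=False):
        broken = key in self.undecryptable
        if broken and not only_if_undecryptable:
            # A normal delete still fails, as it does today.
            raise StatusError("StorageDecryptError", *key)
        if only_if_undecryptable and not broken:
            # The object is readable, so the conditional delete is refused.
            raise StatusError("Conflict", *key)
        if not dry_run:
            del self.objects[key]
            self.undecryptable.discard(key)


def recover(api: FakeAPI, keys, confirm, dry_run=False):
    """Delete only undecryptable objects that the operator confirms are lost."""
    removed = []
    for key in keys:
        try:
            api.get(key)
        except StatusError as err:
            if err.reason != "StorageDecryptError":
                raise
            # The operator, not the server, decides whether the state is terminal.
            if confirm(err.namespace, err.name):
                api.delete(key, only_if_undecryptable=True, dry_run=dry_run)
                removed.append(key)
    return removed
```

The `confirm` callback is where the end user decides whether the bad state is permanent, and `dry_run=True` reports what would be removed without changing anything.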

Currently, direct access to etcd is required to delete items in this state.

k8s-triage-robot commented 7 months ago

This issue has not been updated in over 1 year, and should be re-triaged.

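You can:

  - Confirm that this issue is still relevant with `/triage accepted` (org members only)
  - Close this issue with `/close`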

For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/

/remove-triage accepted