kubernetes / website

Kubernetes website and documentation repo:
https://kubernetes.io

Security information about AES-GCM key rotation is misleading #39477

Closed dgrisonnet closed 2 weeks ago

dgrisonnet commented 1 year ago

The official documentation for encryption at rest providers has some pretty confusing security guidelines for AES-GCM: https://kubernetes.io/docs/tasks/administer-cluster/encrypt-data/#providers.

I am no security expert, but the requirement to rotate an AES-GCM key every 200k writes is pretty confusing to me, and I couldn't find the origin of this value anywhere. In k/k we are using AES-GCM with the default random 96-bit nonce provided by go/crypto, which should allow far more uses of a key than the theoretical 200k from the doc.

If we rely on the NIST guidelines in https://tsapps.nist.gov/publication/get_pdf.cfm?pub_id=51288#page=29, we should theoretically be able to use an AES-GCM key up to 2^32 times with a very low risk of a nonce collision. Even in the event of a collision, the probability that the two encrypted objects sharing a nonce are present in etcd at the same time is very small, making the threat level very low.

/cc @enj @ibihim @tkashem

dgrisonnet commented 1 year ago

/sig auth

sftim commented 1 year ago

/kind feature /language en

and also: /sig security

sftim commented 1 year ago

Hang on. /remove-kind feature /kind bug

sftim commented 1 year ago

This question / issue is similar to https://stackoverflow.com/questions/70556732/why-aes-gcm-with-random-nonce-should-be-rotated-every-200k-writes-in-kubernetes

I think the concern is that the API server isn't guaranteed to have a cryptographically trustworthy source of nonces.

Nonce-Disrespecting Adversaries: Practical Forgery Attacks on GCM in TLS states (for a similar but different context):

For safety reasons random nonces should be avoided and a counter should be used.
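For context, the counter-based alternative the paper recommends corresponds to NIST SP 800-38D's deterministic IV construction (a fixed field plus an invocation counter). A minimal sketch of that construction, purely for illustration; per the discussion above, this is not how the apiserver builds nonces today (it uses random nonces):

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// deterministicNonce builds a 96-bit GCM nonce from a fixed context field and a
// per-key invocation counter (NIST SP 800-38D deterministic construction).
// The counter must never repeat for the same key.
func deterministicNonce(fixed uint32, invocation uint64) []byte {
	nonce := make([]byte, 12)
	binary.BigEndian.PutUint32(nonce[:4], fixed)
	binary.BigEndian.PutUint64(nonce[4:], invocation)
	return nonce
}

func main() {
	fmt.Printf("%x\n", deterministicNonce(0x01, 42)) // 00000001000000000000002a
}
```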

dgrisonnet commented 1 year ago

Thank you for the references, I will have a deeper look at Nonce-Disrespecting Adversaries: Practical Forgery Attacks on GCM in TLS.

For safety reasons random nonces should be avoided and a counter should be used.

My understanding is that NIST recommends a 12-byte nonce combined with a 4-byte counter, which is what go/crypto implements in: https://cs.opensource.google/go/go/+/refs/tags/go1.20:src/crypto/cipher/gcm.go;l=388-409;drc=0cd309e12818f988693bf8e4d9f1453331dcf9f2;bpv=0;bpt=1
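For readers less familiar with the API, here is a minimal sketch of how Go's standard-library AES-GCM is typically used with a 96-bit random nonce. This is only an illustration of crypto/cipher, not the apiserver's actual transformer code:

```go
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"fmt"
)

func main() {
	// 32-byte key (AES-256); AES-GCM also works with 16- or 24-byte keys.
	key := make([]byte, 32)
	if _, err := rand.Read(key); err != nil {
		panic(err)
	}

	block, err := aes.NewCipher(key)
	if err != nil {
		panic(err)
	}
	aead, err := cipher.NewGCM(block) // default 96-bit (12-byte) nonce size
	if err != nil {
		panic(err)
	}

	// A fresh random nonce per encryption; reusing a nonce under the same key
	// is exactly what the rotation guidance is trying to prevent.
	nonce := make([]byte, aead.NonceSize())
	if _, err := rand.Read(nonce); err != nil {
		panic(err)
	}

	plaintext := []byte(`{"kind":"Secret","data":"..."}`)
	ciphertext := aead.Seal(nonce, nonce, plaintext, nil) // nonce prepended to ciphertext

	decrypted, err := aead.Open(nil, ciphertext[:aead.NonceSize()], ciphertext[aead.NonceSize():], nil)
	if err != nil {
		panic(err)
	}
	fmt.Printf("roundtrip ok: %t\n", string(decrypted) == string(plaintext))
}
```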

Sajiyah-Salat commented 1 year ago

/assign

dgrisonnet commented 1 year ago

As far as I understand the document you shared, the AES-GCM failures in TLS were mostly due to a lack of guidance and requirements for implementations, which made many of them likely to run into nonce duplication.

In our case go/crypto is using random nonce generation, and according to the TLS paper we should expect a collision probability of:

If choosing nonces at random, after 2^28 encryptions the probability of a nonce collision will be around 0.2% due to the birthday paradox.

I am not knowledgeable enough to get an actual collision estimate for our scenario, but that already sounds insanely high and very unlikely compared to our requirement of rotating the keys every 200k writes.

And if we were to apply this probability to the Kubernetes scenario, where the encrypted data in etcd is replaced on every update to an object, it is very unlikely that we end up with the same nonce used in two encrypted objects that are present in etcd at the same time.

ibihim commented 1 year ago

The nonces in that paper are 64 bits, which is less than the usual 96 bits used by Go by default. I read that the probability of a collision due to the birthday paradox should be 1 in 4 billion. I tried to use approximations in both Haskell and Wolfram, and both show a probability of 1.16 * 10^-10, which is one in 8.59 billion.
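For what it's worth, a quick back-of-the-envelope check of those figures, using the usual birthday-bound approximation p ≈ n(n-1)/2^(b+1) for n random b-bit nonces (a sketch, not a rigorous analysis):

```go
package main

import (
	"fmt"
	"math"
)

// collisionProb approximates the probability that at least two of n randomly
// chosen b-bit nonces collide, using the birthday bound n*(n-1) / 2^(b+1).
func collisionProb(n, bits float64) float64 {
	return n * (n - 1) / math.Exp2(bits+1)
}

func main() {
	// 64-bit nonces after 2^28 encryptions (the TLS paper's scenario): ~0.2%
	fmt.Printf("64-bit nonce, 2^28 uses: %.4f%%\n", 100*collisionProb(math.Exp2(28), 64))

	// 96-bit nonces after 2^32 encryptions (the limit discussed here): ~1.16e-10,
	// i.e. roughly 1 in 8.59 billion.
	p := collisionProb(math.Exp2(32), 96)
	fmt.Printf("96-bit nonce, 2^32 uses: %.3g (about 1 in %.3g)\n", p, 1/p)
}
```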

A strong random source is necessary to ensure security. There have been papers showing nearly identical RSA keys on thousands of servers because they were generated immediately after boot. However, this should not be an issue here: modern Linux kernels usually generate random numbers quickly, and the kernel signals an event once its CSPRNG has been seeded. If the randomness of the key itself is not guaranteed, then the nonce is a smaller issue... and let's not talk about the certs generated before that.

ibihim commented 1 year ago

It is documented here as well: apiserver/aes.go.

dgrisonnet commented 1 year ago

@smarterclayton do you perhaps remember why 200k writes was chosen for the aesgcm key rotation requirement? There seems to be a common misunderstanding where end users believe that 200k is a hard limit, but as far as we have researched in this issue, using an aesgcm key 2^32 times would carry a collision risk of about one in ~8.59 billion. From that observation, 2^32 should already be a conservative enough value.

There is another problem with recommending write-based rotation today: there is no real way for end users to know how many times a key has been used, because the apiserver doesn't maintain a count of writes per key. So, although the write-based approach is more appropriate for encryption key rotation, I think it would be better to recommend safe time-based thresholds for now, until the apiserver can provide accurate measurements. And considering how long it would take to use an aesgcm key 2^32 times, a time-based rotation should be safe as long as the threshold is reasonable.

dgrisonnet commented 1 year ago

I discussed this topic offline with @enj and he was fine with updating the doc with more accurate values, as long as we explain the math behind them and the threat model they are based on.

dgrisonnet commented 1 year ago

@Sajiyah-Salat I hope you don't mind, but I'll assign this issue to myself since its resolution is quite important to me.

/assign

sftim commented 1 year ago

@dgrisonnet this is one for SIG Auth to triage - you might like to mark it accepted

dgrisonnet commented 1 year ago

Sure, I already discussed this one with Mo, so it should be fine to triage it. Thanks for the reminder :)

/triage accepted

smarterclayton commented 1 year ago

From rusty memory:

I’m pretty sure we calculated how hard it would be for an adversarial client who has access to the encrypted etcd values (MITM between apiserver and etcd, or the ability to snoop the etcd server's disk but not its memory) to force a collision. Today the sustained write rate of a cluster is about 1000 writes/sec for some of the busiest clusters. Every write is a new opportunity for a collision on the nonce. An attacker that can observe the encrypted contents in etcd would know the plaintext (or very close to it) and would know the encrypted value. That’s 86,400,000 plaintext / encrypted pairs per day - once any nonce is reused, all writes from that counter impl are at risk. So we applied the birthday problem to how many reused writes we expected in an interval, not to the chance of a single collision.

Edit: That’s 50 days before you exceed the 2^32 uses recommendation, which was plausible for an attacker on a poorly monitored system. So because we didn’t want people to be vulnerable to this, we didn’t recommend it. Does Go’s counter implementation mitigate this vector?
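To make that arithmetic explicit, a quick sketch using the write rate cited above:

```go
package main

import (
	"fmt"
	"math"
)

func main() {
	const writesPerSec = 1000.0          // busy-cluster sustained write rate cited above
	writesPerDay := writesPerSec * 86400 // 86,400,000 nonce uses per day
	limit := math.Exp2(32)               // recommended cap on invocations per key with random IVs

	fmt.Printf("encryptions per day: %.0f\n", writesPerDay)
	fmt.Printf("days to reach 2^32 uses: %.1f\n", limit/writesPerDay) // ~49.7 days
}
```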

https://github.com/kubernetes/kubernetes/pull/41939/files#diff-858d0b1c7cefcadbaae3c7c7d1c6b36da867f765640400da7db296c3f69eb9e5R36 is the earliest reference and the math seems like what I remember.

sftim commented 1 year ago

"An attacker that can observe the encrypted contents in etcd **would know the plaintext** (or very close to it) and would know the encrypted value."

(my emphasis)

If the attacker knows the plaintext, no amount of key rotation will help with confidentiality. Is that the wrong word, or am I missing something?

sftim commented 1 year ago

Ah, it's a client that uses chosen plaintext and tries to force a collision. Got it.

k8s-triage-robot commented 5 months ago

This issue has not been updated in over 1 year, and should be re-triaged.

You can:

- Confirm that this issue is still relevant with `/triage accepted` (org members only)
- Close this issue with `/close`

For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/

/remove-triage accepted

k8s-triage-robot commented 2 months ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:

- Mark this issue as fresh with `/remove-lifecycle stale`
- Close this issue with `/close`
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot commented 1 month ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:

- Mark this issue as fresh with `/remove-lifecycle rotten`
- Close this issue with `/close`
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot commented 2 weeks ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:

- Reopen this issue with `/reopen`
- Mark this issue as fresh with `/remove-lifecycle rotten`
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-ci-robot commented 2 weeks ago

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to [this](https://github.com/kubernetes/website/issues/39477#issuecomment-2364958954):

> The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
>
> This bot triages issues according to the following rules:
> - After 90d of inactivity, `lifecycle/stale` is applied
> - After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
> - After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed
>
> You can:
> - Reopen this issue with `/reopen`
> - Mark this issue as fresh with `/remove-lifecycle rotten`
> - Offer to help out with [Issue Triage][1]
>
> Please send feedback to sig-contributor-experience at [kubernetes/community](https://github.com/kubernetes/community).
>
> /close not-planned
>
> [1]: https://www.kubernetes.dev/docs/guide/issue-triage/

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository.