Closed: dgrisonnet closed this issue 2 weeks ago
/sig auth
/kind feature
and also: /sig security
Hang on. /remove-kind feature /kind bug
This question / issue is similar to https://stackoverflow.com/questions/70556732/why-aes-gcm-with-random-nonce-should-be-rotated-every-200k-writes-in-kubernetes
I think the concern is that the API server isn't guaranteed to have a cryptographically trustworthy source of nonces.
Nonce-Disrespecting Adversaries: Practical Forgery Attacks on GCM in TLS states (for a similar but different context):
For safety reasons random nonces should be avoided and a counter should be used.
Thank you for the references, I will have a deeper look at Nonce-Disrespecting Adversaries: Practical Forgery Attacks on GCM in TLS.
For safety reasons random nonces should be avoided and a counter should be used.
My understanding is that NIST recommends a 96-bit (12-byte) nonce, combined internally with a 4-byte block counter, and that this is what go/crypto implements in: https://cs.opensource.google/go/go/+/refs/tags/go1.20:src/crypto/cipher/gcm.go;l=388-409;drc=0cd309e12818f988693bf8e4d9f1453331dcf9f2;bpv=0;bpt=1
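For context, here is a minimal sketch of AES-GCM with a random 96-bit nonce using Go's standard library (`cipher.NewGCM` defaults to a 12-byte nonce). This is an illustration of the go/crypto API, not the apiserver's actual transformer code; the nonce-prepending layout is a common convention, assumed here for the example:

```go
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"fmt"
	"io"
)

// sealWithRandomNonce encrypts plaintext with AES-GCM using a fresh random
// 96-bit (12-byte) nonce, the default nonce size of cipher.NewGCM.
// The nonce is prepended to the returned ciphertext.
func sealWithRandomNonce(key, plaintext []byte) ([]byte, error) {
	block, err := aes.NewCipher(key)
	if err != nil {
		return nil, err
	}
	aead, err := cipher.NewGCM(block)
	if err != nil {
		return nil, err
	}
	nonce := make([]byte, aead.NonceSize()) // 12 bytes
	if _, err := io.ReadFull(rand.Reader, nonce); err != nil {
		return nil, err
	}
	// Seal appends the ciphertext (and 16-byte tag) to the nonce.
	return aead.Seal(nonce, nonce, plaintext, nil), nil
}

// openWithPrependedNonce reverses sealWithRandomNonce.
func openWithPrependedNonce(key, sealed []byte) ([]byte, error) {
	block, err := aes.NewCipher(key)
	if err != nil {
		return nil, err
	}
	aead, err := cipher.NewGCM(block)
	if err != nil {
		return nil, err
	}
	ns := aead.NonceSize()
	return aead.Open(nil, sealed[:ns], sealed[ns:], nil)
}

func main() {
	key := make([]byte, 32) // AES-256
	if _, err := io.ReadFull(rand.Reader, key); err != nil {
		panic(err)
	}
	sealed, err := sealWithRandomNonce(key, []byte("secret"))
	if err != nil {
		panic(err)
	}
	plain, err := openWithPrependedNonce(key, sealed)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(plain))
}
```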
/assign
As far as I understand the document you shared, the AES-GCM failures in TLS were mostly due to a lack of guidance and requirements for implementations, which made many of them likely to run into nonce duplication.
In our case go/crypto is using random nonce generation and according to the TLS doc, we should have a collision probability of:
If choosing nonces at random, after 2^28 encryptions the probability of a nonce collision will be around 0.2% due to the birthday paradox.
I am not knowledgeable enough to produce an actual collision estimate for our scenario, but 2^28 encryptions already sounds insanely high and very unlikely to be reached compared to our requirement of rotating the keys every 200k writes.
And if we apply this probability to the Kubernetes scenario, where the encrypted data in etcd is replaced on every update to an object, it is very unlikely that we would end up with the same nonce used by two encrypted objects present in etcd at the same time.
The nonces in that paper are 64 bits, less than the 96 bits Go uses by default. I read that the probability of a collision due to the birthday paradox should be 1 in 4 billion. I tried approximations in both Haskell and Wolfram, and both show a probability of 1.16 * 10^-10, which is one in 8.59 billion.
A strong random source is necessary to ensure security. There have been papers showing identical RSA keys on thousands of servers because they were generated immediately after boot. However, this should not be an issue here: modern Linux kernels usually gather entropy quickly, and Linux emits an event once it has seeded its CSPRNG. If the randomness of the key itself is not guaranteed, then the nonce is the smaller issue... and let's not talk about the certs generated before that point.
It is documented here as well: apiserver/aes.go.
@smarterclayton do you perhaps remember why 200k writes was chosen for the aesgcm key rotation requirements? There seems to be a common misunderstanding where end users believe that 200k is a hard limit, but as far as we researched in this issue, using an aesgcm key 2^32 times would carry a collision risk of about one in ~8.59 billion. From that observation, 2^32 should already be a conservative enough value.
There is another problem with recommending write-based rotation today: there is no real way for end users to know how many times a key was used, because the apiserver doesn't maintain any state tracking the number of writes per key. So, although the write-based approach is more appropriate for encryption key rotation, I think it would be better to recommend safe time-based thresholds for now, until the apiserver can provide accurate measurements. And considering how long it would take to use an aesgcm key 2^32 times, a time-based rotation should be safe as long as the threshold is reasonable.
I discussed that topic offline with @enj and he was fine with updating the doc with more accurate values as long as we explain the math behind them and the threat model it was based on.
@Sajiyah-Salat I hope you don't mind, but I'll assign this issue to myself since its resolution is quite important to me.
/assign
@dgrisonnet this is one for SIG Auth to triage - you might like to mark it accepted
Sure, I already discussed this one with Mo, so it should be fine to triage it. Thanks for the reminder :)
/triage accepted
From rusty memory:
I’m pretty sure we calculated how hard it would be for an adversarial client with access to the encrypted etcd values (MITM between apiserver and etcd, or the ability to snoop the etcd server's disk but not its memory) to force a collision. Today the sustained write rate of a cluster is about 1000 writes/sec for some of the busiest clusters. Every write is a new opportunity for a collision on the nonce. An attacker that can observe the encrypted contents in etcd would know the plaintext (or very close to it) and would know the encrypted value. That’s 86,400,000 plaintext/encrypted pairs per day - once any nonce is reused, all writes from that counter impl are at risk. So we applied the birthday problem to how many reused writes we expected in an interval, not to a single collision.
Edit: That’s 50 days before you exceed the 2^32 uses recommendation, which was plausible for an attacker on a poorly monitored system. So because we didn’t want people to be vulnerable to this, we didn’t recommend it. Does Go’s counter implementation mitigate this vector?
https://github.com/kubernetes/kubernetes/pull/41939/files#diff-858d0b1c7cefcadbaae3c7c7d1c6b36da867f765640400da7db296c3f69eb9e5R36 is the earliest reference and the math seems like what I remember.
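To sanity-check the arithmetic in that comment (a sketch, not project code): at a sustained 1000 writes/sec, a full day of writes yields 86,400,000 plaintext/ciphertext pairs, and the 2^32-uses recommendation is exhausted in roughly 50 days:

```go
package main

import "fmt"

// daysToExhaust returns how many days it takes to perform `uses`
// encryptions at a sustained rate of writesPerSecond.
func daysToExhaust(uses, writesPerSecond float64) float64 {
	const secondsPerDay = 86400
	return uses / (writesPerSecond * secondsPerDay)
}

func main() {
	// 1000 writes/sec sustained -> 86,400,000 pairs per day.
	fmt.Printf("pairs per day: %.0f\n", 1000*86400.0)
	// ~49.7 days to reach the 2^32-uses recommendation.
	fmt.Printf("days to 2^32 uses: %.1f\n", daysToExhaust(1<<32, 1000))
}
```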
From rusty memory:
I’m pretty sure we calculated how hard it would be for an adversarial client to force a collision. Today the sustained write rate of a cluster is about 1000 writes/sec for some of the busiest clusters. Every write is a new opportunity for a collision on the nonce. An attacker that can observe the encrypted contents in etcd would know the plaintext (or very close to it) and would know the encrypted value. That’s 86,400,000 plaintext/encrypted pairs per day - once any nonce is reused, all writes from that counter impl are at risk. So we applied the birthday problem to how many reused writes we expected in an interval, not to a single collision.
Edit: That’s 50 days before you exceed the 2^32 uses recommendation, which was plausible for an attacker on a poorly monitored system. So because we didn’t want people to be vulnerable to this, we didn’t recommend it. Does Go’s counter implementation mitigate this vector?
https://github.com/kubernetes/kubernetes/pull/41939/files#diff-858d0b1c7cefcadbaae3c7c7d1c6b36da867f765640400da7db296c3f69eb9e5R36 is the earliest reference and the math seems like what I remember.
(my emphasis)
If the attacker knows the plaintext, no amount of key rotation will help with confidentiality. Is that the wrong word, or am I missing something?
Ah, it's a client that uses chosen plaintext and tries to force a collision. Got it.
This issue has not been updated in over 1 year, and should be re-triaged.
You can:
/triage accepted (org members only)
/close
For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/
/remove-triage accepted
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After a period of inactivity, lifecycle/stale is applied
- After a further period of inactivity once lifecycle/stale was applied, lifecycle/rotten is applied
- After a further period of inactivity once lifecycle/rotten was applied, the issue is closed
You can:
/remove-lifecycle stale
/close
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After a period of inactivity, lifecycle/stale is applied
- After a further period of inactivity once lifecycle/stale was applied, lifecycle/rotten is applied
- After a further period of inactivity once lifecycle/rotten was applied, the issue is closed
You can:
/remove-lifecycle rotten
/close
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues according to the following rules:
- After a period of inactivity, lifecycle/stale is applied
- After a further period of inactivity once lifecycle/stale was applied, lifecycle/rotten is applied
- After a further period of inactivity once lifecycle/rotten was applied, the issue is closed
You can:
/reopen
/remove-lifecycle rotten
Please send feedback to sig-contributor-experience at kubernetes/community.
/close not-planned
@k8s-triage-robot: Closing this issue, marking it as "Not Planned".
The official documentation for encryption at rest providers has some pretty confusing security guidelines for AES-GCM: https://kubernetes.io/docs/tasks/administer-cluster/encrypt-data/#providers.
I am no security expert, but the mention of having to rotate an AES-GCM key every 200k writes is pretty confusing to me, and I couldn't find the origin of this value anywhere. In k/k we are using AES-GCM with the default random nonce of 96 bits provided by go/crypto, which should allow far more uses of a key than the 200k from the doc.
If we rely on the guidelines from the NIST in https://tsapps.nist.gov/publication/get_pdf.cfm?pub_id=51288#page=29, we should theoretically be able to use an AES-GCM key up to 2^32 times with a very low risk of a nonce collision. Even in the event of a collision, the probability that the encrypted objects with the same nonce are present in etcd at the same time is very small, making the threat level very low.
/cc @enj @ibihim @tkashem