openshift / origin

We store too many large secrets in etcd #15011

Open smarterclayton opened 7 years ago

smarterclayton commented 7 years ago

The primary scale limiter on large, dense clusters is secrets (namespaces × 3 service accounts per namespace × 3 secrets per service account × ~2-10 KB each; on a large 10k-namespace cluster that is ~90% of the cluster's stored data). We can try to reduce the constant scale factors now (gzip, ca.crt secrets per SA, SAs) and target structural changes that fit our long-term direction (remove service account tokens from storage and embrace dynamic secret generation).

Drafted some thoughts in kubernetes/kubernetes#48408 for the latter. Need to think through the former. This gates scale growth on any of our existing clusters because we can't afford a 500 MB download every time we want to list secrets.
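
A rough cross-check of those numbers, using the figures above rather than fresh measurements:

# ~10,000 namespaces x 3 service accounts x 3 secrets x ~5 KB (midpoint of the 2-10 KB range)
#   ≈ 90,000 secrets and ~450 MB serialized, consistent with the ~500 MB list mentioned above.
# The actual list payload on a live cluster can be measured directly:
$ oc get secrets --all-namespaces -o json | wc -c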

This is an online-first item (newly created!)

smarterclayton commented 7 years ago

@openshift/security @openshift/scale-performance-review

mfojtik commented 7 years ago

@smarterclayton do we support gzip in the API server? I guess that should reduce the transfer significantly.

jeremyeder commented 7 years ago

Can you somehow get a histogram of secret sizes, and perhaps their access frequency?
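
One rough way to gather that histogram, as a sketch assuming jq and awk are available to a caller who can list secrets cluster-wide (it counts only each secret's data payload, not metadata):

# Approximate per-secret size (sum of base64-encoded data values), bucketed by KB
$ oc get secrets --all-namespaces -o json \
    | jq -r '.items[] | ((.data // {}) | to_entries | map(.value | length) | add // 0)' \
    | awk '{ printf "%d KB\n", int($1/1024) }' | sort -n | uniq -c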

simo5 commented 7 years ago

One quick win could be to reference the (common to almost all) ca.crt in the JWT via a custom attribute, so that the ca.crt can be stored only once and looked up/cached as needed?

smarterclayton commented 7 years ago

@jeremyeder the access frequency of secrets is irrelevant, because all that is broken right now is LIST /secrets, which has to fetch 500 MB of data from etcd and then return 500 MB of data to the calling client. The distribution is almost flat, because we generate these secrets ourselves. You can see for yourself on any OpenShift system:

  1. create a namespace
  2. look at the created secrets (9 of them); 3 are dockercfg secrets, double base64 encoded, which encode another three, and they are proportional in size to the master ca.crt (commands sketched below)
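
A minimal way to reproduce that on any cluster (the namespace name below is arbitrary):

$ oc new-project size-test
$ oc get secrets -n size-test
$ oc get secrets -n size-test -o json | wc -c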

Only 10% of the secrets on this cluster are not these auto-generated ones; I'll get a distribution for you later. I don't expect them to matter until we deal with the 90%.

@simo5 re: referencing - because of the current structure of the API we have no way to "fake" the ca.crt so that it isn't stored in the API (the CA isn't in the JWT, it's laid down alongside it, but both are in the same secret).

@mfojtik it probably won't make a difference to the API server (which is one of the bottlenecks) because it has to unzip (more CPU), then decode (same amount of memory), then re-encode (same), then gzip to the client (more CPU). It might help and is worth testing, but I suspect only a total size reduction will matter.

deads2k commented 7 years ago

> @smarterclayton do we support gzip in the API server? I guess that should reduce the transfer significantly.

https://github.com/openshift/origin/pull/14958 adds compression. @ilackarms generated the data before I agreed to merge it into kube. It was pretty compelling.
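
Once that support is available, the transfer savings can be spot-checked from any client; a sketch, where $TOKEN and the master URL are placeholders (without --compressed, curl leaves the gzip body encoded, so wc -c reports on-the-wire bytes):

$ curl -sk -H "Authorization: Bearer $TOKEN" https://master.example.com:8443/api/v1/secrets | wc -c
$ curl -sk -H "Authorization: Bearer $TOKEN" -H "Accept-Encoding: gzip" https://master.example.com:8443/api/v1/secrets | wc -c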

deads2k commented 7 years ago

> It was pretty compelling.

Compelling for the client, which in the case of all secrets is commonly our controllers. They are currently relisting over and over.

ilackarms commented 7 years ago

@deads2k @smarterclayton results of those compression benchmarks are here: https://github.com/ilackarms/openshift-compression-performance

deads2k commented 7 years ago

> @deads2k @smarterclayton results of those compression benchmarks are here: https://github.com/ilackarms/openshift-compression-performance

If you've got the analyzer handy, reset that last group to start Y-axis at zero. It's closer in terms of CPU than your chart illustrates.

ilackarms commented 7 years ago

@deads2k updated, using the same data (just re-drew the chart)

smarterclayton commented 7 years ago

From us-east-1

$ oc get secrets builder-dockercfg-wpghm -o json | wc -c
    7613
$ oc get secrets builder-dockercfg-wpghm -o json | gzip -c - | wc -c
    4810
$ oc get secrets builder-token-9r6nb -o json | wc -c
    6322
$ oc get secrets builder-token-9r6nb -o json | gzip -c - | wc -c
    3210

Seems unlikely to move the needle, since ca.crt and JWT are both mostly incompressible.

May be worth a test. Unfortunately, from a complexity perspective, storing gzipped content would require a transformer implementation and is unlikely to be trivial.

deads2k commented 7 years ago

> Seems unlikely to move the needle, since ca.crt and JWT are both mostly incompressible.

But usually identical in all the secrets, so you get greater compression in a larger list. It's the list that kills us.
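
A quick way to see that effect on any namespace (a sketch; "myproject" is a placeholder): compress the whole list once, then compare against the sum of individually compressed secrets.

# Whole list, compressed once -- the repeated ca.crt/JWT material compresses across secrets
oc get secrets -n myproject -o json | gzip -c | wc -c

# Sum of individually compressed secrets -- no cross-secret redundancy to exploit
for s in $(oc get secrets -n myproject -o name); do
  oc get -n myproject "$s" -o json | gzip -c | wc -c
done | awk '{ sum += $1 } END { print sum }'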

smarterclayton commented 7 years ago

This isn't just about lists; it's about memory on the read and decode side from etcd. Etcd alone (which I can reproduce by simply listing) is being crushed by secret reads. The master also chokes on decode.

I was commenting about compression at the storage layer (the same per-key transformer mechanism used for encryption at rest), which is the only tool we have to reduce the payload size coming from etcd. Those transformations are per key and can't benefit from reuse across secrets.


smarterclayton commented 7 years ago

Jordan and I sketched out a likely approach for avoiding the need to store pull secrets.

In OpenShift, we enable pods to automatically have access to images via the service account they run under. Today that access is delivered via an auto-injected pull secret, which a controller creates from a new service account token for each service account. The node is instructed to use the pod's service account when pulling images that come from the internal registry.

Today on the nodes, a set of credential providers can also supply pull secrets implicitly at pull time. These are node- or infrastructure-level credentials, and they currently do not accept context about who the secret is intended for. It should be possible to implement a credential provider that, given context about which pod the image is being pulled for, returns an appropriately scoped credential that the integrated registry can verify allows node A to pull image B on behalf of service account C under pod D (which must be scheduled onto node A). This might be an opt-in provider, but it would have no more permission than the node already has (the node can access secrets a pod references).

Removing the need to store pull secrets would take us back to one service account token per service account (removing the dockercfg secret and the extra token), which accounts for ~60% of the size. It would remove the need to keep the secret up to date when the registry URL changes. Image stream import might be affected (since it uses dockercfg secrets in the current namespace). Ideally, the node would use its client cert to authenticate the pull (not supported in docker today), but it could potentially request a limited-time token from the master to pull as well.

If we move to a node-generated token for pod identity (essentially, the pod is given a service account token scoped to its use), then the node could pass this to the integrated registry or generate an even tighter-scoped token.
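
For context on feasibility (an illustrative sketch, not part of the proposal): the integrated registry already accepts a service account token as the docker password, which is what would let a node- or pod-supplied credential stand in for the stored dockercfg secret. The registry host and namespace below are placeholders:

$ TOKEN=$(oc sa get-token builder -n myproject)   # or read the SA's token secret directly
$ docker login -u serviceaccount -p "$TOKEN" docker-registry.default.svc:5000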

enj commented 7 years ago

> return an appropriately scoped credential that the integrated registry can verify allows node A to pull image B on behalf of service account C under pod D (which must be scheduled onto node A)

So we are going to build a registry authorizer? That statement just reminded me of the node authorizer's graph-based approach.

smarterclayton commented 7 years ago

No, the registry simply passes on the request, and the node authorizer may or may not need to be extended to also check transitive pull. This is more of a sketch than a full design, but permission to view pull secrets is equivalent to permission to use pull secrets, which is equivalent to permission to pull. If done right it's better, because there is no transient secret that could be misappropriated.

deads2k commented 7 years ago

I'm not seeing how this sketch handles the case for the build controller choosing a secret to push with. Are you going to extend that controller too? I'm not sure of all the knowledge currently available at that point in the code.

smarterclayton commented 7 years ago

The build controller tells the build pod what secret to use for pushing and for pulling. Whether the pod changes or the controller changes doesn't matter, except insofar as the implicit pull secret can be used. The implicit pull secret would probably need to be a service account token, or we would instruct the build to use its local service account token to do the pull (which depends on what happens in the pull request).

This is primarily about pull secrets and their representation, independent of builds.


deads2k commented 7 years ago

> The build controller tells the build pod what secret to use for pushing and for pulling. Whether the pod changes or the controller changes doesn't matter, except insofar as the implicit pull secret can be used. The implicit pull secret would probably need to be a service account token, or we would instruct the build to use its local service account token to do the pull (which depends on what happens in the pull request).

Maybe I'm missing a bit, but isn't this talking about not creating the pull secret that the build pod attempts to mount? I'm not seeing how the build controller's technique of "find mountable image pull secrets and mount it in my pod" works if the pull secret can't be mounted.

liggitt commented 7 years ago

> I'm not seeing how the build controller's technique of "find mountable image pull secrets and mount it in my pod" works if the pull secret can't be mounted.

  1. the build controller would be told what IP addresses / DNS names have registries that accept service account tokens as credentials
  2. the build pods would gain two config bits the build controller could set (sketched below):
    • form a pull secret with their service account credentials
    • form a push secret with their service account credentials
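
What "form a pull secret with their service account credentials" could look like from inside the pod, as a rough sketch (the registry host is a placeholder; the token path is the standard service account mount):

TOKEN=$(cat /var/run/secrets/kubernetes.io/serviceaccount/token)
AUTH=$(printf 'serviceaccount:%s' "$TOKEN" | base64 | tr -d '\n')
cat > /tmp/generated.dockercfg <<EOF
{"docker-registry.default.svc:5000": {"auth": "$AUTH"}}
EOF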

deads2k commented 7 years ago

> the build pods would gain two config bits the build controller could set: form a pull secret with their service account credentials; form a push secret with their service account credentials

And information about which IPs and DNS names the SA token is good for.

openshift-bot commented 6 years ago

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close. Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

enj commented 6 years ago

/lifecycle frozen

enj commented 4 years ago

/unassign

@stlaz @sttts @mfojtik