kubernetes / kubernetes

Production-Grade Container Scheduling and Management
https://kubernetes.io
Apache License 2.0

apiserver does not retry generateName on collisions #115489

Open thockin opened 1 year ago

thockin commented 1 year ago

xref https://github.com/kubernetes/community/pull/7063

I am sure that such collisions are rare (though I didn't see a metric - could add one!) but it seems surprising that our API offers a "generate a name for me" option which can fail with "name in use".

The docs used to imply that apiserver will retry upon name-conflict, but it does not seem to actually do so.

See staging/src/k8s.io/apiserver/pkg/registry/generic/registry/store.go Store.Create() which calls objectMeta.SetName(e.CreateStrategy.GenerateName(...)) one time, before passing off to e.Storage.Create() and then looking at the result with rest.CheckGeneratedNameError().

I thought maybe this is retried higher up the stack, but I couldn't find it, and @lavalamp and @liggitt seemed to concur.

We agreed it was surprising and this issue is a chance to discuss if it is a bug or not.
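For concreteness, a server-side retry might look something like the sketch below. This is illustrative Go, not the actual store.go code path; createWithRetry, the in-memory create function, and the bounded attempt count are all hypothetical stand-ins (the 27-character alphabet matches what generateName uses today):

```go
package main

import (
	"errors"
	"fmt"
	"math/rand"
)

// ErrNameInUse stands in for the "already exists" error the real
// storage layer returns.
var ErrNameInUse = errors.New("name in use")

const alphabet = "bcdfghjklmnpqrstvwxz2456789" // 27 chars

func randomSuffix(n int) string {
	b := make([]byte, n)
	for i := range b {
		b[i] = alphabet[rand.Intn(len(alphabet))]
	}
	return string(b)
}

// createWithRetry sketches what an apiserver-side retry could look
// like: regenerate the suffix and re-attempt the storage create a
// bounded number of times before surfacing the conflict.
func createWithRetry(prefix string, attempts int, create func(name string) error) (string, error) {
	var err error
	for i := 0; i < attempts; i++ {
		name := prefix + randomSuffix(5)
		if err = create(name); err == nil {
			return name, nil
		}
		if !errors.Is(err, ErrNameInUse) {
			return "", err // only retry on name collisions
		}
	}
	return "", err
}

func main() {
	taken := map[string]bool{}
	create := func(name string) error {
		if taken[name] {
			return ErrNameInUse
		}
		taken[name] = true
		return nil
	}
	name, err := createWithRetry("demo-", 3, create)
	fmt.Println(name, err)
}
```

Note that, as discussed below, bounding the attempts means this reduces but does not eliminate user-visible collisions.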

lavalamp commented 1 year ago

We have to bound the amount of effort apiserver spends on this. Doing N retries in apiserver doesn't guarantee success, either.

My guess is that someone ran the numbers and was like, with 1,000,000 items you've got only a 7% chance of getting a dup.

3 internal retries could get that down to a ~.03% chance. Is the complexity worth reducing the number of times users see a collision?

For a reasonable size of 10k items, a single try already has only a ~.07% chance of a duplicate.

(note that this is not a case for using a birthday estimation, since the question that answers is "was a duplicate ever generated in the process of generating all of the names")
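The single-try estimates above are easy to check numerically. A quick sketch (the 14,348,907-name space of 5-character suffixes over the 27-character alphabet matches the math in this thread):

```go
package main

import "fmt"

// Space of 5-character suffixes over the 27-character alphabet.
const space = 27 * 27 * 27 * 27 * 27 // 14,348,907

// collisionChance returns the probability that one freshly generated
// suffix collides with any of n existing names (uniform, independent
// draws).
func collisionChance(n float64) float64 { return n / space }

func main() {
	p := collisionChance(1_000_000)
	fmt.Printf("1M items, single try:  %.1f%%\n", p*100)     // ~7%
	fmt.Printf("1M items, 3 tries:     %.3f%%\n", p*p*p*100) // ~0.034%
	fmt.Printf("10k items, single try: %.2f%%\n", collisionChance(10_000)*100)
}
```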

Note that absolute disaster strikes when there are 7M items (half as many as the random name generator can generate). Starting at that point the math begins to work against us and random retrying becomes a terrible algorithm - the bogosort of naming.

This is something we would have to fix to get clusters ~2+ orders of magnitude larger than today, I think.

ninjaprox commented 1 year ago

Would increasing the length of the random string also reduce the probability? What is its cost compared to server retries?

thockin commented 1 year ago

The random-suffix length comes out of the total available name length, which is somewhat restricted for some resources. Making it longer just reduces the user-available size. We already do not handle this very cleanly (e.g. deployment can be long, then replicaset adds a suffix, then pod adds a suffix).

This is all true, and yet, in practice, it hasn't been a major problem. Adding a single extra random digit grows from ~14M possible to ~387M - but I am not sure it's worthwhile. If Daniel's math is right (I didn't check it, but I have no reason to doubt) it is a pretty small risk. Adding to that, the largest user of generateName is ReplicaSet, which will retry on failure anyway.

So maybe the answer here is just "add a metric".

It's also worth noting that the random suffix added to each replicaset is ELEVEN characters (10 + "-"). That seems excessive. If we made that 6 (or 7) and made generateName use 6 (or 7), we'd have shorter names overall, lower risk of leaf-pod collisions, and almost certainly still enough randomness. Also, it seems Deployment->RS doesn't use generateName - not sure why; maybe it's a hash of something (they all look like hex), which might make sense, but I didn't look. Still, it could be worth a look.

We could go one further and change generateName to take a configurable suffix length, so if there are extra characters available we can use them, e.g.:

```yaml
# uses a standard 5-character suffix
generateName: foobar-
# uses a 7-character suffix
generateName: foobar-*******  # could be XXXXXX like mktemp, but * seems less ambiguous
```

That's pretty compatible. Or even simpler, make generateName use at least 5 characters, but up to 8 if the space is available. Or even randomize the suffix length between 5 and 8 - giving > 293 billion possible names.

All this without adding a retry loop. Worthwhile?
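To put numbers on the variable-length idea, here is a quick sketch computing the size of the suffix space for lengths 5 through 8, assuming the current 27-character alphabet:

```go
package main

import "fmt"

// pow27 returns 27^n: the number of distinct suffixes of length n
// over the 27-character generateName alphabet.
func pow27(n int) uint64 {
	v := uint64(1)
	for i := 0; i < n; i++ {
		v *= 27
	}
	return v
}

func main() {
	var total uint64
	for n := 5; n <= 8; n++ {
		total += pow27(n)
		fmt.Printf("length %d: %d names\n", n, pow27(n))
	}
	fmt.Printf("lengths 5-8 combined: %d names\n", total) // ~293 billion
}
```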


lavalamp commented 1 year ago

Increasing the character set would help too (could add some capital letters) -- problem is I'm not sure if uppercase is permitted for every collection.

Another approach is to make this more deterministic, one possibility:

Imagine an object we store in etcd which is a list of N buckets, each a single uint64.

On startup apiserver reads this object (stores it locally), then writes back a copy with every number incremented by M. (M starts small, like 100, and is stored locally per-bucket; repeat until no conflicts)

To use this, you hash the prefix to select the bucket, take the number, hash that together with the prefix[*], write it in base 27, truncate (effectively a mod operation), use that as the suffix, and increment the number in the bucket.

If you hit the starting number + M for a bucket, round trip to etcd again to reserve another block of numbers just for that bucket. Double M when you do this up to some maximum value to make this process rarer for buckets that are getting more use.

I think this system would let you use a greater % of the possible suffixes before it fails catastrophically. But it is more complex and we don't need it for another couple orders of magnitude of cluster size.

To get bigger than this, we simply need to make the space of possible suffixes bigger, or make some other drastic change.

[*] the second hash is to make sure different prefixes use different sequences, which should improve rollover behavior.
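A very rough in-memory sketch of the bucket scheme described above. A real implementation would persist the reserved ranges in etcd; the sizes here (64 buckets, blocks of 100) and all names are hypothetical:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

const alphabet = "bcdfghjklmnpqrstvwxz2456789" // 27 chars

// bucketAllocator holds a per-bucket counter cursor and the upper
// bound of the block reserved for that bucket.
type bucketAllocator struct {
	next, limit []uint64
	reserveSize uint64
}

func newBucketAllocator(buckets int, reserve uint64) *bucketAllocator {
	a := &bucketAllocator{
		next:        make([]uint64, buckets),
		limit:       make([]uint64, buckets),
		reserveSize: reserve,
	}
	for i := range a.limit {
		a.limit[i] = reserve // pretend we reserved [0, reserve) at startup
	}
	return a
}

func hash64(s string) uint64 {
	h := fnv.New64a()
	h.Write([]byte(s))
	return h.Sum64()
}

// suffix hashes the prefix to pick a bucket, hashes the bucket's
// counter together with the prefix (so different prefixes walk
// different sequences), and base-27 encodes the result truncated to
// n characters.
func (a *bucketAllocator) suffix(prefix string, n int) string {
	b := hash64(prefix) % uint64(len(a.next))
	if a.next[b] == a.limit[b] {
		a.limit[b] += a.reserveSize // stand-in for an etcd round trip
	}
	v := hash64(fmt.Sprintf("%s/%d", prefix, a.next[b]))
	a.next[b]++
	out := make([]byte, n)
	for i := range out {
		out[i] = alphabet[v%27]
		v /= 27
	}
	return string(out)
}

func main() {
	a := newBucketAllocator(64, 100)
	fmt.Println("rs-" + a.suffix("rs-", 5))
	fmt.Println("rs-" + a.suffix("rs-", 5))
}
```

Because a counter, not chance, drives each bucket's sequence, a much larger fraction of the suffix space can be consumed before failures begin.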

thockin commented 1 year ago

> Increasing the character set would help too (could add some capital letters) -- problem is I'm not sure if uppercase is permitted for every collection.

Upper-case is not permitted, nor is punctuation (other than "-") in most names. Adding back all of the vowel-ish characters only quadruples the space, while adding one extra character multiplies it by 27 :)

> it is more complex and we don't need it for another couple orders of magnitude of cluster size.
>
> To get bigger than this, we simply need to make the space of possible suffixes bigger, or make some other drastic change.

> even simpler, make generateName use at least 5 characters, but up to 8 if the space is available.

For a given prefix, this yields (worst case) 14M possible names, but only if your prefix is >58 characters long. For the vast majority of prefixes (anecdotally), this yields > 280 billion possible names.

If that isn't enough, allowing up to 10 characters makes ~205 TRILLION names possible.

cici37 commented 1 year ago

/triage accepted

thockin commented 1 year ago

I posted #116430 to see how bad it would be.

kaovilai commented 1 year ago

Adding that we are seeing this collision happening and would love for retry to happen.

https://issues.redhat.com/browse/OADP-1912

sseago commented 1 year ago

I hit this several times when I was creating a few thousand (30k total, if I remember correctly) k8s resources from a bash script for scale testing as well.

qiuming-best commented 1 year ago

I've done a test of the String(n int) string function; below are statistics from one run, including the effect of increasing the length of the random alphanumeric string:

| suffix length | calls | repeats | repeat rate |
| --- | --- | --- | --- |
| 5 chars | 1k | 0 | 0.00% |
| 5 chars | 5k | 1 | 0.02% |
| 5 chars | 10k | 5 | 0.05% |
| 5 chars | 100k | 315 | 0.32% |
| 5 chars | 1000k | 34012 | 3.40% |
| 6 chars | 1k | 0 | 0.00% |
| 6 chars | 5k | 0 | 0.00% |
| 6 chars | 10k | 0 | 0.00% |
| 6 chars | 100k | 6 | 0.01% |
| 6 chars | 1000k | 1330 | 0.13% |

The effect is most obvious when increasing the length of the random alphanumeric string (that adjustment has to be made upstream, and it may not be possible or could be very slow to land); retrying when a collision happens also works (an adjustment in Velero).

thockin commented 9 months ago

Recapping, almost a year later (sorry):

Some downstream issues I reported last year, all of which still exist:

- https://github.com/rancher/rancher/issues/40852
- https://github.com/hashicorp/terraform-provider-kubernetes/issues/2037
- https://github.com/gocrane/crane/issues/716
- https://github.com/telepresenceio/telepresence/issues/3063

An alternate path (https://github.com/kubernetes/kubernetes/pull/116430#discussion_r1133011818) is to make generateName take an optional suffix indicating number of random characters (e.g. name-%%%%%%%%%%). This doesn't really fix downstreams - they will break once replicaset controller (or whatever they depend on) changes from 5 to any other number.

At what point do we embrace the ethos of "we told you not to do that"?