reside-ic / ids

:information_source: Simple random identifiers
https://reside-ic.github.io/ids
Other
93 stars 3 forks source link

proquint() generates duplicate ids with relatively small n #8

Open Raesu opened 3 years ago

Raesu commented 3 years ago

If I'm not mistaken, running proquint with the default argument n_words=2L should be able to generate 2^32 (4,294,967,296) unique identifiers.

When I run proquint with just n=121969, I consistently get duplicates. You can test this with the below:

proquint(121969) |> unique() |> length() which in my case is always 2 to 3 less than 121,969. Am I missing something with how this should work?

richfitz commented 3 years ago

This was a bit alarming, but it turns out that it's correct. The proquint method runs by selecting n_words * n 16 bit integers, then converting them into "words". This should give a keyspace of 2^32 (~4 billion) as you say above.

Skipping out any code in the package, and looking at selecting ~121k 32 bit integers, and looking for duplicates:

set.seed(1)
n <- 121969
y <- replicate(100, sum(duplicated(sample(2^32, n, replace = TRUE))))
table(y)
#  0  1  2  3  4  5  6  7 
# 16 35 23 18  4  2  1  1 

which is consistent with what I saw from proquint based on your example.

Using the "birthday paradox distribution", we only need to draw 77k samples to have a 50% chance of a collision:

qbirthday(classes = 2^32)
# [1] 77164

@r-ash found another way of looking at this problem here - the expected number of collisions can be computed by the probability that the i'th term appears twice (ignoring triple-collisions)

2^32 * dbinom(2, n, 1 / 2^32)
# [1] 1.731782
mean(y)
# [1] 1.74

Typically when I've needed large keyspaces I've used ids::random_id which uses by default 16 bytes (so 2^128). That breaks qbirthday but gets us into healthy territory (based on log-log extrapolation I make it that we'd need to draw ~10^19 samples to have a 50% chance of duplicate).

Raesu commented 3 years ago

Ok, good to know this isn't a bug. It would be nice if ids::proquint had an argument unique where it resamples any duplicates before returning the vector. For now I solved my problem by using n=3. I would use ids::random_id but find proquints to be fun.