Open Raesu opened 3 years ago
This was a bit alarming, but it turns out that it's correct. The proquint method runs by selecting n_words * n
16 bit integers, then converting them into "words". This should give a keyspace of 2^32 (~4 billion) as you say above.
Skipping out any code in the package, and looking at selecting ~121k 32 bit integers, and looking for duplicates:
set.seed(1)
n <- 121969
y <- replicate(100, sum(duplicated(sample(2^32, n, replace = TRUE))))
table(y)
# 0 1 2 3 4 5 6 7
# 16 35 23 18 4 2 1 1
which is consistent with what I saw from proquint based on your example.
Using the "birthday paradox distribution", we only need to draw 77k samples to have a 50% chance of a collision:
qbirthday(classes = 2^32)
# [1] 77164
@r-ash found another way of looking at this problem here - the expected number of collisions can be computed by the probability that the i'th term appears twice (ignoring triple-collisions)
2^32 * dbinom(2, n, 1 / 2^32)
# [1] 1.731782
mean(y)
# [1] 1.74
Typically when I've needed large keyspaces I've used ids::random_id
which uses by default 16 bytes (so 2^128). That breaks qbirthday but gets us into healthy territory (based on log-log extrapolation I make it that we'd need to draw ~10^19 samples to have a 50% chance of duplicate).
Ok, good to know this isn't a bug. It would be nice if ids::proquint
had an argument unique
where it resamples any duplicates before returning the vector. For now I solved my problem by using n=3
. I would use ids::random_id
but find proquints to be fun.
If I'm not mistaken, running
proquint
with the default argumentn_words=2L
should be able to generate 2^32 (4,294,967,296) unique identifiers.When I run
proquint
with justn=121969
, I consistently get duplicates. You can test this with the below:proquint(121969) |> unique() |> length()
which in my case is always 2 to 3 less than 121,969. Am I missing something with how this should work?