Closed JeffreyBenjaminBrown closed 7 years ago
Another way of thinking about it is that there is a 50% or greater chance of collision above 2.4 million total IDs:
> base <- 64
> digits <- 7
> 1.17 * sqrt(base ^ digits)
[1] 2453667.84
This was good enough for a personal knowledge base (e.g. my PKB has less than 100k atoms), but probably not good enough for networked knowledge bases. 9 digits would allow about 1.6 billion atoms before a 50% chance of collision, and would have the advantage of an easy 10-digit prefix (ID + space) in wiki views.
I have just changed the ID character set to Base-62 (excluding '-' and '_'), so this might be a good time to switch to longer IDs, as well. Using 16 digits, we would go from IDs like this:
N3gQEU6 NaxrqBQ 4K2isWR
to IDs like this:
I0UYRnGJfKTmETh8 OZ4LoCF6vGWjeQ2e KsZns10BC5mXU6J4
They're definitely not friendly to type, and they take up a lot of screen real estate when printed. However, you could mint around 200 trillion IDs before reaching that 50% chance of collision. They would be half as long as UUIDs, and more compact.
This is done; we now use copy-and-paste friendly Base-62 identifiers of 16 characters each. The probability of collision is very, very low.
Currently nodes are addressed using a random 7-character string from a 64 character alphabet. Merging two graphs induces some probability of an address conflict.
Suppose two graphs to be merged each had 1 million nodes. Let "k" = 64^7 = 4.4 * 10^12 be the number of possible node addresses.
The chance that any two nodes have the same address is exactly 1/k. The chance that node n1 from graph g1 matches one of the nodes in g2 is roughly one million times 1/k, or 10^6 / k. Roughly^2, then, the chance that some node in g1 matches some node in g2 is 10^12 / k, or about 1 in 4.4!
If you add a character to the length of the addresses, the likelihood of a collision falls by 1/64 (and not, as I hoped before working it out (see addendum) by 1/64^2).
If you add three characters, the likelihood of collision falls to less than one in a million.
I have not tested how the problem scales when you increase the number of nodes in the two mindmaps; I am guessing going from a million to ten million makes a collision a hundred times more likely, but without doing the math I can imagine how it might be ten thousand times.
--- Addendum: Why adding a character drops the likelihood by only 1/64, not 1/64^2 --- If you add a character, then k = 281 * 10**12, because