synchrony / smsn

Semantic Synchrony. An experiment in cognitive and sensory augmentation.
Other
178 stars 15 forks source link

Node address collision probability #36

Closed JeffreyBenjaminBrown closed 7 years ago

JeffreyBenjaminBrown commented 7 years ago

Currently nodes are addressed using a random 7-character string from a 64 character alphabet. Merging two graphs induces some probability of an address conflict.

Suppose two graphs to be merged each had 1 million nodes. Let "k" = 64^7 = 4.4 * 10^12 be the number of possible node addresses.

The chance that any two nodes have the same address is exactly 1/k. The chance that node n1 from graph g1 matches one of the nodes in g2 is roughly one million times 1/k, or 10^6 / k. Roughly^2, then, the chance that some node in g1 matches some node in g2 is 10^12 / k, or about 1 in 4.4!

If you add a character to the length of the addresses, the likelihood of a collision falls by 1/64 (and not, as I hoped before working it out (see addendum) by 1/64^2).

If you add three characters, the likelihood of collision falls to less than one in a million.

4.4 642 563.2 4.4 * 64*2 18022.4 4.4 64**3 1153433.6

I have not tested how the problem scales when you increase the number of nodes in the two mindmaps; I am guessing going from a million to ten million makes a collision a hundred times more likely, but without doing the math I can imagine how it might be ten thousand times.

--- Addendum: Why adding a character drops the likelihood by only 1/64, not 1/64^2 --- If you add a character, then k = 281 * 10**12, because

648 / (1012) 281.474976710656 }}} · :Kp6QNqq: suppose again they have 1 million nodes in their graph · :Y1HjCFw: the change that any two nodes have the same address is exactly 1/k. the chance that node n1 from graph g1 matches one of the nodes in g2 is roughly 10^6/k = 1/10^6. Roughly^2, then, the chance that some node in g1 matches some node in g2 is 10^12 / k = 1 / 281 · :uReh8YR: 281 / 4.4 is about 64 · :mcwQEtW: thus adding a digit, the likelihood of a collision falls by 64 (not 64^2)

  • :-bbnF0c: by the time you've added three characters, the likelihood of a collision between two million-node graphs falls to less than a one in a million
joshsh commented 7 years ago

Another way of thinking about it is that there is a 50% or greater chance of collision above 2.4 million total IDs:

> base <- 64
> digits <- 7
> 1.17 * sqrt(base ^ digits)
[1] 2453667.84

This was good enough for a personal knowledge base (e.g. my PKB has less than 100k atoms), but probably not good enough for networked knowledge bases. 9 digits would allow about 1.6 billion atoms before a 50% chance of collision, and would have the advantage of an easy 10-digit prefix (ID + space) in wiki views.

joshsh commented 7 years ago

I have just changed the ID character set to Base-62 (excluding '-' and '_'), so this might be a good time to switch to longer IDs, as well. Using 16 digits, we would go from IDs like this:

N3gQEU6 NaxrqBQ 4K2isWR

to IDs like this:

I0UYRnGJfKTmETh8 OZ4LoCF6vGWjeQ2e KsZns10BC5mXU6J4

They're definitely not friendly to type, and they take up a lot of screen real estate when printed. However, you could mint around 200 trillion IDs before reaching that 50% chance of collision. They would be half as long as UUIDs, and more compact.

joshsh commented 7 years ago

This is done; we now use copy-and-paste friendly Base-62 identifiers of 16 characters each. The probability of collision is very, very low.