paralleldrive / cuid2

Next generation guids. Secure, collision-resistant ids optimized for horizontal scaling and performance.
MIT License

Fingerprints for non-JS ports #27

Closed mplanchard closed 1 year ago

mplanchard commented 1 year ago

Working on adding cuid2 to the Rust cuid port, and trying to figure out how to do the fingerprint.

The JS version is a hash of:

In Rust, we don't have anything like the global object in node or the window object in the browser. So far, I've got:

That gives different fingerprints for different processes & threads generating CUIDs on the same system, but doesn't guarantee anything across systems.

It looks like the Python port uses the system hostname, but that would reduce portability and prevents compiling the Rust to target-independent WASM.

One option that springs to mind is environment variables: the specific env var keys and values available to the process are likely to vary a fair bit across systems. On Docker, this will include the HOSTNAME env var, which is generally set to the container ID. This is what I'm defaulting to for the moment, but I'd be curious to hear your thoughts.
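For illustration, here is a minimal sketch of the env-var idea using only the Rust standard library. It is not the code from the Rust cuid port: the function names `fingerprint_from_pairs` and `env_fingerprint` are hypothetical, and `DefaultHasher` is a stand-in for whatever hash the real implementation uses.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical helper: hash a set of key/value pairs into a 64-bit value.
/// The pairs are sorted first so the result does not depend on iteration order.
fn fingerprint_from_pairs<I>(pairs: I) -> u64
where
    I: IntoIterator<Item = (String, String)>,
{
    let mut sorted: Vec<(String, String)> = pairs.into_iter().collect();
    sorted.sort();

    // Assumption: DefaultHasher stands in for the real hash function.
    let mut hasher = DefaultHasher::new();
    sorted.hash(&mut hasher);
    hasher.finish()
}

/// Hypothetical helper: fingerprint derived from the process environment.
/// `vars_os` is used instead of `vars` so non-Unicode values do not panic.
fn env_fingerprint() -> u64 {
    fingerprint_from_pairs(std::env::vars_os().map(|(k, v)| {
        (
            k.to_string_lossy().into_owned(),
            v.to_string_lossy().into_owned(),
        )
    }))
}
```

Two processes with identical environments would get identical values from this sketch, so it would only ever be one input among several.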

We could also just rely on the random number, process ID, thread ID, and the hash entropy.

ericelliott commented 1 year ago

Be careful with env vars. How will those be allocated across different environments?

It is generally OK if these values CAN collide across hosts, as long as that is unlikely. In CUID, I often used multiple sources of host entropy to create fingerprints less likely to collide.

mplanchard commented 1 year ago

Hmm, I guess whether env vars are appropriate would depend on what the purpose of the fingerprint portion of the CUID is and when it's intended to vary.

My assumption is that it should be as unique as possible for any given "instance" of a process/thread producing CUIDs. So if I have 10 machines each running 10 Docker containers, with each container spinning up 2 processes with 2 threads each, I'd expect we'd want 10 * 10 * 2 * 2 = 400 unique fingerprints going into the CUIDs, to help ensure that no two instances can ever generate duplicate IDs.

My worry with just including (random number + proc ID + thread ID) + hash_entropy is that the (random number + proc ID + thread ID) seems quite likely to overlap eventually given enough systems. The added entropy from the hash function plus the additional entropy in the CUID inputs may be enough to take care of it, but it seems like it'd be safer to try to include something more system-specific. That said, it turns out env vars aren't available in WASM builds anyway, so that rules them out, unless I use them on non-WASM builds and fall back to something else for WASM.

mplanchard commented 1 year ago

Experimentally, it seems like the random data plus proc and thread IDs will probably generally be sufficient. Can update later if it isn't.
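As a rough sketch of what "random data plus proc and thread IDs" could look like with only the standard library (this is not the code from the Rust cuid port): the names `random_u64` and `fingerprint` are hypothetical, `DefaultHasher` is a stand-in for the real hash, and the randomness is borrowed from `RandomState`'s random seeding rather than a proper RNG crate. Whether `std::process::id()` is usable on a particular WASM target is not checked here.

```rust
use std::collections::hash_map::{DefaultHasher, RandomState};
use std::hash::{BuildHasher, Hash, Hasher};

/// Hypothetical helper: obtain a pseudo-random u64 without external crates.
/// Assumption: each `RandomState::new()` carries different randomly seeded
/// keys, so finishing an empty hasher yields a value that varies per call.
fn random_u64() -> u64 {
    RandomState::new().build_hasher().finish()
}

/// Hypothetical helper: mix a random value, the process ID, and the
/// current thread ID into one 64-bit fingerprint.
fn fingerprint() -> u64 {
    // Assumption: DefaultHasher stands in for the real hash function.
    let mut hasher = DefaultHasher::new();
    random_u64().hash(&mut hasher);
    std::process::id().hash(&mut hasher);
    std::thread::current().id().hash(&mut hasher);
    hasher.finish()
}
```

Because the random component is drawn on every call, a caller would presumably compute this once per thread and cache it, instead of calling `fingerprint()` for each ID.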