orefalo commented 1 year ago

I read this article and it left me perplex... https://medium.com/javascript-scene/identity-crisis-how-modern-applications-generate-unique-ids-39562736f557

Would be great to put this generator in context..

How does it compare to nanoid, ulid, uuidv7 or cuid?

what's the randomness size? is it sortable?

something like this.. https://blog.daveallie.com/ulid-primary-keys

Can you help add a vertical to this project? https://github.com/adileo/awesome-identifiers

ericelliott commented 1 year ago

Cuid2 is good when you need something secure, extremely collision resistant, and your system is distributed or decentralized (e.g. you want to be able to create records with ids on the client side), or you are building software that may need to scale horizontally).

Its security-by-default approach and flexible parameters (e.g. custom length and entropy) make it a great default choice for just about any identifier use-case.

uuid (v4, the one everyone uses) is weak and collision-prone.
Nanoid has weaker anti-collision and security guarantees than Cuid2 because you have to trust the host’s web crypto API implementation of the “cryptographically secure” random number generator. Until 2018, Chromium’s wasn’t, and it’s not alone.
ulid and uuidv7 (and the now deprecated Cuid v1) leak creation times, which may be a security hazard. They do it to make a faster db primary key. Monotonically sortable ids are faster to look up. Cuid2 intentionally trades a little lookup performance for security. Over 11 year’s experience using Cuid taught us we were optimizing for a little performance gain at the expense of a lot of security.

Total entropy of Cuid2 is 36^(n-1)*26 where n defaults to 24 but can be as much as 32. You can roughly estimate how many ids you can generate before reaching 50% chance of collision with: sqrt(36^(n-1)*26).

V4 UUID has 5.316912E+36 max entropy, but actually uses far less in real life (closer to 5 digits of a Cuid2) because most random generators used to make them are not random enough, leading to billions of real-life id collisions. Cuid2 has up to 4.57458E+49, roughly 13 orders of magnitude larger and defaults to 1.62155E+37 (still many times larger than uuid) and you’ll need to generate ~4.0268498e+18 to reach 50% chance of collision.

Cuid2 also has much better guarantees of using the full entropy range evenly, reducing chances of collision much better than any normal identifier, even when those identifiers use web crypto APIs to source their random entropy. We test this very thoroughly (in parallel on different CPU cores to simulate distributed use) and produce histograms and randograms with each release, to prove it.

It is not sortable, for security reasons. No sortable id is safe from leaking user data.

orefalo commented 1 year ago

Hi Eric, Thanks for the prompt reply,

first, let me share my status,

Could you help me fill the CUID2 column?

orefalo commented 1 year ago

Now, let me try to get context about your comments: I must point that as engineers, we work on facts, not opinions or open statements.

"Nanoid has weaker anti-collision and security guarantees than Cuid2 because you have to trust the host’s web crypto API implementation of the “cryptographically secure” random number generator. Until 2018, Chromium’s wasn’t, and it’s not alone."

this is a valid statement, however the creation of ids typically happens serverside and not on the browser (which would be an even bigger security risk). How does cuid2 solve for this issue?

ulid and uuidv7 (and the now deprecated Cuid v1) leak creation times, which may be a security hazard. They do it to make a faster db primary key. Monotonically sortable ids are faster to look up. Cuid2 intentionally trades a little lookup performance for security. Over 11 year’s experience using Cuid taught us we were optimizing for a little performance gain at the expense of a lot of security.

If I may, can you please be more specific... is CUID2 k-sortable? if not, you are solving for a different use case, let's be fair and point it out in the table above.

"Cuid2 also has much better guarantees of using the full entropy range evenly, reducing chances of collision much better than any normal identifier, even when those identifiers use web crypto APIs to source their random entropy. We test this very thoroughly (in parallel on different CPU cores to simulate distributed use) and produce histograms and randograms with each release, to prove it."

This is great. Have you done a distribution analysis of the other GUID? We know UUID <v5 poorly performs in that matter, what about v7, Ulid, nanoid and co? Trying to get a scientific approach

If you could help me fill the table above, I think it will bring clear perspective of benefits and shortcoming for each GUID. There is rarely one way to skin a cat in Architecture. It all depends on requirements.

orefalo commented 1 year ago

"Total entropy of Cuid2 is 36^(n-1)26 where n defaults to 24 but can be as much as 32. You can roughly estimate how many ids you can generate before reaching 50% chance of collision with: [sqrt(36^(n-1)26)](https://en.wikipedia.org/wiki/Birthday_problem#Square_approximation).

V4 UUID has 5.316912E+36 max entropy, but actually uses far less in real life (closer to 5 digits of a Cuid2) because most random generators used to make them are not random enough, leading to billions of real-life id collisions. Cuid2 has up to 4.57458E+49, roughly 13 orders of magnitude larger and defaults to 1.62155E+37 (still many times larger than uuid) and you’ll need to generate ~4.0268498e+18 to reach 50% chance of collision."

Not an expert, let's make it simple : How many bit of randomness do you use? What's the size of an id generated by cuid2?

Cuid2 intentionally trades a little lookup performance for security. Over 11 year’s experience using Cuid taught us we were optimizing for a little performance gain at the expense of a lot of security.

Mind providing more details. What use case are you referencing? Problem statement, issue and how did you fix it? More generally, why do you see an id generation date as a security leak issue?

orefalo commented 1 year ago

also, feel free to point any issues in the table above. just trying to capture the proper dimensions and data.

ericelliott commented 1 year ago

I don’t have time to address everything right now, but I can start by dispelling a couple misconceptions.

I must point that as engineers, we work on facts, not opinions or open statements.

Yes. When I say that Nanoid, and all other PSRNG-based identifiers that offload their security to the browser or OS are less secure and collision-resistant than Cuid2, that is a fact, not an opinion, because as I mentioned before, browsers and operating systems don’t have a great track record of generating truly random entropy - even in their crypto APIs which are supposed to be CSPRNGs. Also, those entropy sources use pools that can run out of entropy if you hit them with too many requests too fast, which is why Nanoid keeps a buffer. It may be possible to outrun Nanoid’s buffer or run a denial of service attack on the OS random entropy.

Poor random number generators caused the id collisions that inspired the original Cuid in the first place.

Why not use window.crypto to get random values? You have to trust 2 things:

The entropy source. Entropy is notoriously hard to do right. I don’t trust any single source. Hence, the entropy smoothie in Cuid2.
The hash function. Generally, a CSPRNG (“Cryptographically Secure” Pseudorandom Number Generator) needs to hash its entropy, both for security, and for random distribution reasons in order to prevent collisions. But RNGs are generally assumed to be safe and can have bugs for years before they are properly reported and patched. See https://bugs.chromium.org/p/chromium/issues/detail?id=552749 for example.

this is a valid statement, however the creation of ids typically happens serverside and not on the browser (which would be an even bigger security risk). How does cuid2 solve for this issue?

It isn’t true that “the creation of ids typically happens on the server side”. Every major app I have been involved in since 2012 generates a ton of ids client-side, thanks to the distributed capabilities provided by Cuid (and we plan to continue with Cuid2). Since I mentor and advise teams, that’s hundreds of apps. This makes our apps feel more responsive to users because db round-trips are reduced, and allows users to easily work offline.

Generating ids with Cuid2 is not a security risk in the browser because the chances of accidentally guessing a valid existing id are astronomically low, and you can’t update a record you don’t have access to even if you know a valid id (assuming standard server authentication/authorization). Since the data in an id is not used for anything other than unique identification and the character set is limited, it does not pose an insertion attack threat.

Unlike Nanoid, Cuid2 does not need to trust the random numbers it uses because it mixes them with other, guaranteed sources of entropy, and relies on the NIST-standard Sha3 hashing algorithm for random distribution, instead: the most secure, tiny cryptographically secure hashing algorithm we could find.

ericelliott commented 1 year ago

If I may, can you please be more specific... is CUID2 k-sortable? if not, you are solving for a different use case, let's be fair and point it out in the table above.

No, Cuid2 is not k-sortable, but that’s feature, not a bug.

All k-sortable ids are also insecure because they leak timestamp, creation order, or both. Cuid v1 is k-sortable. I realize now, that was a mistake. With modern systems, it is possible to generate k-sortable indexes for any field of any type where you need to optimize lookup performance. Given that, and the ease of adding createdAt db fields, there is no valid reason to leak this information to clients via ids.

ericelliott commented 1 year ago

Not an expert, let's make it simple : How many bit of randomness do you use? What's the size of an id generated by cuid2?

By default, 124 bits (log2(36^24-1*26)). Length is a parameter in Cuid2, so you can adjust for your particular needs.

ericelliott commented 1 year ago

Mind providing more details. What use case are you referencing? Problem statement, issue and how did you fix it? More generally, why do you see an id generation date as a security leak issue?

Sure. One obvious example: In all of Europe, personally identifiable information is protected, including things like user age. If a user has an id with a timestamp, and that id is publicly exposed in any way (either visible in the UX or not) you’re leaking PII, and could be on the hook in a lawsuit.

Worse though, there are now many Web3 protocols where user money is at risk, and monotonically increasing counters or timestamp ids dramatically reduce the unguessable entropy available in an id, by making a huge chunk of the id guessable. This could make things like secret links (e.g. things like Zoom classroom or private company meeting links) brute-forceable.

In Web3, secret links for airdrop participants, tokens for session authentication, etc, could all be put at risk by monotonic ids.

NONE of the monotonic (k-sortable) ids are usable for Web3 applications. This was a big motivator for the creation of Cuid2. We needed to unlock these use-cases.

orefalo commented 1 year ago

Hi, well this is great education, thank you. I am starting to see the details of your implementation and the differences with others I started filling/completing the xls, it's still a WIP. I know your time is valuable, if I may, I will just ask a few question for the items I am not sure about. As far as K-sorted or not, IMHO security goes beyond an ID generator: it's an architecture and should be left for the designers to understand +/-.

Below my current draft.

ericelliott commented 1 year ago

Suggestion: Add a field in the security section called “Cryptographically Secure”.

Cuid2’s output comes from an audited implementation of a NIST-standardized cryptographically secure hashing algorithm (Sha3). It also uses multiple, independent sources of entropy, uses the most input entropy, and fully utilizes every bit of its available input and output entropy. As far as I’m aware, Cuid2 makes the strongest security guarantees out of all the options you have listed.

Anything using the Web crypto API should be at least partially cryptographically secure, though that is not guaranteed, so that might need an asterisk and a link to one or more of the bugs against web crypto API randomness. Nanoid falls in this category. xid gets partial credit for using a CSPRNG, but only for a small part of the id. It also uses a known-insecure hash (MD5) to hash the hostname, meaning it gets zero points for securely concealing entropy sources.

Anything not using well-known cryptographically secure random number or hashing algorithm would fail this test. Examples: include all uuid versions, Cuid V1, etc, though some uuid v4 implementations use a CSPRNG, putting them in the same category as Nanoid.

ericelliott commented 1 year ago

As far as K-sorted or not, IMHO security goes beyond an ID generator: it's an architecture and should be left for the designers to understand +/-.

If your ids leak information, your ids are insecure by definition. See the Principle of Least Privilege.

Because there are other ways to get the performance and features people are looking for in k-sortable ids, they should always use those alternatives, instead, or never expose the id to a client (which will prevent the client from being capable of fetching by id). In my experience, the latter option is infeasible, and should not be attempted.

orefalo commented 1 year ago

"If your ids leak information, your ids are insecure by definition. See the Principle of Least Privilege."

This sounds like a wiiiide open statement, Eric...

In reality, it narrows to data classification.

In that regard, I haven't heard of any lawsuit against a date attached to an id generation and if it even happens… Then, IMHO the ID itself would fall into the same category.

Anyways, thank you for all these details - I will update the xls tomorrow for your feedback

Great discussion, learned many things.

ericelliott commented 1 year ago

Here are all the answers:

Cuid2 features:

Best for: Cryptographically secure, completely opaque, great for distributed systems, flexible length.
Example: ff2249p15jdlryxw2lhv54g8
Number range: -
Bits: Variable. Default: 124, Max: 171
GB: Binary: 16 Strings: 20
Generation speed: Fast - Synchronous, offline (For reference, a very slow id would be a db auto-increment, which requires both asynchronous I/O and distributed coordination to prevent collisions, a slow id would be one that is distributed-safe but requires centralized coordination and async round-trips, e.g. Snowflake. All distributed, strongly unique IDs without central coordination would be “Fast”. Something like Nanoid might be “Very fast” because it also uses a very fast algorithm).
Require coordination: No
Lifespan: N/A

Bit Distribution

Time: 64 bits (ms resolution, hashed into 119 bits)
Sequence/Counter: 64 bits (hashed into 119 bits)
Machine/Process ID: Host fingerprint: Unlimited, ~320 bits (hashed into 119 bits)
Randomness: 64+6 bits + ~124 bit salt (hashed into 119 bits)

DB

Suggested DB Type: varchar(24)
DB Performance impact: About the same as Nanoid, which should have a high impact (because it is a long, random, non-k-sortable id), Cuid would have medium impact because it’s monotonically sortable but also longer than auto-increment, and a varchar instead of a BigInt. Example of low impact would be a monotonically sortable BigInt.
Entropy Sources:
- Initial letter
- Current system time
- Pseudorandom values
- Host fingerprint (Default, configurable: hashed global keys)
- Session counter
Collision risks: Cryptographically secure hash. 50% chance at ~sqrt(36^n-1 * 26) (Default: ~4.7383813e+18 generated ids)
K-sortable: No
Distributed: yes
Cursor-based support is basically the same as K-sortable, so no.
Needs to be configured: No, but it is an option.

Security

Cryptographically secure?: Yes. (Audited NIST standard Sha3)
Is the implementation only based on randomness?: No
Predictability of the ID: Very low (for reference, Nanoid would be low unless there was a serious bug in the CSPRNG, V4 UUID would be high (deterministic next-id calculations are possible in old implementations of JavaScript and PHP, at least) unless it used CSPRNG, Auto-increment would be very high)
Leaks count: No
Leaks machine/process info: No
Leaks the date: No

Shortcomings

Slightly larger than Nanoid (still <5k, gzipped). Intentionally slower than other options in favor of better security.

ericelliott commented 1 year ago

In that regard, I haven't heard of any lawsuit against a date attached to an id generation and if it even happens… Then, IMHO the ID itself would fall into the same category.

Yes, it happens. e.g. duplicate SIM assignments, etc.

There have been € 375,777,219 in fines for "Insufficient technical and organizational measures to ensure information security" (270 fines, averaging ~€ 1.4m) - and leaking data in ids would qualify.

The id would not qualify by itself because a secure id by itself doesn't leak PII.

ericelliott commented 1 year ago

This sounds like a wiiiide open statement, Eric...

The problem with assuming that ids that leak information are safe is that it leaves data wide open for attackers to exploit. In the early days when you could have an application for years that would grow very slowly, you could incrementally add security after launching an application, but for the last decade or so, we've been living in a world where our applications need to be secure by default because attackers are much more organized, and there are a lot more of them. There are professional hacking groups who dedicate insane resources to exploiting every nugget of information your application exposes - so by default, an application should not expose any more information than it needs to do the job.

This is especially critical in Web3, in-app-purchase models, or e-commerce, and it's only a matter of time before all new applications are Web3 applications.

I made the wrong choice when I built Cuid v1 a decade ago. We need to shut down leaking ids for the same reasons that browsers shut down support for unsecured http websites. Of course there are use-cases where it won't matter, but there are many more where it will, and the consequences if you make the wrong choice can be severe (e.g. having significant assets stolen by attackers).

orefalo commented 1 year ago

Morning Eric, All good points - not disagreeing.

I couldn't find the detail of the lawsuit, but from the title, it's about a complete Security Architecture, not a (non PII) date leakage.

Putting your statement in context, assume your ulid user id is 01arz3ndektsv4rrffq69g5fav and that related records were creates within a 1ms range, I still have to go through a 2^80 combinations for each ms/record type. Definitely less than 2^119, but we are not solving for the same problem. Again, it's only an id.. and so many other elements are to be considered to assess proper protection.

Anyway, I am not trying to make a recommendation, rather to provide context - so that engineers can make their own choices based on their NFR

Thanks for all the details and adjustments, Please find the latest draft in attachment - will run it by other authors to get their feedback/corrections.

It was a pleasure. What I am really seeing now is the trade-off between performance and security.

ericelliott commented 1 year ago

Thanks for building the table. Please ping back when you publish.

ericelliott commented 1 year ago

Putting your statement in context, assume your ulid user id is 01arz3ndektsv4rrffq69g5fav and that related records were creates within a 1ms range, I still have to go through a 2^80 combinations for each ms/record type

If the creation time itself is a user privacy violation (e.g. exact timestamp of a medical event, bank transaction, etc) it’s already game over. You leaked sensitive data already.
ulid is not guaranteed cryptographically secure, which means there may be a deterministic algorithm to guess the next random number, which reduces time to crack the next valid id to an 0(1) operation. This isn’t theoretical. Attacks like this actually happened.

More on 2: Imagine you just saw a ulid on a blockchain representing a purchase of a $350k NFT (say, tokenized real-estate). This transaction represents the life’s savings of the user. Now, the next step after purchase is to open a DM encrypted with the purchaser’s public key. This DM contains a token that will transfer governance permissions to the buyer. The token is a ulid, and happens to have been generated as the very next id in the sequence after the transaction record for the purchase. A hacker sees the purchase id on the blockchain, uses the O(1) next id attack to get the token, transfers governance to their own wallet, and then immediately swaps the token with an open order book offer and pockets the money. The buyer is out $300k.

This is a fictional scenario but the mechanics are real. We saw real attacks like this one happen to v4 uuid users, and a user reported a next random number vulnerability in Cuid v1. Luckily, the other entropy sources in Cuid combined were enough to prevent any real losses that we’re aware of, but that was too close for comfort.

Security in layers! In order to be genuinely safe, you need multiple sources of entropy to reduce chances that an attacker can manipulate all of them, and then you need to guard that entropy like Fort Knox to prevent leaking any the attack tools to the attackers in the first place.

ericelliott commented 1 year ago

Again, it's only an id.

The only way to be sure it’s only an id is to hash the entropy with a salted, cryptographically secure hashing algorithm.

orefalo commented 1 year ago

Just did a quick benchmark,

uuidv1 from id128          8,579,550 ops/sec
uuidv4 from id128         20,892,791 ops/sec
uuid v4                   23,112,684 ops/sec
uuid v7                      449,085 ops/sec
nanoid                     5,383,088 ops/sec
cuid                         349,807 ops/sec
cuid2                         51,134 ops/sec
ulid (monotonic)          12,026,748 ops/sec
xid                        3,373,526 ops/sec
ksuid                        541,406 ops/sec

PS: you are missing an index.d.ts

declare module "@paralleldrive/cuid2" {
  export const createId: Function;
}

ericelliott commented 1 year ago

A good balance between fast and slow is a feature of secure hashing algorithms. We are very intentionally NOT the fastest. Password hashing algorithms loop the hash thousands of times on purpose just to make them slower. Likewise, slowing down hashing to make it expensive to attack is exactly what provides security for the Bitcoin network.

If it were really cheap to run Cuid2, you could just rent a huge cluster of GPUs for cheap and run distributed collision and timing attacks to defeat the security.

Cuid2 is fast enough to be unnoticeable to users, and much faster than a server round-trip, but too slow to run brute force, timing, or statistical analysis attacks to guess existing ids or recover entropy from a hash.

If you want to generate lots of ids during render animation frames (weird but ok I guess?) use ulid. If you want to build secure software that still performs really well on the RAIL performance model, use Cuid2.

orefalo commented 1 year ago

Right, somehow I was expecting this comment... ;-)

We all know that the more salt iterations, the better hash "balance" unfortunately puts cuid2 on the far edge.

This finding tend to confirm my assumptions - there is no magic.

it's performance vs security
an id itself is nothing in view of a proper security architecture.

In the end it's all about finding the proper tradeoffs.

ericelliott commented 1 year ago

Knowing the facts is a good start. Knowing which facts are important is even better.

I’ve been building software all my life, and I’ve never seen a situation where the generation of globally unique ids was my bottleneck. Usually, in situations where speed is the most important thing, I don’t actually need global uniqueness, and a simple incrementing counter (count++) is good enough and would dramatically out-perform all of the above options.

If I encountered a situation where id generation was a bottleneck, I’d ask myself some questions:

“Does knowing the sequence or exact timestamp violate user privacy or pose a security threat?” Yes? Nanoid

“Is local, single machine uniqueness good enough?” Incrementing counter.

No? Ulid

bennadel commented 1 year ago

Really fascinating conversation, thanks for all the back-and-forth. I arrived here after I saw that the CUIDv2 wasn't time-sorted like CUIDv1 was; so I was coming in here to ask about that. And, it seems that you have both covered parts of this in great depth (essentially that leaking time-of-creation is in-and-of-itself a security issue). My main concern over this was that if I'm going to use CUIDv2 as a the primary-key in a database, then the clustered index will have to be re-jiggered every time a new record is added. I've never had a non-increasing key before, so I was concerned over the performance of the index maintenance. In your experience, is this index overhead essentially not a problem?

xaevik commented 1 year ago

@bennadel

It is not recommended to use a non-sortable value (CUIDv2) for a clustered index (primary key) as the data in that index is assumed to be sortable. You can do it and some people do (especially with UUIDv4) but it will have some side-effects and increase maintenance.

For our table designs (SQL Server) we usually stick to INT or BIGINT for the Primary Key and store the auxiliary identifier in a nonclustered index or nonclustered columnstore index.

bennadel commented 1 year ago

@xaevik in the context of CUIDv2, however, I am not sure that it would be possible to have a sortable clustered index since, it seems, from what I understand, that the "distribution" of key generation across devices is part of the requirement.

xaevik commented 1 year ago

@bennadel you're correct, it wouldn't which is why I mentioned that it was not recommended to use it for a primary key due to the very nature of how it is generated. Hence why for us we keep the primary key a sortable int or bigint value and the real identifier as a separate column with a nonclustered index.

bennadel commented 1 year ago

@xaevik so, what you're saying is basically that you generate the CUID as the thing the "user sees", but then use an INT/BIGINT behind the scenes when actually doing database IO?

xaevik commented 1 year ago

@bennadel correct, under this pattern the primary key is the identifier for the row, not the entity in question. A quick example would be this:

column	type	is_primary
row_id	int	X
entity_id	varchar(32)
created_at	int

So row_id would be treated as the primary key which is a clustered index. It is strictly used internally for maintaining any foreign key relationships, sorting and join operations. Then the CUIDv2 value would be stored under entity_id where then you can attach a nonclustered index. You also have the added benefit of not needing a unique index for entity_id as well.

You could (if using SQL Server) also create a nonclustered columnstore index instead of a nonclustered index which contains entity_id and created_at if you wish to tie those two datapoints together.

Tables should have as few indexes as possible and should allow for natural sorting (e.g,. the primary key). Any business logic sorting should be done through code rather than through the database. The only exceptions to that would be pagination where server-side sorting is required.

bennadel commented 1 year ago

@xaevik that makes sense to me (though it seems like entity_id could have a UNIQUE index as well, without any harm (unless you're saying that has a performance impact).

xaevik commented 1 year ago

@bennadel I would say that the creation of a unique index is highly subjective to the implementation. There is a performance impact because each time a record is created, hard-deleted, or updated that constraint must be checked and if the dataset is large, it will slow down.

Since CUIDv2 supports variable length creation, the smaller the value, the greater chance of collision. I would say (and @ericelliott can correct me if I'm wrong) you could forgo needing a unique index so long as you are enforcing that all generated CUIDv2 values in a specific table are of all the same length and are between 24 (the default) and 32 characters.

If you plan on being variable or generating smaller (say between 4 to 6 characters) values, then a unique index may be advised so long as you understand the performance impact should the dataset comprise of 10s of thousands or 100s of thousands of rows.

ericelliott commented 1 year ago

TL;DR

Int and BigInt auto-increment indexes are not viable or performant for horizontal growth.

Uniqueness breaks and needs to be coordinated over the network as soon as you add a second db.
Migrating data breaks foreign key relations if you need to switch id types to move to another database

Clustered indexes are not good for record identifiers. If you need to sort by creation time/order, use indexed createdAt fields, instead.

Clustered indexes by definition leak data about the sequence of entity creation, which is not good for user privacy or security if the index gets exposed. If the index does not get exposed, you can't use the index at all for API access, and its only use-case is internal table joins.
By contrast, createdAt fields can be exposed to the API, and if you leak them, those leaks are intentional, rather than a side-effect of leaking the data in a record id.

The performance impact of non-monotonic ids is wildly overstated. Because monotonically increasing ids require coordination when you scale horizontally, that introduces much bigger performance problems at scale than id fragmentation. This deserves a deeper dive.

Id fragmentation requires a little more disk space - 200gb+ for dbs with billions of records. But at scale, we use cloud-native databases designed to perform well in the terabyte range, and often the entire db is stored in-memory with stateless random-access lookup performance.

The concept of "defragging" a database is from the era of physical spinning disk databases where random seeks could significantly degrade performance, causing more physically spinning of a disk to find the correct data. Even in the worst case scenario, the complexity of looking up a uniformly random fragmented db is O(log n) (dbs use structures like b-trees for indexed lookups), meaning if you want to search a billion records, you're looking at ~9 hops, completing in <1ms.

Even in a maximally fragmented index, a keyed singleton lookup over 1 billion records can complete in <1ms on a modern system. When you add massively parallel scale to this, one or two of those lookups becomes a network routing lookup, which you can't avoid even with monotonic ids. That network timing jitter completely obliterates the difference between fragmented and non-fragmented singleton lookups.

The actual db index operations themselves might be 10x faster on a defragged db, but when you factor in network operations, and the cost of coordinating a guaranteed-sequential id across multiple db hosts, db unique constraints across coordinated hosts could actually add 100ms or more to real-world times on inserts.

This network performance cost for unique constraints is a huge reason we created the original Cuid in the first place. Don't undo the benefits by falling back on db auto-increments for the sake of negligible index performance gains.

Remember, our worst case for singleton operations is O(log n). So what kinds of operations suffer from a non-sequential id? Paged, sorted operations. Stuff like "fetch me 100000 records, sorted by id". That would be noticeably impacted, but how often do you think you'd need to sort by id for a randomly generated id? I have never done it.

In real life, we need those kinds of operations for things like "fetch the most recent 100 posts in my timeline" (createdAt) or "sort by cost" (non-unique integer fields in the db).

The reason sorting by id is a good thing is to improve the performance of db insertion, but as I've already explained, coordinating unique constraints for monotonically increasing ids across horizontally scaled hosts is more expensive than just taking the index perf hit on a random id.

Theory is good, but everybody has a plan until they get punched in the face by the real world.

ericelliott commented 1 year ago

Here's how I have done it for the last x years:

Use a cloud-native db solution. Stop trying to wrangle your own SQL servers.
All cloud-native solutions I know of have their own internal keys for things like indexing and table joins. Never use or expose them in your actual application. Use your Cuid2 instead.
All cloud-native solutions I know of allow you to create indexes for any field you want in your table (sometimes up to 3-5 such indexes). Add them if you need to improve performance for lookups by that field.

IGassmann commented 1 year ago

@ericelliott thanks for sharing this. This would make a great blog post that would attract a lot of interest IMO.

ericelliott commented 1 year ago

I added the following summary of Cuid2 differentiating features to the top of the README:

Cuid2 is:

Secure: It's not feasible to guess the next id, existing valid ids, or learn anything about the referenced data from the id. Cuid2 uses multiple, independent entropy sources and hashes them with a security-audited, NIST-standard cryptographically secure hashing algorithm (Sha3).
Collision resistant: It's extremely unlikely to generate the same id twice (by default, you'd need to generate roughly 4,000,000,000,000,000,000 ids (sqrt(36^(24-1) * 26) = 4.0268498e+18) to reach 50% chance of collision.
Horizontally scalable: Generate ids on multiple machines without coordination.
Offline-compatible: Generate ids without a network connection.
URL and name-friendly: No special characters.
Fast and convenient: No async operations. Won't introduce user-noticeable delays. Less than 5k, gzipped.
But not too fast: If you can hash too quickly you can launch parallel attacks to find duplicates or break entropy-hiding. For unique ids, the fastest runner loses the security race.

@orefalo If you publish your table, please open an issue with a link so we can reference it. I'm leaving this issue open so people can easily find and learn from the discussion.

ericelliott commented 1 year ago

@orefalo Mistakes in the table:

Number of characters: Cuid2 has variable number of characters, and defaults to 24.
All of the synchronous id-generators should be at least "Fast", but xid, ulid, nanoid, etc. probably all qualify as "Very fast" because they don't do any extra hashing rounds for security (which also hurts their cryptographic security).
On DB performance metrics, all of the non-cryptographically unique standards (including sequential db identifiers) have very high performance cost for distributed dbs because they need to coordinate unique inserts across the network. All auto-incrementing ids have very low single-db instance costs, which is maybe OK for a blog or a non-networked application, but typically unusable for web-scale applications.
"Entropy sources" should probably fall under security.
Collision risks on same unit should be "Very Low" for both Cuid and Cuid2 - default "High" for UUID (can be mitigated by using a CSPRNG, but the assumption that it does is unsafe). Also high for any other random options that don't use CSPRNG and/or multiple entropy sources.
Series, BigSeries, and V4 UUID all require coordination when scaling horizontally.
Series and BigSeries support cursor-based pagination. Cuid2 does not.
V4 UUID has weak cryptographic security (except in certain implementations which use cryptographically secure random - LOTS of them don't).
Cuid v1 has low id predictability. We upgraded to CSRNG years ago after reports that users could deterministically predict random numbers with the broken RNG in Chrome. Even with the old version, generation would need to happen on the same machine in a tight loop to predict another valid id feasibly, and there were zero reports of that ever happening, even though it was one of the most popular id generators for JavaScript, and used as default ids in some database systems.
Auto-incrementing db ids are not "simple" to scale. (See "benefits"). In fact, under "Shortcomings", I'd mention that they're hard to scale and migrate.

orefalo commented 1 year ago

Thanks for all these comments, I am updating the table, will publish and convert to mark down to ease pull requests moving forward. Will also flip the axis, to help column sort.

@ericelliott, I believe there is still a discrepancy in the CUID2 column. "BIT distribution" shows how bits are layout at rest. cuid2 is 124bit long, 119 of those are used for randomness (via a multitude of hash...etc), what about the remaining 5 bits?

ericelliott commented 1 year ago

The hash portion is 23 characters by default (119 bits). The initial character is also a source of random entropy, but is not part of the hash entropy. It is 6 bits. (That’s the “+6” part of the random entropy). The entire id requires 125 bits to represent. 125 - 119 = 6. Sorry if I made that confusing.

Looks like I made a mistake with the binary representation, though. It should say default: 125, max: 166 (assuming people use a varchar to encode it).

It is not clear what you mean by “at rest”. I interpreted those rows as “how much entropy is available for this information?” e.g. a timestamp represented by 32 bits would have less entropy than one represented by 64 bits (as Javascript’s ms-resolution timestamp is).

This makes more sense to me than trying to show how much of the actual id string is dedicated to that information, because that fails to accurately represent the strength of the entropy sources. Entropy strength is a major advantage that Cuid2 has over other id generators. If you just list all Cuid2’s entropy as “N/A”, that paints a very misleading picture.

Alternatively, you could include entropy bit count in “entropy sources”, hopefully in the security section.

orefalo commented 1 year ago

Sorry if I was unclear,

"a little diagram is easier than a long speech" Napoleon.

For instance: or

Which for cuid2 translates into, 6 bits for "Machine Id / Process Id / Prefix Id" 119 for entropy (aka randomness)

I am aware that the comparison table is not perfect as it is: beyond the size of the ids and their bit particularities lies the algorithm used to generate them. That's something I am not quite capturing at the moment, as it's really implementation specific.

Getting there, will share the updated table soon,

bennadel commented 1 year ago

@ericelliott I much appreciate the deep-dive on database performance. As much as I love to write SQL, I don't know all that much about the low-level database mechanics, so I just try to piece together what I've been told over the years. Sounds like some of my concerns are not longer relevant on modern implementations of database software.

ericelliott commented 1 year ago

@orefalo Yes - I would call that "id layout", maybe, and those diagrams are way better than text. What did you use to draw those diagrams? They're perfect. 💯

If you're going to do it that way, adding entropy source bits in the security section would be good, to avoid people getting the wrong idea about security from the id layout.

ericelliott commented 1 year ago

@orefalo @bennadel I added a detailed comparison section to the documentation, including a brief discussion on the issue of K-Sortable ids. Please read it and let me know your thoughts.

orefalo commented 1 year ago

Hi Eric,

I take no credits, I used my best friend google to find the gfx above, not planning to go that far ;-/ Markdown won't make it, the data is too wide. Here is the latest version https://docs.google.com/spreadsheets/d/1ZsXBH0z7GOJv3N69QbEDKBZt8IeE0CfRI9vhihV4teo/edit?usp=sharing

I also became more knowledgeable about pseudo random generators. Turns out it's a platform/implementation concern - On Linux for instance, the kernel gathers noisy data from various devices and transfers them to an internal pool of entropy. In that context... why is cuid2 more secure, than say a nanoid which uses /opt/random?

I read your statement, I like the intro, I agree with some, disagree on many - mainly because it's open statements with no facts (trying to help).

I am not aware of any breakthrough in B-tree indexing: distributed or not, the problem remains: db indexes perform better on ordered data. I don't see how pushing the problem to a cloud vendor can be used as an argument.
Regarding k-order, I am used to seeing application queries such as :
- get last orders
- get last payments
- get last interactions
- get changes of the day for backup
- ....etc
- We even use time to bound query results and optimize IO.

All of the above are time bounded, aren't they?

I agree with you - encoding a date in an identifier implicitly makes it less secure because it reduces the entropy size... for the benefit of speeding indexes. But wait... adding an indexed creation_date column to a DB also has a cost. Especially in distributed cloud db → compute, storage, io, network calls.

The conclusion is that picking a GUID is heavily dependent on context and NFR. It is not a one size fit all. (IMHO!)

Cheers,

PS: I was thinking distributed DB and the time bounded access patterns above
k-ordered identifiers are starting to make sense - they implicitly optimize data affinity is distributed db

ericelliott commented 1 year ago

Cuid2 is more secure than the web crypto API because it does not need to trust the random entropy source (which is not guaranteed to be good or have enough entropy), nor does it need to trust the browser’s choice of hashing algorithm (which has historically been proven not cryptographically secure in Chronium and took 3 years to fix after being reported).

We don’t need to trust those things because we supply our own uncorrelated entropy, mixing it with browser random entropy (moving to cryptographically secure rng as well, very soon), and we hash with a security audited, NIST standard cryptographically secure hashing algorithm.

I am not aware of any breakthrough in B-tree indexing: distributed or not, the problem remains

That simply is not true. (Technical deep dive here).

The breakthrough is not in the indexing algorithm, but the hardware it runs on. Every cloud provider I know of uses solid state machines with RAM-like or actual RAM access times, sometimes on custom server hardware with custom integrated wide-bandwidth, ultra low latency data buses. Even on a billion records in the most fragmented index, you’re looking at 9 hops worst case and sub-millisecond seek times. Concerns about fragmented index performance are from the days of spinning disks. Those days are long gone in cloud infrastructure.

As I mentioned in the documentation, sequential ids can even cause real performance degredation because sequential inserts cause the btree to unbalance quickly, causing hotspots and rebalancing churn.

Those same hotspots can also cause concurrency locks that slow down writes.

Further, NONE of the use-cases you mentioned should be done by id. Ids should be treated as opaque values by applications. What you’re looking for is createdAt.

It costs me less than $50/month to host hundreds of GB cloud databases - which is large enough for hundreds of millions of user records. Adding a createdAt field to every record has virtually zero notable cost.

ericelliott commented 1 year ago

@orefalo The sheets link you shared is private.

ericelliott commented 1 year ago

I posted a Twitter thread with links to a few exploits of the vulnerabilities that motivated Cuid2.

bennadel commented 1 year ago

@ericelliott Thanks for adding the section about sorting, I think it makes sense. The only hesitation that I have is the emphasis on "cloud native" database solutions. Only because, I wonder how many people aren't actually in that world yet, perhaps running their DB and PHP and Apache all on a single box (for example). Though, I suppose, at that volume, maybe it doesn't much matter anyway.

I just think of all those poor companies that tried to do Microservices simply because that's what Netflix was doing; and ended up failing because they didn't actually have the need / technical expertise / same problems. Not throwing shade at microservices (or CUID) - only saying that people often enter into a solution space without actually understanding whether or not it applies to them.

That said, this is all very cool and I intend to look closer at the implementation 💪

xaevik commented 1 year ago

Only because, I wonder how many people aren't actually in that world yet

@bennadel we're in that group, we still run Microsoft SQL Server via PaaS for numerous architectural reasons. That is the reason behind the conversation we had earlier with the row_id design as the primary key which is what cloud architected database solutions do, except that you cannot see it.

bennadel commented 1 year ago

@xaevik 🙊 shhhh, I'm in that world too 😉

paralleldrive / cuid2

Comparison to others? #7

Here are all the answers: