Comparison to others? - Githubissues

paralleldrive / cuid2

Next generation guids. Secure, collision-resistant ids optimized for horizontal scaling and performance.

MIT License

2.67k stars 54 forks source link

Comparison to others? #7

Closed orefalo closed 1 year ago

orefalo commented 1 year ago

I read this article and it left me perplex... https://medium.com/javascript-scene/identity-crisis-how-modern-applications-generate-unique-ids-39562736f557

Would be great to put this generator in context..

How does it compare to nanoid, ulid, uuidv7 or cuid?

what's the randomness size? is it sortable?

something like this.. https://blog.daveallie.com/ulid-primary-keys

Can you help add a vertical to this project? https://github.com/adileo/awesome-identifiers

ivanfilhoz commented 1 year ago

I've just read Eric's article presenting CUID2, and I'm carefully watching how it evolves.

If your app is a blog sitting on a LAMP stack, you're far from needing to bikeshed on ID generation. If you're not horizontally scaling, simply use UUIDv4 and call it a day.

The discussion should assume horizontal scalability is something you care about (although I don't see any drawbacks to using CUID2 anyway, it's even prettier and URL-safe). A few points already made:

Timestamps as a source of entropy end up leaking information unnecessarily;
Any index overhead is negligible compared to the benefit of safely generating your ID directly from your client application, such as a SPA or mobile app. This is faster than any network round-trip, period.
Last but not least -- and that's the research I'm taking as homework for the next week -- the Web Crypto API is not safe enough yet. That is circumvented by the multilayer approach CUID2 uses. @ericelliott, please correct me if I got this part wrong.

ericelliott commented 1 year ago

@bennadel unless you're running it on a 12 year old machine in Grandma’s closet, chances are really good you’re using solid state drives and your hosting costs are in the pennies-per-gigabyte range, and you’re also not dealing with enough rows for sparse rows to be problematic. In other words, the worries about the performance impact of random ids don’t apply to you.

If you’re too small to scale horizontally, random ids won’t be a problem. If you’re big enough to scale horizontally, random ids will make it easier. Either way, in 2023, just use a random id format, and it should be secure by default for the same reason you wouldn’t put up an HTTP server in 2023.

It’s safe to assume:

Even the slowest path will give you good performance.
Auto-increment is not secure
Sequential ids are not secure
You’re not going to get in trouble using Cuid2

Relying on the DB for collision resistance won’t be a good plan because:

It’s not future proof - if you ever DO need to grow - scale horizontally, or upgrade your database - you’ll need to change ids - are you going to lose all your table join references and have a hard time migrating? Or you can use Cuid2 and you won’t need to change them.
It’s slower for clients - you’ll have to wait for server round-trips to create records. That means every insert has a perceptible ~100ms+ delay. Throw in an extra 100 - 200ms if the client is in China and the server is in New York.
Your app will not function offline. No new record creation without complex client-side juggling, introducing lots of complexity, confusion, and potential for bugs.

ericelliott commented 1 year ago

@xaevik

That is the reason behind the conversation we had earlier with the row_id design as the primary key which is what cloud architected database solutions do, except that you cannot see it.

No, it is not, because those systems will have a high degree of horizontal scaling, they can’t just increment a row-counter without slowing down the system dramatically for coordination, so most of them create references in a way that identifies a shard, and then uniquely identifies the new record in that shard. Something like {shardId}{random/sequentialRandom}.

The exact algorithm for both will vary a lot. Because those refs will work differently from one solution to another, your best bet is to use your own ids as the canonical reference so you can easily migrate to other solutions if you need to, and to prevent leaking sensitive information to hackers because those solutions rarely spend much thought on being unguessable (e.g. They may not use cryptographically secure methods or may employ incrementing counter strategies and/or time stamps for collision resistance).

xaevik commented 1 year ago

@ericelliott appreciate the correction on that information, thank you.

ericelliott commented 1 year ago

@ivanfilhoz

If you're not horizontally scaling, simply use UUIDv4 and call it a day.

No. UUIDv4 implementations often use untrustworthy random sources. You should consider it insecure in favor of a format that was implemented from the ground up with cryptographic security in mind. Just use Cuid2.

Your other points are correct:

Timestamps as a source of entropy end up leaking information unnecessarily;

Any index overhead is negligible compared to the benefit of safely generating your ID directly from your client application, such as a SPA or mobile app. This is faster than any network round-trip, period.

Last but not least -- and that's the research I'm taking as homework for the next week -- the Web Crypto API is not safe enough yet. That is circumvented by the multilayer approach CUID2 uses. @ericelliott, please correct me if I got this part wrong.

The web crypto API still hasn't replaced SHA-1 even though it's unsafe and even though SHA-3 has been available for years. SHA-512 is in the spec, but I prefer to use a hash formulation that has not been called into question. By the time the standards got into the spec, they were already obsolete! 🤦‍♂️

I have other concerns about the web crypto API that are detailed in the docs.

orefalo commented 1 year ago

Public link. ;-) https://docs.google.com/spreadsheets/d/1ZsXBH0z7GOJv3N69QbEDKBZt8IeE0CfRI9vhihV4teo/edit?usp=sharing sorry about that, need to incorporate changes based on the interactions above. ETA EOD

ericelliott commented 1 year ago

Closing because we have added a comparison section to the documentation.

@orefalo Ping here if you publish your chart publicly, write a blog post, etc.

orefalo commented 1 year ago

Hi there, I have since published a medium post on the topic.

https://medium.com/@orefalo_66733/globally-unique-identifier-a-fair-comparison-12b114c78ead

I will eventually add new implementations I found in the wild...

Was a great discussion and learned a bunch, thank you

CanRau commented 1 year ago

I really enjoyed reading the discussion, although auto incrementing DB IDs are still confusing as the Planetscale team some of which have created Vitess (and still maintain) for YouTube and worked at GitHub still recommend auto incrementing bigint IDs

References

https://github.com/planetscale/discussion/discussions/385#discussioncomment-4899847
https://twitter.com/mscccc/status/1632766611642802176?s=20 @msccc Mike Coutermarsch works at Planetscale
https://twitter.com/LewisJEllis/status/1633222835090132992
here they explain drawbacks of sequences and also how they work in Vitess https://vitess.io/docs/15.0/reference/features/vitess-sequences/

Also because of my math skills I'm still not sure how to calculate collision probability, tried

Math.sqrt(36^(23-1)*26) === 23.15167380558045 // default length

Math.sqrt(36^(4-1)*26) === 10.295630140987
// according to readme "50% chance of collision after generating only ~1101 ids"

in JS, though I've no idea (yet) how to interpret those results 🥲

Update thanks to Bing Chat I think I got the function "converted" to JS

Math.sqrt(Math.pow(36, 24 - 1) * 26) === 4026849817824303000
Math.sqrt(Math.pow(36, 4 - 1) * 26) === 1101.3882149360415

which looks pretty much like the examples in the readme 🤓

Math.sqrt(Math.pow(36, 8 - 1) * 26) === 1427399.1265571099

would mean generating ~1427399 so ~1.4 million ids would mean 50% chance of collision, am I getting this right?

So comparing those results to the nanoid calculator (removed - & _ from alphabet) nanoid seems to be able to generate way more ids with only 1% of collision probability?

Sorry for my math skills, trying to understand this better, also not necessarily saying cuid2 is bad for this reason, I'm just curious and this is all very interesting to me.

ericelliott commented 1 year ago

First, the Vitess documentation points out the security considerations of auto-increment ids, but they are understating the dangers.

The documentation also mentions that a random id solution is a viable alternative. You can think of Cuid2 as a random generator solution that doesn't trust the built in browser random features, and instead uses a cryptographically secure hashing algorithm to hash entropy that has been proven to work at massive production scales for more than a decade.

Interestingly, PlanetScale's API uses NanoId, which is random, not auto-increment. Cuid2 is a good replacement for NanoId in most situations.

As for the JavaScript attempt to figure out the math, what you want is Math.sqrt(36**(4-1)*26) which evaluates to 1101.3882149360415. JavaScript uses the ^ operator to mean bitwise XOR, not exponent.

The default length calculation would be Math.sqrt(36**(24-1)*26) (note: 24-1, not 23-1, which makes a big difference, and the ** operator), and the result of that calculation is 4026849817824303000 or 4.0268498e+18 as I mention in the documentation.

So comparing those results to the nanoid calculator (removed - & _ from alphabet) nanoid seems to be able to generate way more ids with only 1% of collision probability?

The probabilities you're looking at in the NanoId calculator are not comparable to the birthday paradox calculation we're using here. To do apples-to-apples comparisons between NanoId and Cuid2 entropy, use the formula Math.sqrt(36**l) for NanoId calculations, where l is your desired character length.

It also sounds like you're misinterpreting the results. One of the differences between Cuid2 and NanoId is the character set. NanoId does not guarantee identifier-safe ids by default, meaning you'd need to pass it a custom alphabet to use it in some of the places where Cuids can be safely used (e.g. html classnames), which will change the number of characters you'll need to reach your collision-resistance goals. Since NanoId uses a 64-character alphabet by default, and we use a 36 character alphabet by default, and NanoId doesn't protect against invalid identifier creation, it can store more entropy in fewer characters than Cuid2, but that comes with its own set of tradeoffs.

Keep in mind that both NanoId and Cuid are customizable, which allows you to adjust those calculations for your particular use-case. If you could use NanoId at 8 characters, and you want a Cuid2 with the same or better odds, use Cuid2 with 10 characters instead of 8, which will give you 3x better collision resistance + an identifier-safe id structure + the assurance of using well tested entropy with a cryptographically-secure hashing algorithm (the same one used to secure the Ethereum blockchain). You don't get those assurances with NanoId.

orefalo commented 1 year ago

Allow me to provide some context for the benefit of anyone reading this,

Random generators

You have two choices, each with their own risks and benefits, starting with considerations such as how many people have reviewed the implementation to how widely it is used or is the random implementation cryptography secure.

Trust the cuid2 implementation of randoms.
Trust the runtime implementation of randoms.

The bottom line is to be aware and not to trust blindly.

Alphabet

As rightfully pointed, this is a non-issue because it's customizable. Nanoid was built to support JavaScript and browsers, so its encoding is optimized for use with URLs. This is not a bad default choice, in my opinion.

GUID vs. sequences

When dealing with OLTP (Online Transaction Processing) applications, you typically need to focus on security, resilience, scalability, and speed. However, when building an OLAP (Online Analytical Processing) database, the above requirements are less critical.

Sequences are easy to predict and are well indexed.
GUIDs add randomness to minimize the attack surface.

I found this article fascinating: https://brandur.org/nanoglyphs/026-ids.

Once you have considered these factors, the question becomes whether to use ordered or pure random GUIDs.

There are security aspects to consider, such as

entropy (randomness) quality, we mentioned it
dispersion: cuid2 claims to be better in these areas. no evidences.
opacity of the id - if you can live with a "visible" ID generation date or not

There is also the persistent side to consider. Since IDs are typically used as indexes, their randomness can have a negative impact on DB storage and IOs (the Write Ahead Log effect). Eric mentioned that there were "innovations" in data-centers, but please realize that IO and storage do have an impact on cost.

In conclusion

Security always comes at a cost.
Be aware of your specific requirements and choose an implementation accordingly

ericelliott commented 1 year ago

dispersion: cuid2 claims to be better in these areas. no evidences.

We prove dispersion with multiple kinds of tests with every commit:

histogram
character frequency test
Visual randograms to spot check for patterns (there shouldn't be patterns - should look like random noise)
collision tests - we test id generation in parallel with more than 11 million ids on every commit, in total, we have tested several billion ids for collisions with zero detections

Those tests and evidence are available in the code, and the most recent randogram and histogram are present at the end of the documentation. AFAIK it's the most thorough test suite in the id-gen ecosystem.

As for our entropy sources, they're strengthened versions of the entropy used successfully for more than a decade in Cuid v1, for which there were zero confirmed collision reports in spite of being used in production in products with hundreds of millions of users.

I addressed performance considerations earlier in this thread. TL;DR, it's not a big factor with today's tech.

orefalo commented 1 year ago

Morning,

Sorry, I didn't mean to say there were no evidences of cuid2 distributions: there are, and in great details.

There is a lack of evidence that the distribution is better than other implementations. Put differently, is it a valid selling point?

Eric, please realize I am not trying to criticize your GUID implementation: just trying to provide the proper 360.

entropy sources - fine, inline with my comment above. Not sure if there are metrics which can be used to assess quality. In any case, I would assume Linux would have better quality as it can tap low level random sources.
performance - fine. Indexing a pure random guid translates into IOs and storage, which translates into cost. This is not limited to cuid2, any random will have the same effect. Therefore, an alternative design pattern consists in using a time based guid to minimize this effect. Of course, just like anything in Architecture - it comes with side effects.

Sincerely,

ericelliott commented 1 year ago

Both of those assertions are false. In the section on NanoId & Ulid we link to a browser bug that completely invalidated cryptographic security in the "cryptographically secure" random number generator in Chromium - meaning ALL guids/ulids/nanoids generated before the patch are potentially vulnerable to attack. Had they been generated with Cuid2, they would not be. Similarly, there are links in the docs to actual vulnerabilities caused by V4 UUIDs. We don't just show that other id implementations have poor entropy - we link to real examples of exploits in the wild.

As for the assertion that there is "a cost" associated with random ids, while on the surface, obviously theoretically true, is still a very misleading statement which turns out to be false when you consider what happens in real life.

The cost of random indexes - in the age when cloud db storage is pennies per GB and cloud db servers have tens or hundreds of GB memory per host by default - is so tiny as to be completely drowned out by other choices, such as whether or not to optimize your images.

Relative to the large cost of dealing with auto-incrementing ids or worrying about security fallout, the cost slips into positive ROI very easily.

You mention the small cost of random without also mentioning the large cost of the alternatives. That doesn't make sense.

To put costs into perspective, for a user table, you'll pay an engineer more to research the performance issue than you would pay in the performance difference each month even if every internet user in the world used your application.

ericelliott commented 1 year ago

@orefalo I added more context and links in the "why" section of the documentation to make the evidence that Cuid2 is needed more obvious.

orefalo commented 1 year ago

Morning Eric,

I read the "why" section and can't disagree.

Yet, there is something I am missing....

Why do you put so much accent on the browser for ID generation? Channels are insecure, aren't they?

I always generate IDs close to the systems of records, server side.

Authentication -> In an IAM or IDM
Session -> At the edge, typically in the web application
Records id -> in the microservice or the db (depending on)

Now if 99% of ID generation use cases are server side, wouldn't it be better to leverage a CSPRNG who has access to proven sources of entropy? (IO, kernel events..etc)? rather than an opaque implementation coming from the ages of cuid1?

Regarding the "cost" of indexing random UUID in RDBMS, I did an extensive research and couldn't find any scientific assessment. I mean people talk about the negative impact of the WAL effect, but the articles are operational not comparative.

Maybe there is space to think about a new cuid3 implementation with support for:

an optional variable size timestamp [0-x]
a plugin architecture for entropy generators

Cheers,

ericelliott commented 1 year ago

Why do you put so much accent on the browser for ID generation? Channels are insecure, aren't they?

You are right not to trust the client to generate ids that would have system security implications. However, a lot of id generation is safe in the client, assuming all other ids are generated with Cuid2, and assuming that the client is not already compromised by a malicious third party (in which case, that particular user's security already has bigger problems).

Now if 99% of ID generation use cases are server side

This is not a safe assumption, for all the reasons mentioned in the Cuid2 documentation. Some of those requirements include offline operation, which is a non-starter if your id generation relies on server-side access. Remember, there is more than one kind of application in the world, and we still need dependable, distributed, collision-safe, opaque id generation. The use-cases are many. (e.g. most consumer productivity applications).

You are right that authentication / session ids usually need to be handled by the service doing the authenticating, but increasingly, that is happening on the client side, too (see decentralized Web3 authentication where the authority is a self-sovereign, client-side wallet).

wouldn't it be better to leverage a CSPRNG who has access to proven sources of entropy? (IO, kernel events..etc)?

No. Where do you think the buggy browser implementations of CSPRNGs got their entropy? We don't completely trust these sources, or we wouldn't need Cuid2 at all. We'd just put a friendly wrapper around crypto.getRandomValues() and call it a day (and indeed, that's exactly what NanoID does - this is addressed in the documentation).

rather than an opaque implementation coming from the ages of cuid1?

The entropy sources are not opaque, only their final combination as an identifier is opaque, and their reliability as a unique combination of entropy has been proven for more than a decade in some of the largest scale applications in the world, with hundreds of millions of users. In crypto security, age and proven reliability are requirements. It takes time and eyeballs to prove algorithm security.

Those entropy sources are so trusted that have been tested in dozens of id generating libraries for many years and proposed in the UUID V6 - V8 specifications.

Maybe there is space to think about a new cuid3 implementation with support for:

an optional variable size timestamp [0-x]

No need. If you don't need the opaque security features of Cuid2, what you need is probably UUID V7 (which is basically a standardized version of Cuid v1). It is my assertion that in most cases, this would be a mistake, for the same reason running an http server would be a mistake.

a plugin architecture for entropy generators

Cuid2 already accepts an entropy generator as a parameter. You can pass in any compliant random() function implementation.

Thanks for your thoughtful questions.

orefalo commented 1 year ago

"safe" in terms of security has a different meaning that "safe" in terms of entropy

Listen, this is comment 68 on a closed topic. While I don't share all of your arguments and positions, I mostly agree and now clearly see where cuid2 fits in.

More importantly, I appreciate the time you took to clarify my doubts and provide references whenever possible. Through this conversation and my own research on GUID, I've learned a lot.

Best regards,

orefalo commented 1 year ago

I like facts, I hate open statements..

Here is an article which seems to be confirming my suspicions about fully random UID generators.

https://www.cybertec-postgresql.com/en/unexpected-downsides-of-uuid-keys-in-postgresql/

Enjoy the read,

ericelliott commented 1 year ago

@orefalo That article is basically reiterating what I said above about monotonically increasing ids. The issue has been well known to me for well over a decade, and inspired the design of the original Cuid (And subsequently UUIDV7 which cites me and CUID as a reference). More than a decade of practical use-cases has taught me that it is almost never a real problem in practice, and if and only if it becomes one you can simply designate a different primary key for your database (for table joins/db references) and continue to only expose opaque ids to clients.

Don't burn down your front door because unlocking it takes a few extra seconds.

orefalo commented 1 year ago

You know how much I enjoy arguing with you, right ;-)

I agree with the early points, the article however goes on and provides some interesting insights about the side effects of using UUID as primary keys - namely, page grow which with data locality seems to have a significant negative impact.

On these points, I am not arguing - in the end it's security vs speed.

However, I find the conclusion even more stunning "so if you want to use UUID’s, try to use a sequential variant. UUID v7 is a good option"

Seems like I am not the only one to believe a cuid3 implementation with a optional variable size timestamp [0-x], would solve this long running debate.

As far as using two keys - one external, the other internal ; it doesn't solve the problem as both will need to be indexed to avoid (distributed) table scans.

Sweet dreams

ericelliott commented 1 year ago

I agree with the early points, the article however goes on and provides some interesting insights about the side effects of using UUID as primary keys - namely, page grow which with data locality seems to have a significant negative impact.

We have different definitions of "significant". I've already explained this several times, but I'll dig into more depth:

The example is manually enumerating 10 million records. This is guaranteed to hit every record in the table every time you run the query. In a production system where that would be a frequent query, you can just cache the count and get sub-millisecond response times on this lookup. Problem solved without messing with IDs or messing with the DB configuration, but read on... there's more.
Typical DB usage does not hit all your records, or even most of your records, but only a few records or a few dozen records in the case of complex queries. Assume your typical query hits 100 records. The UUID V4 version would complete in 0.014 ms - this is not a bottleneck for typical applications and in the last decade, I have never seen this issue cause an actual problem in an app I worked on (I did before SSDs, on 15-year-old-hardware, but we just don't see those problems anymore). As I mentioned before, if you actually DO encounter this issue - just make a monotonic, indexed createdAt field or internal monotonic db primary key for joins that you don't expose to the public, and call it a day. In other words, in the very unlikely case that this is a problem for you, there are good workarounds that don't involve tossing security out the window.
You can fine-tune a DB to optimize for SSD cloud services with hundreds of GB of memory (which is the typical case for scalable cloud situations). Such systems need to tune the db so shared_buffers is ~25% of the total system RAM (which may be dozens or hundreds of GB in a typical cloud DB situation, and app developers are rarely exposed to those parameters at all because services like FaunaDB do it for you). For reference, that is typically enough to store small record data for every user of the internet in the world for less than $100/month. DB admins who are actually worried about these performance implications can compile a specialized build of the db to optimize or even auto-tune the page size to better utilize available RAM and reduce shared_buffers lookup. These optimizations would be good for DB performance whether you use random ids or not.
If you don't have enough data or traffic to be worried about automated horizontal scaling (see the point above), you don't have enough data or traffic to worry about this issue at all, because you will likely not be hitting 10 million+records on a regular basis. Typical primary key lookups (even for random ids) use a b-tree lookup, which is a O(n) operation that completes in sub-millisecond times. In other words, what you're looking at in that comparison is a worst-case lookup - not typical db usage.
If you actually DO have usage that hits tens or hundreds of millions of records per operation, you probably don't want a DB, you want a cloud GPU or TPU cluster.

Mic drop