graphile / crystal

🔮 Graphile's Crystal Monorepo; home to Grafast, PostGraphile, pg-introspection, pg-sql2 and much more!
https://graphile.org/
Other
12.59k stars 569 forks source link

UUID v1 is somewhat predictable #578

Closed chadfurman closed 7 years ago

chadfurman commented 7 years ago

https://www.graphile.org/postgraphile/postgresql-schema-design/#postgres-roles

We suggest v1mc in the docs, though as you can see here uuid1 is much less random than uuid4 https://www.sohamkamani.com/blog/2016/10/05/uuid1-vs-uuid4/

benjie commented 7 years ago

Yes, this is why we chose UUID v1. It's extremely expensive for postgres to write an id that is smaller than an existing id because it has to split the pages - this reduces write performance significantly. Because UUIDv1 is prefixed with the date stamp it is (generally) an increasing value and thus fairly safe to use as the PK unless you're expecting hundreds or thousands of inserts per second (and even in those cases the pages are probably still in memory so splitting them is less expensive than cold reads).

Neither option we recommend give you cryptographically secure random IDs; for that I'd recommend using a second column with a unique index and a simple int/bigint primary key, then don't expose the primary key. We do not provide any help in achieving this however 😞

benjie commented 7 years ago

https://www.google.co.uk/amp/www.starkandwayne.com/blog/uuid-primary-keys-in-postgresql/amp/

chadfurman commented 7 years ago

@benjie some of the comments to that article mention a couple of points: 1) Postgres doesn't have clustered indexes, so it's actually not a concern (https://www.postgresql.org/message-id/20151222124018.bee10b60b3d9b58d7b3a1839%40potentialtech.com) 2) There is a CLUSTER command, but it's a one-time sort and so not relevant: https://www.postgresql.org/message-id/30034.1472566554%40sss.pgh.pa.us 3) if you really really really want sequenced IDs (the whole point of a UUID imho is to not have sequential, guessable identifiers), you can mix the randomness of UUIDv4 with a sequential epoch prefix (someone is making a postgres extension to do this)

benjie commented 7 years ago

Interesting, thanks for researching this!

Personally I don't buy that the whole point of UUID is to not have sequential guessable identifiers - the reason UUIDv1 ids include the mac address is to make sure they don't class with other machines acting in good faith; it's not to protect against nefarious actors. UUIDv4 on the other hand is intended to give a truly random non-guessable identifier.

I've filed an issue against the docs site to research this further as Bill Moran does not cite any authoritative sources backing his claims; however his reasoning seems sound and so with the lack of any further evidence I'd definitely support someone removing that particular paragraph.

If you're looking for high-performance seemingly random IDs another option is a Feistel cipher - this will result in less work to generate (no randomness required), less storage required, and faster performance due to significantly smaller indexes (even with bigint you're looking at 8bytes per key vs 16 with UUID). You could do this as part of id generation if you believe that there won't be the fragmentation cost discussed above, or as part of encoding the input/output of the id field otherwise. However UUIDv4 is probably more standard so if you don't require this level of performance it's probably safe to go with that - it's also much more secure due to the amount of randomness contained in the generated identifiers.

chadfurman commented 7 years ago

I'm going to go with uuid v4, myself. Thank you for the talk on this. Will leave this open for you to finish as you see fit.

benjie commented 7 years ago

Closing as it's open on the website repo

hi2u commented 6 years ago

I'm not commenting on this issue as a whole. I just came across this thread from searching for some general UUID stuff.

@benjie But just FYI on UUIDv1 in general, it's a common misconception that they're sequential. They're not. Yeah there's a timestamp before the MAC address, but it's kind of akin to using ss:hh:mm dd-mm-yyyy format (not sequential) instead of starting with yyyy-mm-dd (sequential).

Generate a bunch of UUIDv1s within a few seconds. You'll see that the first 8 characters (time) move very quickly, and the middle parts (date) are the slower ones.

That's why there's alternatives out there like "comb UUIDs" and "UUIDv6" which are similar to UUIDv1 except they've put things in the right order (year first).

No idea why they did the timestamp backwards for UUIDv1, seems silly to me.

Also I agree that the point of UUIDs has nothing to do with guessability. Especially any version that isn't v4. And even then, known issues in various UUIDv4 generators can make them guessable too.

As far as I've read, the order doesn't matter much in postgres, even UUIDv4 is ok compared to other databases. But sequential still feels like a good idea to me, especially seeing you might be also storing copies of the data outside postgres too.

I've made my own version of UUID that is (I think) is a good mix of v1 + comb + v4.

Consists of timestamp (in the right order, I've even made the date human readable), node (not tied to hardware) and also random at the end. I've heard of UUIDv4 (very) occasionally conflicting, so at least with my version the conflict will happen on the same node immediately and fail early, rather that accepting the new data and finding you have conflicting IDs later on when it comes time to merge.

cymen commented 3 years ago

I also ended up here while searching and I think this article is relevant to the discussion:

https://www.2ndquadrant.com/en/blog/sequential-uuid-generators/

But there are disadvantages too – they may make the access patterns much more random compared to traditional sequential identifiers, cause WAL write amplification etc. So let’s look at an extension generating “sequential” UUIDs, and how it can reduce the negative consequences of using UUIDs.