Support for comb v4 UUIDs

kael-shipman commented 5 years ago

Hey there,

Wondering if you have any thoughts on supporting comb v4 UUIDs. Doesn't seem to be a ton of information out there about this, but there seems to be general agreement that it delivers potentially massive increases in database performance at the relatively minor cost of decreased uniqueness guarantees.

I'd be willing to take a shot at implementing this, but just want to know if it would be a welcome addition or not first.

broofa commented 5 years ago

For posterity:

The Cost of GUIDs as Primary Keys], Jimmy Nilsson, March, 2002 (Original article describing "comb" uuids)
A PHP implementation

After reading Nilsson's article, I don't see an immediate need for "combs" here. In a nutshell:

"Combs" were an optimization for an outdated version (circa ~2002) of a specific DB product, MSFT SQL Server.
~~RFC4122 "Version 1" UUIDs probably serve the same purpose as "combs". Both forms of id feature a timestamp in the low-order bytes, which appears to be Nilsson's key finding.~~ [Edit: Not so much. See felix's comment, below.] (FWIW, RFC4122 was published ~3 years after Nilsson's article, and is heavily influenced by MSFT so it wouldn't surprise me if SQL Server perf were one of the main drivers behind the V1 UUID format.)
~~Most modern DBs support UUIDs natively. It wouldn't surprise me if they're optimized to deal with them efficiently.~~ [Edit: Looks like this may still be an issue, at least with Postgres. See comment below]
~~DB performance just isn't as big an issue as it was 15+ years ago.~~ [Edit: Wrong again. 😛 DB perf is better, but data scale is as big/bigger issue than before]

kael-shipman commented 5 years ago

Very well argued points :).

broofa commented 5 years ago

https://github.com/ramsey/uuid/issues/53#issuecomment-320658465 points to this benchmark that shows a ~30X increase in INSERT time for Postgres 9.6 when using V4 UUIDS.

However, the V1 perf in that test supports my suggestion, above, that V1 UUIDs are an acceptable substitute for combs.

kael-shipman commented 5 years ago

For me, v1 UUIDs pose an issue because of their guessability. I know there's some back-and-forth right now about the benefits (or not) that you get in having randomized, unguessable keys, but at best, it looks like the jury's still out. Thus, for my purposes, the most important criteria are

near-guaranteed unique
unguessable
efficiently indexed
easy-to-generate

V1 UUIDs fail on the second requirement, but hit all the others. V4 UUIDs fail on the third requirement, but hit all the others. CombV4 UUIDs are, to me, the perfect blend.....

But again, I accept that you're not interested in incorporating this functionality. I'll likely extend your library to add it when I get time to do some research on how, but for now, I'll just use standard V4's.

broofa commented 5 years ago

FWIW, this module sets the node randomly. That field will be the same for ids generated by the same process, but you could randomize the value each time if that were a concern, thusly:

> const crypto = require('crypto');
> const uuidv1 = require('uuid/v1');
> 
> uuidv1({node:crypto.randomBytes(6)});

'58cabe40-2672-11e9-9dff-aaffbcc2dc6a'

Edit: This is RFC-compliant, btw. While the node field is nominally supposed to use the host MAC address, the RFC allows for random values as long as the multicast bit is set, which node-uuid does.

felixge commented 4 years ago

I'd like to give some more insights on what's going on with PostgreSQL when indexing UUID values.

Like most relation databases, PostgreSQL defaults to using a B-Tree data structure for indexes. Below is an example B-Tree from the linked article showing integer values being indexed.

Each B-Tree node (the dark grey boxes in the graphic above) in PostgreSQL is a 8 KB page that can contain a variable number of keys. If you're indexing UUIDs, this means you can theoretically cram up to 512 (8192/16) uuids into one node (page). In reality it's quite a bit less, because each entry in the B-Tree is also containing pointers to rows (aka tuples) into the tables (aka heap) and to other B-Tree pages. For simplicity, let's assume 100 UUIDs fit into a page.

Now let's imagine an indexed uuid table with 1M rows. Assuming our leaf nodes are 50% full (i.e. containing 50 UUIDs each), this mean we'll have 20K leaf nodes.

And now, let's say we want to insert 20K new v4 UUIDs. Given the random distribution of those values, on average we'd expect each UUID to be inserted into a different leaf node, so we'd end up with 51 UUIDs on each node. Eventually PostgreSQL has to write this data to disk (via checkpointing, but that's another topic), and this is done by writing out each page that has been changed in full. In our case that's 20K*8KB= 156.25 MB.

You might protest saying that it's really stupid to write the full 8KB of each page, when only 1% of it (~82 bytes) has changed. So yeah, in theory PostgreSQL could make much smaller writes, but in reality that would be even worse because SSD disks have a minimum write size of usually 4KB, and trying to issue a 82 byte write would actually force the SSD to first perform a 4KB read, then update the 82 bytes you're trying to write, and then issue a 4KB write to perform the update, which would likely double the latency of the overall operation.

So as you can see, using random UUIDs is pretty much the worst case scenario for any B-Tree implementation and might cause over 100x write amplification. And from my personal experience running a few TB-sized PostgreSQL instances using v4 UUIDs, I've seen those kind of write amplifications in the wild.

If our indexed values instead would have been sequential, we might simply have allocated 200 (20K/100) new pages as our values wouldn't have been randomly distributed across the nodes. And writing those pages would have only taken 200*8KB = 1.56 MB.

So yeah, while modern DBs might optimize a few things (e.g. storing UUIDs in binary rather than as text), random UUIDs will always present a worst-case scenario for B-Trees.

And sure, some databases (e.g. LevelDB, Cassandra, RocksDB) use LSM-Trees rather than B-Trees, but they're basically just giving you higher write throughput at the expense of read throughput, so they're unlikely to replace B-Trees in the near future. But I don't have much LSM experience, and I'd have to do some more research on how they'd perform with the scenario outlined above.

Oh, and last but not least, here is my attempt to reproduce the benchmark linked above: https://gist.github.com/felixge/64086c200ad6dc03aadd8b2900434c7a . In my testing v4 UUIDs are only 2x slower than v1. I suspect this is because the disk is not a significant bottleneck on my system, and the CPU overheads dominate. We're also inserting as fast as possible, so we probably end up hitting the same leaf nodes more than once between checkpoints, greatly reducing the write amplification. Oh and my shared_buffers are set to 4GB (default is 128MB), so the original benchmark wasn't able to amortize writes across checkpoints much at all.

broofa commented 4 years ago

@felixge Thanks for taking the time to provide such an insightful comment. Very helpful!

Reopening so as to encourage further discussion.

broofa commented 4 years ago

@ramsey points out that my previous comment about v1 serving much the same purpose is mistaken, which I'm happy to concede in light of @felixge's test, above.

That said, my concern with COMBs is that "time-ordered v4 UUIDs" is an oxymoron. Version 4 UUIDS are "meant for generating UUIDs from truly-random or pseudo-random numbers" [rfc]. There is a certain calculus around uniqueness that follows from that, and that consumers of UUIDs have come to expect. That uniqueness contract breaks down when you start fiddling around with the now "random" (in air-quotes) bits.

... and, too, as soon as timestamp information gets encoded into a v4 uuid, someone somewhere is going to write code that relies on that, and then bitch about how actually-random v4 ids are breaking things. Count on it.

Hence, I consider COMBs to be non-standard. As such, I'm reluctant to provide support for them here at this time. Having RFC-compliance be the barrier to entry for code in this module has served this project well over the years.

That said, I get that there are compelling reasons for a new form of ID. I guess my suggestion for moving forward would be to treat COMBs as the different version of UUID that they are and move forward as such:

Put together a working group tasked with revising RFC4122 to add a "Version 6, combined UUID" (There's room in the version field)
In the meantime, create a "working standard" implementation to act as a reference, and that will let people concerned with this issue move forward, starting with a new repo in this org (uuid-v6-experimental?), but possibly moving elsewhere if support for other languages becomes a concern.

I could take a stab at implementing that new module if nobody else is so inclined but, honestly, I think it would be better if someone other than @ctavan or I took ownership of that. The more people invested in this org and this code space, the better.

ramsey commented 4 years ago

someone somewhere is going to write code that relies on that, and then bitch about how actually-random v4 ids are breaking things. Count on it.

In the nearly 7 years this has been supported in ramsey/uuid, I’ve not had a single complaint.

See also http://gh.peabody.io/uuidv6/

I plan to support this recommendation in version 4 of ramsey/uuid. It has been mentioned to the IETF but, without implementations, they didn’t wish to discuss it.

ctavan commented 4 years ago

Thanks for pointing to the uuidv6 draft @ramsey, that sounds really promising! Looks like 1. of @broofa's suggestions has already been done 😉.

I think in the light of producing more implementations to encourage standardization it might be worthwhile to implement it for JavaScript.

As @broofa suggested I think we would be happy to host this new module here in this org. However I would also prefer to release it as a separate module until there are more concrete plans to actually standardize such "v6" uuids.

ctavan commented 4 years ago

Here is already an implementation: https://github.com/aarondcohen/id128/blob/master/src/id/uuid-6.js

broofa commented 4 years ago

See also http://gh.peabody.io/uuidv6/

@ramsey, Interesting article, thanks for sharing! I think it highlights the need for a broader conversation, though. Simply riffing on version 1 is, imho, not the right approach. There is more than one issue with v1 that should be addressed in a proposed v6 format. Specifically:

The 64-bit Gregorian timestamps format is problematic (in JS) and wasteful. It should not be propagated to any new format
clockseq is often underused or not used at all. Allow for a variable width field?
MAC addresses are a non-viable source of uniqueness and any verbiage relating to them should be removed

I welcome a new format but - and I say this with nothing but respect for your work - if we're going to create a v6 UUID spec I think we can, collectively, do better.

ramsey commented 4 years ago

These are all historical artifacts of the GUID format, originally created by Microsoft. I think if you were to try deviating from the RFC 4122 variant enough, you might consider creating a new variant.

ramsey commented 4 years ago

For reference, here’s the IETF thread I mentioned. I was wrong; they didn’t state they needed more implementations before discussing it. I had that confused with something else, I think.

https://mailarchive.ietf.org/arch/msg/ietf/8h5cs3CHzYpbKIOc4FxO730OD4o

ctavan commented 4 years ago

@bradleypeabody as you can see from the thread above we have been discussing your UUID v6 proposal here. The linked email thread on the IETF mailing list seems to end without any conclusion. Would you like to share some more insight on what happened to your initial v6 proposal? How do you think about it today?

bradleypeabody commented 4 years ago

@ctavan Thanks for asking. To summarize some thoughts on it:

The v6 idea I came up with before was intended to be a small change to the existing standard in order to preserve backward compatibility.
In retrospect, it is not clear how much use backward compatibility is or how much importance should be placed on it. For example, UUIDs use hex encoding in the string representation. This plus the useless dashes, is rather wasteful (more than double the space). This in turns leads to wanting to store it in binary format in memory and on disk, but then needing to show it in a human readable format for users, urls and debug, but then if you do this in a database you'd have to implement this display logic there, etc. - it sort of becomes a mess. (I ended up creating a different format using a base64 variation which I use in a lot of my own applications, it's just more practical.)
And all of that for what benefit? So some other program can potentially look and say "oh this is a v6 UUID and here is the timestamp" - that's the only benefit I can see to the rather complex existing scheme of a UUID and it doesn't seem to be worth forcing such a rigid format for this one case. If an application wants to encode timestamps into its IDs it makes sense that that could be a supported option, but the idea that any UUID should always have a standard place for its timestamp seems like a specific application requirement that has needlessly bled into the standard (IMO). (I realize there are the other UUID types that don't have a timestamp, I guess what I'm getting at is that the entire scheme is quite complicated and yet has very little use.)
It might instead be more useful to just start over with a format that meets some of the most common needs one finds today, for example:
- Compact text format. It might make sense to have a few different options. Some applications will want IDs that are easy to use in URLs, some might want ones that insensitive to case changes. Using a format that sorts in a logical way in ASCII or UTF-8 is also useful for many applications. I think working out a short list of useful encodings and letting the application/user pick could be a good way to go.
- Variable amounts of random data. Some applications may want a higher guarantee of randomness than others. The spec can include some suggestions but it would make sense to have this be configurable.
- Some variations on the time format. Encoding timestamp information can be useful for several reasons (sorting by creation time, db index locality, an alternative to a creation timestamp). It should be optional. And some applications might only care about precision down to the second, or even as coarse as the day. Others might want nanosecond precision. Again, a short list of useful options can be made without too much trouble. The timestamp should also be arranged so in conjunction with the text format one can make IDs that sort correctly in ASCII/UTF-8 by date sequence. This helps with database index locality (new records generally end up close to each other and on the same page(s)) and is also just useful.
- If an application knows what settings were used for options like the above then it should be able to decode the UUID into its parts (e.g. to extract the timestamp). If not, then it should treat it as an opaque string. The ability to decode a UUID is application-specific and thus we should not enforce extra baggage on all applications just because a few need to extract a time stamp.
- And I agree that several of things from the existing spec like clock sequence are not particularly useful and often not implemented correctly or at all anyway. Timestamps, randomness, text encoding: those appear to be the important issues. Anything that does not directly address important issues should be removed for simplicity.
- Perhaps some sort of shorthand could be made of how to indicate the above options. I'm not sure what it would be, but if a standard is created and it catches on I can see it being useful to be able to tell your database "give me a UUID column with a second-precise timestamp, 64 bits of random data, encoded as url-safe base64" and that this could have some standard string representation like (just as an example off the top of my head, totally needs more thought and discussion) "tsec-r64-url".

Those are my thoughts. I'd be interested in trying to put together a draft with anyone else who wants to work on it. Probably the first step would be agreeing on the high level requirements and then summarizing those in a doc, followed by some initial idea of what the spec might look like. If someone wants to take the node/JS implementation, I can do a Go implementation and probably a C one too.

ctavan commented 4 years ago

Thank you for your write up, @bradleypeabody! I'm definitely happy to help exploring around this topic (and potentially contributing a JS impl). I'm also fine with moving the discussion over into your repo.

felixge commented 4 years ago

This in turns leads to wanting to store it in binary format in memory and on disk, but then needing to show it in a human readable format for users, urls and debug, but then if you do this in a database you'd have to implement this display logic there, etc. - it sort of becomes a mess.

Depends on the database. PostgreSQL turns text UUIDs into binary and back transparently to the point that application developers don't have to worry about it much.

In fact, using anything other than UUID will force you to reimplement this text<->binary conversion yourself.

(I have no horse in this race, just wanted to point this out)

ctavan commented 4 years ago

@felixge I bet you have only been waiting to implement such a conversion as a PostgreSQL extension, right 😉?!

felixge commented 4 years ago

@ctavan my comment was actually a warning against throwing the baby out with the bathwater ; ). Additionally, IIRC, PostgreSQL doesn't care how your UUID looks and if certain version related bits are set. Any 128 bit value is accepted.

So not sure if a new id format is needed. There is also a lot of prior art in this space that should be considered.

bradleypeabody commented 4 years ago

@felixge It's a valid point for sure. I do definitely think a new format is needed, simply because the existing one doesn't provide much functionality for most applications and the whole setup can be made simpler. Allowing variable widths for both time and random bytes I feel is extremely useful for a wide range of applications.

I also think that many implementations can just use (store) the string format directly if it is compact enough.

However, it is certainly possible to combine the idea of multiple text formats with having one that is backward compatible. The hex encoding form that is used along with the bit positions, it's not terribly complicated to encode - as you say it's mostly a matter of putting the version number in the right place. I believe Cassandra's UUID support is like this too and I'm sure others. I'll add to the list of ideas to review.

bradleypeabody commented 4 years ago

For anyone interested, an IETF draft for UUID version 6 has now been submitted. We'll see what happens.

https://tools.ietf.org/html/draft-peabody-dispatch-new-uuid-format-00 https://github.com/uuid6/uuid6-ietf-draft/blob/master/draft-peabody-dispatch-new-uuid-format-00.txt https://datatracker.ietf.org/submit/status/109939/

ramsey commented 4 years ago

@bradleypeabody When you submit, does it automatically open up a discussion thread on the IETF mailing list? If so, which mailing list (I'm subscribed to the general one, but I don't see a thread for it).

bradleypeabody commented 4 years ago

@ramsey I believe the discussion is separate from the submission. I did a bit more discussion on the main IETF mailing list (ietf at ietf dot org) and the upshot of it was to submit a draft that lays out the proposal in detail - with particular attention to why it's needed. From my understanding the IETF meets quarterly and their next meeting is in a few weeks, which is when it would get some official feedback (although I don't know what form that feedback takes - this is new to me). I'll also reach out separately and see if I can get some more interest in the subject in general. I did get some great responses on the mailing list but the case of why this is needed will definitely require some advocacy to make it happen (to become an actual RFC and hopefully get some adoption).

tanglebones commented 4 years ago

@kael-shipman if you want something like the proposed uuid v6 I use https://github.com/tanglebones/pg_tuid which time prefixes the top bits of a UUIDv4.

broofa commented 3 years ago

Closing in preference to #580. There's been some good discussion here, but it's getting a bit long in the tooth and I suspect most/all of it is stale compared to the IETF conversations that have been taking place.

uuidjs / uuid

Support for comb v4 UUIDs #303