scylladb / scylladb

NoSQL data store using the seastar framework, compatible with Apache Cassandra
http://scylladb.com
GNU Affero General Public License v3.0

Possible feature: Flake IDs #7309

Open aphyr opened 4 years ago

aphyr commented 4 years ago

Following up on a discussion with @kostja and #7170: it might be nice for Scylla to offer Flake IDs for server-side timestamp generation.

If you're not familiar, [Flake IDs](http://yellerapp.com/posts/2015-02-09-flake-ids.html) are a mechanism for allocating roughly time-ordered, guaranteed-unique identifiers without coordination. After a single round of coordination to establish a node ID (e.g. at cluster join time), you construct identifiers like [time, counter, node] packed into a single integer. Node identifiers guarantee uniqueness, and using time (and, where multiple IDs are allocated at the same timestamp, counters) for the high bits gives you identifiers which are roughly sorted in time. You could also use this approach to translate client-provided timestamps into guaranteed-unique ones.
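The [time, counter, node] packing described above can be sketched as follows. This is an illustrative sketch, not Scylla's or any particular Flake implementation's layout: the field widths (millisecond timestamp, 12-bit counter, 10-bit node id) and the class name are assumptions chosen for the example.

```python
import threading
import time

class FlakeIdGenerator:
    """Illustrative Flake-style ID allocator: [time | counter | node]."""
    COUNTER_BITS = 12  # up to 4096 IDs per millisecond per node (assumed width)
    NODE_BITS = 10     # up to 1024 nodes (assumed width)

    def __init__(self, node_id):
        assert 0 <= node_id < (1 << self.NODE_BITS)
        self.node_id = node_id
        self.last_ms = -1
        self.counter = 0
        self.lock = threading.Lock()

    def next_id(self):
        with self.lock:
            now_ms = int(time.time() * 1000)
            if now_ms == self.last_ms:
                # Same millisecond: bump the counter so the ID stays unique.
                self.counter += 1
            else:
                # New millisecond: reset the counter. (A real implementation
                # would also have to handle the clock moving backwards.)
                self.last_ms = now_ms
                self.counter = 0
            return (now_ms << (self.COUNTER_BITS + self.NODE_BITS)) \
                 | (self.counter << self.NODE_BITS) \
                 | self.node_id
```

Because the node id occupies the low bits, generators on different nodes can never collide, and because time occupies the high bits, IDs sort roughly by allocation time.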

Why would you want unique timestamps? Because they prevent timestamp collisions, and without timestamp collisions, you might be able to recover row-level isolation for primitive (e.g. non-CQL-collection) writes!

nyh commented 4 years ago

I proposed something very similar on the Scylla mailing list, but @avikivity rightly noted that it will only work in a straightforward and efficient manner for server-side timestamps, because there is a relatively small number of server nodes, and an easy way to coordinate their ids as small integers. We need the ids to be small integers if we want to fit the whole thing - id and time at an acceptable resolution and range - in 64 bits. If we can't fit it in 64 bits, there is a performance penalty, and we need to change a lot of existing code, disk formats, and so on. Flake ids - and the similar implementation that Scylla has (timeuuids) - are both 128 bits.

But unfortunately, Cassandra drivers default to client-side timestamps, not server-side timestamps, so we can't just ignore this case, and here we can potentially have a very large number of clients with different ids, making the 64-bit problem above a real one. Moreover, client-side ids also complicate the implementation of the id coordination, although it is not too difficult. For example, a client can receive from a server a "lease" - an ID and a timestamp range to use with it.
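The "lease" idea above could look something like the following sketch. Everything here - the class names, the 12-bit id width, the lease duration, and the allocation policy - is invented for illustration; the comment only proposes the concept, not a concrete protocol.

```python
from dataclasses import dataclass

@dataclass
class Lease:
    client_id: int       # small integer id, unique among live leases
    valid_from_us: int   # start of the timestamp range (microseconds)
    valid_until_us: int  # end of the range; the client must renew before this

class LeaseServer:
    """Hypothetical server-side allocator handing out client-id leases."""

    def __init__(self, id_bits=12, lease_len_us=60_000_000):
        self.id_bits = id_bits
        self.lease_len_us = lease_len_us
        self.next_free = 0

    def grant(self, now_us):
        # Naive allocation: hand out ids sequentially. A real server would
        # recycle the ids of expired leases and persist the allocation state.
        if self.next_free >= (1 << self.id_bits):
            raise RuntimeError("no free client ids")
        lease = Lease(self.next_free, now_us, now_us + self.lease_len_us)
        self.next_free += 1
        return lease
```

While a lease is valid, the client can mint unique timestamps locally by combining its leased id with timestamps drawn from the leased range, with no further coordination.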

aphyr commented 4 years ago

Flake ids - and the similar implementation that Scylla has (timeuuids) - are both 128 bits.

You can use any size. Boundary's Flakes were 160 bits, for example. :-)

Moreover, client-side ids also complicates the implementation of the id coordination

Yeah. You've got a few options here--you can have clients lease IDs from servers, or, (like I mentioned here and in #7170), you can translate client-provided timestamps to Flake IDs by, say, shifting them up by a few bits for internal use, and filling the low bits with the server ID. 10 bits gets you up to 1024 servers, and leaves you 200+ years of operational runway.
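The shift-and-fill translation described above is simple enough to sketch directly. The 10-bit server-id width follows the comment; the function names and the microsecond unit are assumptions for the example.

```python
SERVER_BITS = 10  # up to 1024 servers, per the comment above

def to_internal(client_ts_us, server_id):
    """Translate a client-provided timestamp into a unique internal one
    by shifting it up and filling the low bits with the server id."""
    assert 0 <= server_id < (1 << SERVER_BITS)
    return (client_ts_us << SERVER_BITS) | server_id

def to_client(internal_ts):
    """Recover the original client timestamp by dropping the server bits."""
    return internal_ts >> SERVER_BITS
```

Two different servers can then never produce colliding internal timestamps from the same client timestamp, and ordering of distinct client timestamps is preserved, since the shift is monotonic.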

nyh commented 4 years ago

Flake ids - and the similar implementation that Scylla has (timeuuids) - are both 128 bits.

You can use any size. Boundary's Flakes were 160 bits, for example. :-)

Yes, as I said, we can use any size, but if we want to move away from the 64-bit timestamps we have today to something bigger, we'll need to change all the memory structures, disk formats, and protocols which today carry these 64-bit timestamps. It also means changing the clients, and pretty much throwing Cassandra compatibility out the window :-( So it's not an easy decision to make.

Moreover, client-side ids also complicates the implementation of the id coordination

Yeah. You've got a few options here--you can have clients lease IDs from servers, or, (like I mentioned here and in #7170), you can translate client-provided timestamps to Flake IDs by, say, shifting them up by a few bits for internal use, and filling the low bits with the server ID. 10 bits gets you up to 1024 servers, and leaves you 200+ years of operational runway.

This is true. In the past I suggested 12 bits (allowing up to 4096 clients); the remaining 52 bits allow, if we assume each client can do up to one million requests per second, a span of 142 years. Except that @avikivity thought (and I tend to agree) that limiting ourselves to only 4096 clients is too restrictive. But if we limit each client to "just" half a million requests per second, and the database life span to 35 years, we can get 15 bits for the client id. That's 32768 different clients. Maybe that's more than enough - and it would allow us to continue to use 64-bit timestamps. Maybe we should consider doing this.
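The bit-budget arithmetic above can be checked with a few lines. The helper name is ours; the numbers are the ones from the comment.

```python
def lifespan_years(id_bits, requests_per_sec, total_bits=64):
    """Years a 64-bit counter lasts after reserving id_bits for client ids,
    assuming each client consumes timestamps at requests_per_sec."""
    remaining_bits = total_bits - id_bits
    seconds = (1 << remaining_bits) / requests_per_sec
    return seconds / (365.25 * 24 * 3600)

# 12 id bits (4096 clients) at 1M requests/sec  -> roughly 142 years
# 15 id bits (32768 clients) at 500k requests/sec -> roughly 35 years
```

So the trade-off is exactly as stated: each extra id bit doubles the client count but halves the lifespan at a given request rate.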

kostja commented 3 years ago

@espindola writes:

I would consider it more user friendly to break timestamp ties with the sha1 (with a per execution salt) of the entire update. With that we could get 1,1,-1 or 1,2,-2, but not any mix.

This sounds like a fine option to me. @aphyr, what do you think?
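The suggestion quoted above - breaking timestamp ties with a salted SHA-1 of the whole update - might be sketched as follows. The function name, the sort-key representation, and the byte encoding of the update are assumptions for illustration only.

```python
import hashlib
import os

# One salt per execution, so the tie-break order differs between runs
# but is consistent within a run.
SALT = os.urandom(16)

def cell_order_key(timestamp, update_bytes):
    """Sort key for a cell: compare by timestamp first, then by the
    salted SHA-1 of the entire update the cell came from."""
    tiebreak = hashlib.sha1(SALT + update_bytes).digest()
    return (timestamp, tiebreak)
```

Because every cell written by one update carries the same hash, two updates with equal timestamps are ordered consistently across all their cells: one update wins everywhere, giving 1,1,-1 or 1,2,-2 but never a mix.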

aphyr commented 3 years ago

I think that's valid--would you pack the sha1 into each cell or something?