Integrate Snowflake IDs for statuses

Gargron commented 7 years ago

Snowflake IDs are an idea that is used by Twitter, Discord, Instagram and many more. Here is the problem layout:

Timelines are ordered by unique IDs, this allows keyset pagination (since_id, max_id)
Timelines are meant to be chronological
This is not an issue for local toots, because they get an assigned ID when they are created
However, federated toots may arrive with various delays
There are processes that "pull" remote toots into the system as well (thread resolving, the search for toot URL thing)
In those cases, they also get an ID assigned when they are created locally, which can differ wildly from what their created_at timestamp is

A snowflake ID is a big integer that consists of a precise timestamp, a sequence mask and a couple more optional parts to ensure uniqueness, depending on how many different generators run at the same time (i.e. different workers/shards/regions etc)

Because the outermost part of a snowflake ID is a timestamp, chronological sorting is built in. And while the timestamp is most safely-unique when it contains precise milliseconds, I think those can be padded for remote toots (whose timestamp does not carry ms precision)

[x] All status ID columns in the database schema are bigint
[ ] A snowflake generator must be integrated for status IDs instead of psql's autoincrement
[ ] The snowflake generator must accept a timestamp parameter from the past

This is a somewhat complicated area, and I welcome any help & expertise on the issue. Here are a couple of resources for reference:

Another note is that Twitter, Instagram &co can write separate high-performance daemons for ID generation. A separate process just for that would ensure that all the different web and sidekiq workers would be calling the same ID generator, thus ensuring that the sequence part in the ID is unique. That would be a simple thing to do, however there is a limit to how many different processes I want to impose on small instance admins (there is already: web, sidekiq, streaming)

Another approach is having the ID generator in Ruby code. However, because there is a variable amount of web and sidekiq workers that could call this code in each of their separate processes, they wouldn't share the sequence part. As I mentioned earlier, this could be mitigated by a "worker index" part in the snowflake ID. However, the problem is that the various web and sidekiq workers do not know information like "I am worker number 2 out of 5".

The first linked article describes the generator at the Postgres schema level. This is the most promising angle, because Postgres is definitely the one source of truth in any deployment. However, I have no experience writing custom Postgres functions.

Thanks for reading and hope we can figure something out!

mbilokonsky commented 7 years ago

IIRC Twitter goes so far as to encode a variety of additional data into their ID, the timestamp is a part of it but they have user and network metadata in there as well.

We used that as inspiration for our IDs in a distributed database I used to work on. If you're willing to let the id be kinda long, you can even do something like:

[128bit timestamp][32bit host IP][variable-length username hash][2bit privacy level]

Or something like that. That way you can get a ton of data about the post just from the ID, without having to load and deserialize the whole post. Not sure if this is exactly of interest (and not sure how you'd prevent people from spoofing it, in a wide-open federated system) but it's an interesting space for sure!

Gargron commented 7 years ago

Oh, this ID is not for federation purposes!

The federation specs use a Unique Resource Identifier, which looks something like tag:example.com;2017-02-02-13:30;objectType=Status;objectId=1234566 - such format is not technically mandated by anything, but supposed to be unique on the web (hence namespacing by domain), although it is widespread among RSS/Atom feeds in general and GNU social in particular.

The URI is not used for sorting or paginating anything, unlike the local ID: dynamic timelines (local, federated, hashtag, profiles etc) are sorted by ID; and home timelines which are really redis sorted sets, are keyed by ID. Again, created_at cannot be used for that purpose, not because of indexing, but because it's not unique, and pagination by max_id/since_id requires that.

Encoding meta data into the ID is definitely an idea!

liberapay-admin commented 7 years ago

If you only want a unique local ID for time-based pagination, then why not simply concatenate the created_at timestamp and the existing auto-incremented ID?

Edit: oops, didn't mean to post with this account, switching to @Changaco.

Gargron commented 7 years ago

[timestamp][autoincremented id instead of any overflowing sequence id]

Hmm that's so simple I have to wonder if there's something wrong with it.

KirinDave commented 7 years ago

If you're pushing it into the database you don't need a lot of extra stuff, you just need some disambiguation inside the timestamp. I can only think of a two downsides to pushing it into the DB: It couples you even more tightly to PSQL, and it means that posts don't have IDs until they're committed to the database (which reduces the quality of your logs slightly in the event that PSQL can't keep up).

riking commented 7 years ago

this could be mitigated by a "worker index" part in the snowflake ID. However, the problem is that the various web and sidekiq workers do not know information like "I am worker number 2 out of 5".

You can use the PID and a few random bytes (2?), that are invalidated in an after_fork method called every time the process forks in the child. discourse@discourse: search for after_fork https://github.com/discourse/discourse/blob/f5f54c1b77003379af00b88276fd5d32634f89ee/lib/discourse.rb#L352

module Mastodon
class Snowflake
class << self
  def after_fork
    @process_id = nil
  end

  def process_id
    @process_id ||= SecureRandom.random_bytes(4)
  end
end
end
end

edit:

Hmm that's so simple I have to wonder if there's something wrong with it.

As long as the pg database is our single source of truth - go for it! Make it a stored procedure that calls the sequence generator and takes the target timestamp.

seanlinsley commented 7 years ago

Again, created_at cannot be used for that purpose, not because of indexing, but because it's not unique, and pagination by max_id/since_id requires that.

Why do max_id / since_id require that it be unique, exactly?

Changaco commented 7 years ago

@seanlinsley Because pagination requires a stable order, and you can't do a stable sort with non-unique keys.

stash commented 7 years ago

[timestamp][serial][worker_unique] - lexicographic sorts by time and is unique. Need to use a consistently-cased hex timestamp (mixed case won't sort right; force lower or upper, doesn't matter which, just needs to be the same).

timestamp can be a millisecond clock or better, though you probably want to format it as 64-bit lowercase hex to preserve sorting past 2038. Probably 48-bit, lopping off the most significant bits, is also sufficiently long.
The serial value is just an incremented global within a server process. Could just be a wrapping counter. Sole purpose is to support multiple toots per millisecond. Unsigned 16 bits will give you uniqueness up to 65 megatoots per second per process.
The worker_unique is a worker unique number. I'd recommend a 128-bit secure-random (e.g. SecureRandom.random_bytes(16), like how @riking above proposed).

So, 64 + 16 + 128 = 208 bits. 208 / 8 = 26 bytes. 26 * 2 = 52 hex characters. Or, if this is too long for your storage model, could go to 48 characters if use instead a 48-bit timestamp instead.

Security considerations:

There might be some shenanigans hackers could pull with the timestamp, like detecting what individual servers think the current time is, but probably this is unavoidable given the requirements.
If you don't want to reveal the freshness of a process to hackers -- which can be useful if they're trying to crash a worker and receive confirmation that it worked. Recommend starting serial at a secure-random offset at process launch if this is a concern.
It's not worth exposing the internal IP address, MAC address, or PID to hackers. Recommend just rely on the CSPRNG for worker_unique. Admins of the OS will need to ensure good seed entropy at boot, so that should be documented in the setup guide (and is generally great advice anyway).

seanlinsley commented 7 years ago

@Changaco why exactly does the order have to be stable? The UI shouldn't care, and anyone consuming the atom feeds should be using the Unique Resource Identifier.

Can't we just order by created_at, id in SQL?

alexduf commented 7 years ago

Assuming I understood what the problem was (a unique ID, that allow sorting consistently), you want to protect yourself from having two messages at the same ms.

That being said, once two messages are posted at the same ms, nothing really tells you which one was first, which one was second. There's no option more valid than the other.

In other words:

[timestamp][random number]

You can put that computation anywhere you like, no need to keep a state.

alexduf commented 7 years ago

And if anyone is worried two random number may collide, it's just a matter of picking a big enough random number. UUIDs are a good example, and very unlikely to collide: https://en.wikipedia.org/wiki/Universally_unique_identifier#Collisions

riking commented 7 years ago

@alexduf The point of the serial part is performance - generating real random numbers takes time, and by using a 16bit serial you cut the load on the RNG by a factor of 65k.

Changaco commented 7 years ago

@seanlinsley If the order isn't stable and many messages share the exact same timestamp then pagination wouldn't work reliably.

Can't we just order by created_at, id in SQL?

That's what I meant by "concatenate the created_at timestamp and the existing auto-incremented ID". It's stable and simple to implement.

alexduf commented 7 years ago

@riking does it need to be a "real random number"? We usually seek proper randomness for security application, here I think we're just trying to sort on that id and make it somewhat unique. When it comes to performance, I doubt the CPU used for random number generation would be the bottleneck. Most probably you'd bottleneck on the database or network or even json parsing and serialisation long before random number generation is a problem.

Especially as the operation should be executed only once per message posted.

Or have I missed anything?

Gargron commented 7 years ago

@Changaco Your solution sounds like the simplest one, and probably the best. I am trying to figure out if there are any drawbacks. I'm assuming the reason Snowflake's sequencer was invented because they didn't want to generate the ID at a database level, or had multiple databases, which is not a concern for Mastodon's comparatively small target deployment (i.e. we'll never go cassandra etc)

When remote servers provide their messages' timestamps... And so those could be picked deliberately... Is there any risk of collisions, if a part of the ID is still a unique autoincrement? It seems like there wouldn't be a problem, just making sure.

Changaco commented 7 years ago

I'm not familiar with Snowflake IDs but I assume the reason big platforms need them is that they want globally unique IDs, while you only want locally unique IDs that allow time-based sorting.

As long as the auto-incremented ID is unique you can concatenate it with whatever you want and the result will still be unique, there's no way to create a collision.

alexduf commented 7 years ago

I know I'm a bit stubborn, but it's more likely to have two ids colliding because two nodes have the same amount of messages at the same ms, than two big random number colliding at the same ms.

Gargron commented 7 years ago

@alexduf I am most likely going with timestamp+uid, but in snowflake's defence, you can't have same number of messages per ms colliding on different workers because the workers self-identify and include their index in the end output

Gargron commented 7 years ago

Looks like default-value functions in Postgres can't see the record being inserted, so can't use the given timestamp. However, it is possible to reserve the autoincrement value by select nextval('name'), then use that in a calculation in ruby and insert that ID explicitly, without relying on default value.

Does that sound reasonable?

shadowcat-mst commented 7 years ago

default value functions in postgres are the wrong answer for this.

The simple answer is to simply use (timestamp, uid) as a two column ORDER BY key and to treat the combined denormalised form used to work around redis' more limited sorting capabilities as the implementation detail that it is.

If you genuinely believe that turning two integers into a single integer is going to be an unacceptable performance overhead, you could also use an ON INSERT trigger to create a denormalised combined column on the way in - but honestly, if doing that combination in ruby is considered fast enough, then it seems highly unlikely to me that this is expected to be your bottleneck.

Changaco commented 7 years ago

+1, I don't see any reason to change the schema, except for the creation of a unique index on (timestamp DESC, id DESC) probably.

mjankowski commented 7 years ago

Reading through the backlog, it seems like there are a few ideas in here:

It might be nice to have snowflake IDs as part of a solution for unique status IDs across the entire federation (this doesn't solve an urgent problem, because URIs already have this quality, it's just nice?)
If we want/need snowflake IDs, there are various approaches of hashing and random numbers that might be useful in that case
It would be useful to have snowflake IDs with embedded order information on local instances, because it would preserve our keyset pagination ability and makes inserting historical statuses (previous mentioned backfill issue) easier/possible
It might be possible to not bother with snowflake IDs and just attempt to use an (indexed) timestamp+ID combination to preserve pagination

Here a few side thoughts which are not strictly about snowflakes but might be relevant...

I always try to avoid using the postgres-provided database IDs as a trusted source of order information. I assume they are unique values within that postgres instance, and that's about it. If postgres changed their implementation to generate random numbers or hash strings instead of incrementing numeric IDs, that should be ok, because if I only treat them as unique identifiers without other qualities, that stays true. Given that, IDs are useful to have as a "tiebreaker" in ordering when you've already used whatever meaningful sorting criteria there is, but they are not as useful as the primary ordering value.

In Rails apps, I like to let the ORM "own" the created_at column, and have this value reflect "this is the time at which this local record was inserted into the database", and not try to assign more meaning to it than that. So, in Mastodon case, for our remotely generated statuses, I think it would make sense to have a published_at column which tracks the date value we receive from the status feed, and not assign created_at to this time. This would lead to records where the created_at is at a future datetime than the published_at (because there's at least a small delay in getting things from server A creation to server B, and there's a potentially very long delay if we are backfilling on new follows), but that's fine. For date ordered things we'd be sorting by something like (published_at desc, ID desc) and we'd expect that published_at gets us 99% of the way there, and when it needs a tiebreaker it's in a case where two events were created within the same second, and for our purposes here it probably doesn't matter how we break that tie (it matters in like, financial markets, but probably not federated microblogging?!)

(As a side note to my side note, what happens with timelines now if a remote server "lies" about publication dates? Like, if instance A publishes a status at time X, and instance B claims that a reply to that status happened 3 days before X ... what does instance A do with that when it gets it?)

Anyway, given all that, I think I agree with the sentiment in the last few comments that prioritizing things like:

First, try to switch to a (timestamp, id) ordering approach
Assuming that works, maybe think about created_at vs published_at
If there are still issues to solve, explore snowflake ID or other more complex unique ID generation

mastodon / mastodon

Integrate Snowflake IDs for statuses #1059