prooph / event-store

PHP 7.4 EventStore Implementation
http://getprooph.org
BSD 3-Clause "New" or "Revised" License
547 stars 74 forks source link

Use ULID instead of autoincrement no #361

Closed sandrokeil closed 4 years ago

sandrokeil commented 5 years ago

We can get rid of the autoincrement requirement for an event store if we use Universally Unique Lexicographically Sortable Identifier. This would be a huge improvement.

The assumption why autoincrement no is needed is:

  1. Sort events ASC or DESC
  2. Get events prior or after specific number

Try it out with php-ulid.

        $docs = [];

        for ($i=0; $i<100; $i++) {
            $docs[] = [
                'position' => $i,
                'ulid' => (string)Ulid::generate(true),
            ];
        }

It works in ArangoDB for instance.

FOR u IN transactionCol
  FILTER u.ulid <= "01d4dn314dz3pp0n423pfggjv2"
  SORT u.ulid ASC
  RETURN u
codeliner commented 5 years ago

Very interesting idea :+1:

But there is a problem:

Within the same millisecond, sort order is not guaranteed

This means, we cannot rely on the event ulid, but would need a single writer process sitting in front of the event store server to generate a ulid for each event in a long running process.

The problem is: timestamp bits are followed by random bits. When using a monotonicFactory, the random part should be incremented within the same millisecond.

But if two different PHP processes create events at the same millisecond they would use different monotonic factory instances and therefor different random sequences.

A sequence-like ulid generator in the database could help, at least if you only have a single master instance. Or a centralized ulid microservice.

For event store v8 @prolic is planing a single writer process. So maybe ulid is a great alternative for v8 ?

sandrokeil commented 5 years ago

Oh, only have read Monotonic sort order (correctly detects and handles the same millisecond) and have tried it out. A single writer could increment the position number, so no auto increment is needed and should be more safety.

sandrokeil commented 5 years ago

But if two different PHP processes create events at the same millisecond they would use different monotonic factory instances and therefor different random sequences.

This should be the same case with auto increment and no single writer https://github.com/prooph/pdo-event-store/issues/189

Looks like there is no escape from the single writer approach ;-)

fritz-gerneth commented 5 years ago

Pretty much what we are looking for is a mean to creating sortable / monotonic IDs without a single node. Unfortunately there is no way for this. All approaches either take a single node to generate this (ticket servers / auto increment in DB or redis, twitter's IDs ... single writters in general ) or rely on large numbers to make collisions close to impossible (UUID flavors, ulid).

But if two different PHP processes create events at the same millisecond they would use different monotonic factory instances and therefor different random sequences.

This ticket pretty much sums up the challenges, even within the same monotonic factory.

This should be the same case with auto increment and no single writer prooph/pdo-event-store#189

Auto-increment is a single writer approach. Auto-increment in multi-master setups is a while different beast. My PR solves a different kind of issue: that IDs become visible to reads in different order, making it a gap-detection mechansim technically - the gap detection is to guaratenee there are no read-gaps on the write side. That still would be an issue with any different ID generator. A different ID generator function would require a different gap detection (e.g. on the read-side, similar to what axon does).

Personally I do not see the value of that new kind of ID format. See similiar discussion here and here . So if at all I'd go with UUID1. It has strict linear IDs instead of evenly spread across ID space, which is a key aspect for sorting & indexing ( interesting read) .
But then even UUIDs have one problem: given two events you do not know if there is an event in between or not (e.g. did we miss an effect in our read?).

I think there are a few aspects that needs to be balanced:

Based on that I see three approaches:

sandrokeil commented 5 years ago

Discussed today with @codeliner and maybe this would be a possible solution.

I don't understand how the single writer approach works with eventstore.org or other databases because you need a queue and before you can put something to the queue, you have to check the event stream constraints e.g. unique index. Therefore you have to queue all requests until the previous was saved or rejected. Maybe this works with php-fpm (one child) or swoole (one worker).

If you listen to a stream the position is not relevant, only the order of appearance. The position is only needed for first start or replay if a change stream mechanism is used.

It don't has to be a sequential number. ULID should works fine with a single position handler, so you don't have to get the last used sequential number.

fritz-gerneth commented 5 years ago

• Use a dedicated single position handler per event stream table/collection which updates events with the correct position via change stream, stored procedures, triggers or whatever.

I see a few problems with this approach:

Therefore you have to queue all requests until the previous was saved or rejected. Maybe this works with php-fpm (one child) or swoole (one worker).

Not necessarily. MySQL can do this for you too. The single writer does not need to be a php process but can be the MySQL process too (e.g. with locking). Comes with the benefit that those problems are already solved.

codeliner commented 5 years ago

After our discussion I recognized a different problem:

If you listen to a stream the position is not relevant, only the order of appearance.

This assumption is more complicated. Basically it is correct but we need to be careful! A projection is interested in the event order produced by aggregates, not only the order of appearance.

One can say: For non aggregate events and/or events of different aggregates a projection is only interested in the order of appearance

This means, that observing the Write-Ahead-Log of a database and make events visible for projections only if they appear in the log could work.

But what about aggregate events? A projection likely depends on the exact order of events produced by a single aggregate.

A simple example is: The projection sees an "AggregateUpdated" event before it sees the "AggregateCreated" event. Or even worse: due to a gap never sees the latter.

A single writer can solve both problems, a WAL-observer can only solve the first one, except it can handle aggregate event version gaps somehow.

A single writer does not scale, but EventStore demonstrates that it can handle a lot events per second. If you need to scale, you'd need some kind of partitioning, though. In fact, you only need a single writer per aggregate. A master process could receive events, add a sequential stream position to each event and makes sure that all events of the same aggregate are routed to the same worker process, at least within a range of let's say 10 seconds:

1ms AR1.EVT1 --> MasterProc: StreamPos 1 --> Worker1 --> write to stream
1ms AR1.EVT2 --> MasterProc: StreamPos 2 --> Worker1 --> write to stream
1ms AR2.EVT1 --> MasterProc: StreamPos 3 --> Worker2 --> write to stream
1ms AR3.EVT1 --> MasterProc: StreamPos 4 --> Worker3 --> write to stream
2ms AR2.EVT2 --> MasterProc: StreamPos 5 --> Worker2 --> write to stream

.... 10 seconds later MasterProc can pick another Worker for AR1 because last AR1 event was handled more than 10s before this one

10001ms AR1.EVT3 --> MasterProc: StreamPos 6 --> Worker3 --> write to stream
codeliner commented 5 years ago

@fritz-gerneth

  • first it increases the amount of hops an event has to take until it becomes readable by projections. Assuming a polling interval of 50ms for the sequence writer process and the projection itself this would double the time for events to be readble.

We're not talking about polling but instead observing the Write-Ahead-Log of the database. In MongoDB you even have a public API for it: https://docs.mongodb.com/manual/reference/method/db.collection.watch/

With ArangoDB you can do something similar: https://www.arangodb.com/2017/03/arangochair-tool-listening-changes-arangodb/

But you're right. Time until a projection sees an event is definitely longer.

RobThree commented 5 years ago

@fritz-gerneth

Pretty much what we are looking for is a mean to creating sortable / monotonic IDs without a single node. Unfortunately there is no way for this.

Hmmm, did you consider IdGen?

Disclaimer: author here (I am also familiar with ULID's, that's how I stumbled upon this page).

Within the same millisecond, sort order is not guaranteed

With IdGen you will have sort order within the same millisecond, but you'll sacrifice sort order within hosts/shards/processes/whatever-you-use-to-assign-generator-id's. Maybe that is an option for you?

It designed to be distributed and not requiring a single point of failure like a ticket server you mention. It is modelled after Twitter's snowflake (which didn't require a "spof" either, AFAIK).

prolic commented 4 years ago

Closing due to lack of activity.

sandrokeil commented 4 years ago

FYI: Maybe UUID v6 can be used as primary key, because it supports sorting. But maybe needs native database support for this.

kochen commented 4 years ago

prooph/common v4.4 supports ramsey/uuid v4.0, which supports (the same) UUIDv6