neos / Neos.EventSourcing

A library for Event Sourcing and CQRS for Flow projects.

Revise Schema #43

Closed: bwaidelich closed this issue 7 years ago

bwaidelich commented 7 years ago

From @bwaidelich on September 23, 2016 12:56

The current schema for the es_stream table:

stream_name (TEXT)

Makes a lot of sense of course, but I think it could be renamed to streamname to be in sync with the other tables. Apart from that, I would suggest storing the stream name in the format <BoundedContext>:<AggregateName>:<AggregateId> instead of <FullyQualifiedAggregateClassName>:<AggregateId> (where <BoundedContext> could be just the PackageKey).
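For illustration (package key and identifier made up), a stream name would then read

Neos.ContentRepository:Workspace:7f1e8a3c-2d4b-4a6e-9c5d-1b2a3c4d5e6f

instead of

Neos\ContentRepository\Domain\Model\Workspace:7f1e8a3c-2d4b-4a6e-9c5d-1b2a3c4d5e6f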

stream_name_hash (VARCHAR 32)

Why do we need that? If it's about performance optimization, I'd really like to test this with some realistic numbers (or postpone it until we run into problems).

commit_version / event_version (BIGINT)

In the code, commit_version is documented as "Version of the aggregate after event was recorded" and event_version as "Version of the event in the current commit", but I still don't get the difference and why we need it. I think that an event in the table needs a sequential number within its stream as well as a global sequential number to ensure deterministic ordering (see below). Nitpick: I called it "version" before, but actually it's the event number or index within its stream.

event (TEXT)

I'd suggest storing the event type and the event payload separately. As with the streamName, the event type could be stored as <BoundedContext>:<AggregateType>:<EventType> instead of <FullyQualifiedEventClassName>. This would be easier to read and less likely to change. And the event payload could just be stored as JSON (maybe using the native JSON datatype if it makes sense).
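As a minimal sketch of what such a row could look like, using the example table proposed under "Global order" below (all values made up):

INSERT INTO neos_eventstore_events (stream, number, savedat, type, payload, metadata)
VALUES (
  'Neos.ContentRepository:Workspace:7f1e8a3c-2d4b-4a6e-9c5d-1b2a3c4d5e6f',
  1,
  UTC_TIMESTAMP(),
  'Neos.ContentRepository:Workspace:WorkspaceWasCreated',
  '{"workspaceName": "user-jdoe", "baseWorkspaceName": "live"}',
  '{"correlationIdentifier": "0c1b2a3d-4e5f-6071-8293-a4b5c6d7e8f9"}'
);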

metadata (TEXT)

Makes complete sense but IMO the __type can be removed.

recorded_at (DATETIME_MICRO)

I'm not sure if this needs to store microseconds, as one should never rely on an order based on that timestamp. But OTOH it doesn't do any harm. Except: it's currently missing the timezone. That might be an issue, but I'm not sure about this one. And, nitpick: the underscore could be removed for consistency.

Global order

We need a deterministic order for all events in the table, independently of the stream they belong to (see https://github.com/neos/Neos.EventStore/issues/1#issuecomment-249173650). Fortunately we can use AUTO_INCREMENT for that. A simple schema that worked quite well for me:

CREATE TABLE neos_eventstore_events (
  id int(11) NOT NULL AUTO_INCREMENT,
  stream varchar(255) COLLATE utf8_unicode_ci NOT NULL,
  number int(11) NOT NULL,
  savedat datetime NOT NULL,
  type varchar(255) COLLATE utf8_unicode_ci NOT NULL,
  payload longtext COLLATE utf8_unicode_ci NOT NULL,
  metadata longtext COLLATE utf8_unicode_ci,
  PRIMARY KEY (id),
  UNIQUE KEY UNIQ_5387574AF0E9BE1CBF1CD3C3 (stream,number)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;
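
For illustration, reading from that table would then look like this (stream name made up):

-- replay a single stream in deterministic per-stream order
SELECT type, payload FROM neos_eventstore_events
  WHERE stream = 'Neos.ContentRepository:Workspace:7f1e8a3c-2d4b-4a6e-9c5d-1b2a3c4d5e6f'
  ORDER BY number ASC;

-- replay all events in deterministic global order, e.g. to (re)build projections
SELECT id, stream, type, payload FROM neos_eventstore_events ORDER BY id ASC;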

Copied from original issue: neos/Neos.EventStore.DatabaseStorageAdapter#7

bwaidelich commented 7 years ago

From @dfeyer on September 23, 2016 17:12

Why do we need that?

stream_name_hash is for querying and for efficient DB indexes. Like in the CR, if you search by stream identifier you should query the hash; it's way faster.
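For illustration, that lookup would be something along these lines (assuming an MD5-based hash as in the CR; stream name made up):

SELECT * FROM es_stream
  WHERE stream_name_hash = MD5('Neos.ContentRepository:Workspace:7f1e8a3c-2d4b-4a6e-9c5d-1b2a3c4d5e6f');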

In the code, commit_version is documented as "Version of the aggregate after event was recorded" and event_version as "Version of the event in the current commit", but I still don't get the difference and why we need it.

Leftover from the previous logic; we can forget the commit_version, and šŸ‘ for an autoincrement value for deterministic ordering.

Except: it's currently missing the timezone. That might be an issue, but I'm not sure about this one.

It's UTC time; the technical storage time doesn't really need a timezone, I think. If the timezone is required, it's a domain responsibility and will be stored in the event payload.

Full šŸ‘ for the other comments

bwaidelich commented 7 years ago

From @albe on September 23, 2016 18:30

Beware of autoincrement, as it forces a full table lock. But yes, we need a monotonically increasing numbering for the events, or at least for each commit. So: autoincrement by default, but maybe allow providing the number from the outside (see the sketch below).
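A concept-only sketch of providing the number from the outside, using the example schema above (racy without a lock; a concurrent writer would hit the PRIMARY KEY conflict and have to retry):

-- compute the next global id in the insert itself instead of relying on AUTO_INCREMENT
INSERT INTO neos_eventstore_events (id, stream, number, savedat, type, payload)
SELECT COALESCE(MAX(id), 0) + 1, 'some-stream', 1, UTC_TIMESTAMP(), 'Some:Event:Type', '{}'
  FROM neos_eventstore_events;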

Regarding the hash: it won't work for stream prefix searches unless we split context/aggregate/id into separate columns and hash each individually. Let's benchmark the real gains.

Otherwise +1 to Dominique's comments.

bwaidelich commented 7 years ago

From @robertlemke on September 23, 2016 18:56

Also agree with Basti's and Dominique's comments.

šŸ‘ for using UTC - and if it doesn't hurt, microseconds sure. šŸ‘ for using a database-native JSON type if possible. That potentially allows us to search even for specific values in a payload or metadata - if we ever want to.

bwaidelich commented 7 years ago

From @dfeyer on September 23, 2016 19:25

Regarding the hash: it won't work for stream prefix searches unless we split context/aggregate/id into separate columns and hash each individually. Let's benchmark the real gains.

That can be done: store them in smaller varchar columns, and then there's no need for a hash for the context, aggregate and id fields.

bwaidelich commented 7 years ago

From @dfeyer on September 24, 2016 12:11

Basically everything is fixed in https://github.com/neos/Neos.EventStore.DatabaseStorageAdapter/issues/7

bwaidelich commented 7 years ago

From @albe on September 24, 2016 20:07

I've put a bit more thought into some of the topics... long comment incoming, sorry!

commit_version / event_version: I still don't get the difference and why we need it

Maybe this comment from Jonathan Oliver (http://blog.jonathanoliver.com/event-store-transaction-integrity-without-transactions/#comment-198259156) explains the intention a bit:

There is a big difference and they are both critically important.

The stream revision is basically the version of your aggregate root. Every single event that your aggregate generates increments the version by one within your aggregate. The reason I call it "revision" and "stream" in the event store is that you can use it to store more than just aggregates. A command to an aggregate may produce one or more events. An event is a domain concept.

A commit is storage specific and is not a domain concept. The idea behind a commit is that there are one _or more_ events that should be committed as a unit. For example, when an aggregate generates 2+ events as a result of a command, the two should be committed together. The reason for the commit sequence is for optimistic concurrency and sorting in storage. By using a commit sequence, I can determine if someone else has committed ahead of me. For example, I load aggregate ID: 789 and so does another process. The other process is finished and saves before I do. When I go to save, it attempts to write but finds another commit with the same commit sequence for aggregate 789. We can then throw an optimistic concurrency exception.

I think the optimistic concurrency check can also be achieved with the event/stream version (writes only ever happen to a single stream), but I think the commit version makes sense for debugging/analytics purposes, in order to distinguish events that got committed together (note, though, that for that it only needs to be unique, nothing more). Otherwise this information is lost (unless we guarantee that all events within one commit have the exact same savedat timestamp, meaning that the EventStore needs to provide that timestamp to the storage).
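As a sketch of that event/stream-version variant with the example schema above, the UNIQUE KEY on (stream, number) turns a conflicting write into a duplicate-key error (values made up):

-- both writers loaded the stream at number 5 and try to append number 6
INSERT INTO neos_eventstore_events (stream, number, savedat, type, payload)
VALUES ('some-stream', 6, UTC_TIMESTAMP(), 'Some:Event:Type', '{}');
-- the slower writer fails on the duplicate (stream, number) pair, which the
-- store can translate into an optimistic concurrency exception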

I think that an event in the table needs a sequential number within its stream as well as a global sequential number

http://stackoverflow.com/questions/2948523/in-cqrs-event-sourced-do-you-need-a-global-sequence-counter-in-the-event-stor

Greg Young's take on it:

Order is only assured per a handler within an aggregate root boundary. There is no assurance of order between handlers or between aggregates. Trying to provide those things leads to the dark side.

I think he has a point: a strict global order implies a causal relation, but that can never be guaranteed between aggregates (unless you do sagas). The only thing that can be said is that there is a chronological order (i.e. the time the events occurred, or, if that should have no meaning in the storage layer, the time the event was stored). So conceptually, the only (fixed) "global order" that can be guaranteed is occurredAt/savedAt, streamName, streamVersion.

But there are also good reasons for a strict global order, if the storage can provide it and it will not bottleneck your system:

http://squirrel.pl/blog/2015/09/14/achieving-consistency-in-cqrs-with-linear-event-store/

That blog post also mentions the usage that I found for the global number (see https://github.com/neos/Neos.EventStore/issues/1#issuecomment-249316966): having fake strongly consistent queries by querying against the global version ID returned by the last command (see the sketch below). So I'd say we keep the global order number as an implementation detail/benefit of using the DatabaseStorageAdapter. But the contract from EventStore to storage should only be commit(events[], ...): commitId. The DatabaseStorageAdapter can then return the strictly monotonic global ID.
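A sketch of how such a fake strongly consistent read could work, against a hypothetical bookkeeping table in which each projection records the last global id it has applied (all names made up):

-- check how far the projection has caught up
SELECT lastappliedid FROM neos_projection_state WHERE projection = 'workspace';
-- poll until lastappliedid >= the global id returned for the command's commit,
-- then query the read model as usual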

Nitpick: I called it "version" before, but actually it's the event number or index within its stream

I'm not so sure about "number" or "index": those are both pretty arbitrary names that don't carry the real purpose, especially considering that we have a single table containing all events globally, while this "number" is relative to a stream. It should at least be streamnumber or something. Sequence also comes to mind as better capturing the purpose.

Also, the more I think about it, the stream-sequential number is actually something that should come from the outside; the only contract it has is to be strictly monotonically increasing. So it basically is the aggregate version in the case of aggregates. I'm starting to like the term revision that Jonathan used: each new event appended creates a new version of the stream as a whole, so the single event is a revision of the stream.

bwaidelich commented 7 years ago

From @albe on September 24, 2016 22:01

An event is a domain concept. A commit is storage specific and is not a domain concept.

One more insight just occurred to me: I finally understand the concept of the "commit" table in Jonathan Oliver's EventStore storage implementation.

The storage only knows "commits", which may be collections of events. Also, a single commit is a single transactional atom: it makes no sense whatsoever to query for only part of a single commit. This in turn means that any query for a specific "version" is always for a commit version (and a query for a specific point in time is in turn for a commit timestamp).

That way, you basically eliminate the need for transactions on the database level (and don't even need the "streams" table at all, other than for additional, eventually consistent metadata).

Now of course that doesn't mean we need to implement it like that. It's an optimization. But it's good to keep the concept in mind.

bwaidelich commented 7 years ago

I went ahead and created a little fixture to benchmark the need for a _hash column (see https://gist.github.com/bwaidelich/e5f6d9dcda79672c4b8202d302d84acf if you want to re-evaluate). If we use a VARCHAR for the stream name (255 characters should be enough for a <BC>:<AggType>:<UUID> name) and an index, I couldn't measure any noticeable performance hit (probably less than the cost of calculating the MD5). I tested with 500,000 events spread over 50,000 aggregates, and the same with only 5 aggregates (and loads of events per aggregate). No difference.
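
For reference: with the example schema from above, the UNIQUE KEY on (stream, number) already doubles as an index on the stream column; a standalone sketch of the indexed-VARCHAR variant would be:

-- index the plain VARCHAR(255) stream name instead of a hash column
ALTER TABLE neos_eventstore_events ADD INDEX idx_stream (stream);

-- lookups then hit the B-tree index directly, no MD5 involved
SELECT * FROM neos_eventstore_events
  WHERE stream = 'Neos.ContentRepository:Workspace:7f1e8a3c-2d4b-4a6e-9c5d-1b2a3c4d5e6f'
  ORDER BY number;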