Add summary about stream strategies

prooph / pdo-event-store

PDO implementation of ProophEventStore http://getprooph.org

BSD 3-Clause "New" or "Revised" License

Add summary about stream strategies #139

Closed codeliner closed 4 years ago

codeliner commented 6 years ago

one table per aggregate: you can just select the entire table and get all events for that aggregate without any WHERE clause. Unique constraints are mostly fulfilled by the table itself already.

one table per aggregate type: you have to search the table with where aggregate_id = ...., which takes more time; also, unique constraints now need to be composite keys.

one table for everything: selecting all events for an aggregate becomes where aggregate_type = .... AND aggregate_id = ...., and unique constraints become composite keys of 3 columns (I think).
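The three layouts can be compared side by side in a minimal sketch. It uses Python's built-in sqlite3 module only so that it runs without a server; all table and column names (user_42_stream, user_stream, event_stream, version, payload) are assumptions made for illustration and are not the schemas pdo-event-store creates.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical schemas; names are illustrative only.
conn.executescript("""
-- 1) One table per aggregate: the version alone is unique.
CREATE TABLE user_42_stream (
    version INTEGER PRIMARY KEY,
    payload TEXT NOT NULL
);

-- 2) One table per aggregate type: composite key of two columns.
CREATE TABLE user_stream (
    aggregate_id TEXT NOT NULL,
    version INTEGER NOT NULL,
    payload TEXT NOT NULL,
    UNIQUE (aggregate_id, version)
);

-- 3) One table for everything: composite key of three columns.
CREATE TABLE event_stream (
    aggregate_type TEXT NOT NULL,
    aggregate_id TEXT NOT NULL,
    version INTEGER NOT NULL,
    payload TEXT NOT NULL,
    UNIQUE (aggregate_type, aggregate_id, version)
);
""")

# Store the same two events for user 42 in each layout.
for version, payload in [(1, "registered"), (2, "renamed")]:
    conn.execute("INSERT INTO user_42_stream VALUES (?, ?)", (version, payload))
    conn.execute("INSERT INTO user_stream VALUES ('42', ?, ?)", (version, payload))
    conn.execute("INSERT INTO event_stream VALUES ('user', '42', ?, ?)", (version, payload))

# Unrelated events that the WHERE clauses have to filter out.
conn.execute("INSERT INTO user_stream VALUES ('43', 1, 'registered')")
conn.execute("INSERT INTO event_stream VALUES ('order', '42', 1, 'placed')")

# Loading the aggregate: no WHERE, one condition, two conditions.
per_aggregate = conn.execute(
    "SELECT payload FROM user_42_stream ORDER BY version"
).fetchall()
per_type = conn.execute(
    "SELECT payload FROM user_stream WHERE aggregate_id = ? ORDER BY version",
    ("42",),
).fetchall()
everything = conn.execute(
    "SELECT payload FROM event_stream"
    " WHERE aggregate_type = ? AND aggregate_id = ? ORDER BY version",
    ("user", "42"),
).fetchall()
```

All three queries return the same two events; what differs is how much of the filtering is done by the table boundary and how much by the composite index.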

source: chat message by @fritz-gerneth

codeliner commented 6 years ago

Follow up discussion:

@Ocramius 22:45 Hmm... This goes against my knowledge of BTREE indexes and computational complexity related to them. False overall... In MySQL (specifically) you get many more issues when opening different tables.

Fritz Gerneth @fritz-gerneth 23:59 Not defending my stance, certainly not against you. I don't think we're comparing against opening multiple DB tables: you load aggregates from one stream only anyway (or hopefully). So the question is which one is the fastest when loading a single aggregate. The other important criterion is probably insert performance (where I expect the one-table-for-everything approach to perform quite badly as well).

Ocramius commented 6 years ago

To clarify, on most RDBMSs opening an index (and keeping it open), updating and re-organising it is much more efficient than having separate indexes.

That's due to the intrinsic computational complexity of tree-based indexes, whose write/access time grows much more slowly than linearly (roughly logarithmically) with the amount of data they index.

Having separate indexes scales linearly (or worse, since you have more index roots) and increases caching issues around anything regarding file descriptors, so it ends up being actually a lot slower.
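The sub-linear growth can be made concrete with a small piece of arithmetic. This is a deliberately simplified model, not a description of any particular engine: the function name btree_levels and the fanout of 100 keys per node are assumptions, and caching and file-descriptor effects are not modelled at all.

```python
def btree_levels(num_keys: int, fanout: int = 100) -> int:
    """Levels a tree with the given fanout needs to hold num_keys keys.

    Simplified model: every node, including leaves, holds `fanout` entries.
    """
    levels, capacity = 1, fanout
    while capacity < num_keys:
        capacity *= fanout
        levels += 1
    return levels


# One shared index over 100 million events.
one_big_index = btree_levels(100_000_000)

# One of a million separate indexes, each holding 100 events.
one_small_index = btree_levels(100)
```

Under these assumptions the shared index is only a few levels deeper than a tiny one, although it holds a million times more entries, while the per-aggregate layout has to keep a million separate index roots around.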

The main reason why the question "why one stream per aggregate" popped up is that I frequently perform cross-aggregate projections and simple debugging, and while filtering a stream is trivial (just use the indexes), merging streams in an RDBMS is a flustercluck with at least O(n) complexity.

Given the capabilities of event-sourcing and DBs, replicating from a single stream to multiple filtered streams is also trivial and efficient, so I don't see any immediate advantages of stream-per-aggregate or stream-per-aggregate-type approaches.
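The asymmetry between filtering and merging can be sketched in a few lines. The event tuples and the global_position field are assumptions for illustration; in a database the filter would be an index lookup rather than a list comprehension.

```python
import heapq

# Hypothetical events: (global_position, aggregate_id, event_name)
single_stream = [
    (1, "user-a", "registered"),
    (2, "user-b", "registered"),
    (3, "user-a", "renamed"),
    (4, "user-b", "deleted"),
]

# Filtering a single stream: select the rows for one aggregate.
user_a = [event for event in single_stream if event[1] == "user-a"]

# Replicating the single stream into per-aggregate streams is one pass.
per_aggregate = {}
for event in single_stream:
    per_aggregate.setdefault(event[1], []).append(event)

# Going the other way needs a k-way merge that touches every event
# of every stream to restore the global order.
merged = list(heapq.merge(*per_aggregate.values()))
```

The merge only works here because each event carries a global position; per-aggregate streams that lack a shared ordering key cannot be merged deterministically at all.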

Advantages of a single stream are instead quite evident to me.

Although you obviously end up with a data lake that goes against the micro-services approach, providing each service with a pre-filtered view or materialised view is an acceptable solution with big advantages for anything that wants to perform cross-boundary projections or even simple reporting/debugging.

codeliner commented 6 years ago

@Ocramius thx

I'd like to add that organisational issues need to be considered as well. For example, if you want to "archive" old aggregates, it is easier if you have a stream for each. Otherwise you would need to modify the "main stream". We provide choices because we don't want to decide which option is the best one for a project. Too many possibilities ...

Also related:

One stream per aggregate: In the original event store implementation by Greg Young (https://github.com/EventStore/EventStore), there is by default one stream per aggregate. That means that not all events related to aggregate type "User" are stored in a single stream, but that we have one stream for each aggregate, e.g. "User-", "User-", "User-", ... This option is also available for prooph-event-store; limiting the usage to disallow this strategy is possible, but not really wanted. To quote Greg Young: "You need stream per aggregate not type. You can do single aggregate instance for all instances but it's yucky"

source: http://www.sasaprolic.com/2018/03/why-there-will-be-no-kafka-eventstore.html

ping @prolic @oqq as well

maybe also @gregoryyoung

Ocramius commented 6 years ago

@codeliner I'm fine with multiple choices, but again, archiving is just a matter of filtering out stuff from the index.

Having data on multiple streams or a single one is a semantic difference, not a practical one (seriously, it's just a DELETE ... WHERE).

In a PostgreSQL DB I'd just move data out into a partition and then split out the partition once the "umbilical cord is to be cut".
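The DELETE ... WHERE variant of archiving could look roughly like the sketch below. It uses SQLite through Python's sqlite3 module so that it runs as-is; the table names, the seq_no column and the archive_aggregate helper are all hypothetical, and PostgreSQL partitioning is not shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE event_stream (
    seq_no INTEGER PRIMARY KEY,
    aggregate_id TEXT NOT NULL,
    payload TEXT NOT NULL
);
CREATE TABLE event_stream_archive (
    seq_no INTEGER PRIMARY KEY,
    aggregate_id TEXT NOT NULL,
    payload TEXT NOT NULL
);
CREATE INDEX idx_aggregate ON event_stream (aggregate_id);
""")
conn.executemany(
    "INSERT INTO event_stream (aggregate_id, payload) VALUES (?, ?)",
    [("user-a", "registered"), ("user-b", "registered"), ("user-a", "closed")],
)


def archive_aggregate(conn, aggregate_id):
    """Move all events of one aggregate into the archive table."""
    with conn:  # one transaction: commit on success, roll back on error
        conn.execute(
            "INSERT INTO event_stream_archive"
            " SELECT * FROM event_stream WHERE aggregate_id = ?",
            (aggregate_id,),
        )
        conn.execute(
            "DELETE FROM event_stream WHERE aggregate_id = ?", (aggregate_id,)
        )


archive_aggregate(conn, "user-a")
remaining = conn.execute("SELECT aggregate_id FROM event_stream").fetchall()
archived = conn.execute(
    "SELECT seq_no, payload FROM event_stream_archive ORDER BY seq_no"
).fetchall()
```

Copying and deleting inside one transaction keeps the main stream and the archive consistent if the operation fails halfway.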

I'm trying to understand ups and downs, and so far I don't see ups in stream-per-aggregate, so I feel like the original reasoning is simply to be dug up.

Good idea to poke @gregoryyoung

gregoryyoung commented 6 years ago

It's not a semantic difference. Consider the case of loading a single aggregate.

Ocramius commented 6 years ago

That's an index lookup, which is faster than opening a new index

gregoryyoung commented 6 years ago

It's normally one table for everything, or you custom-design the DB per aggregate type, e.g. have a table named transactions, then join to the rest. The latter is common in SQL-centric systems.

basz commented 6 years ago

@codeliner 22:47

Our default is: One stream for all events if the service is relatively small or one stream per aggregate type for bounded contexts with a lot of logic spread across different aggregates.

Do you have a method to measure and quantify "relatively small"?

Also, is there any tooling to switch between strategies, making it less important to get it right from the start?

prolic commented 6 years ago

Okay, I'll try to give a more detailed explanation, feel free to add this to the docs:

Aggregate Stream Strategy: You get one stream per aggregate, which means you'll end up with lots of tables. This is really fast for loading an aggregate from events, because you can simply load the whole table. It works really well when you build your read models using the event-publisher and event-bus (maybe also async with amqp). But it has a big disadvantage when using the provided projector-implementation, because it has to query all existing tables in a loop, which can take multiple hours when you have enough aggregates. So when you are building projections from event-publisher directly, this is the best way to go. If you are using the provided projector-implementation, DO NOT USE this strategy!
Single Stream Strategy: This can be used for one stream for all aggregates or one stream per aggregate type. I would recommend the latter. It works really well with the provided projector-implementation and has few disadvantages. It might be slightly slower than the aggregate stream strategy for loading an aggregate from events, but the difference is hardly noticeable and can be ignored. USE THIS strategy when you're doing event-sourcing with the provided projector-implementation.
Simple Stream Strategy: This is basically the same as Single Stream Strategy with one difference: it has no constraints, so you can add events with the same version multiple times and so on. This can be used if you don't want to have the optimistic concurrency check on the event-store itself (because you either don't need it at all or you want to use a different external locking mechanism). It can also be used for stream-to-stream projections where a unique version constraint is not wanted.
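How a unique version constraint acts as an optimistic concurrency check, and what dropping it means, can be shown with a minimal sketch. The two tables and the append helper are hypothetical stand-ins run against SQLite; they are not the tables or API that the prooph strategies actually create.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- With a unique version constraint.
CREATE TABLE constrained_stream (
    aggregate_id TEXT NOT NULL,
    version INTEGER NOT NULL,
    payload TEXT NOT NULL,
    UNIQUE (aggregate_id, version)
);
-- Without it (the idea behind a constraint-free stream).
CREATE TABLE simple_stream (
    aggregate_id TEXT NOT NULL,
    version INTEGER NOT NULL,
    payload TEXT NOT NULL
);
""")


def append(table, aggregate_id, version, payload):
    """Return True if the event was stored, False on a version conflict.

    The table name is interpolated directly, so it must be trusted input.
    """
    try:
        with conn:
            conn.execute(
                f"INSERT INTO {table} VALUES (?, ?, ?)",
                (aggregate_id, version, payload),
            )
        return True
    except sqlite3.IntegrityError:
        return False


# Two writers both try to store version 1 of the same aggregate.
first = append("constrained_stream", "user-a", 1, "registered")
conflict = append("constrained_stream", "user-a", 1, "registered-again")

# Without the constraint both writes are accepted.
simple_first = append("simple_stream", "user-a", 1, "registered")
simple_duplicate = append("simple_stream", "user-a", 1, "registered-again")
```

In the constrained table the second writer is rejected by the database itself; in the unconstrained one, any conflict detection has to come from an external locking mechanism.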

Tooling to switch strategies in v7 does not exist, but it is simply a matter of reading all events and writing them back to a new stream (probably using two distinct event-store instances); it should be simple to write in an hour or so.
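The read-everything-and-write-back idea could be sketched as follows, with two SQLite connections standing in for the two event-store instances. The schemas, the created_at column and the migrate function are assumptions for illustration, and ordering the combined stream by timestamp is itself an assumption that a real migration would need to validate.

```python
import sqlite3

source = sqlite3.connect(":memory:")  # stands in for the old event store
target = sqlite3.connect(":memory:")  # stands in for the new event store

# Source: hypothetical one-table-per-aggregate layout.
for table in ("user_a", "user_b"):
    source.execute(
        f"CREATE TABLE {table} ("
        "version INTEGER PRIMARY KEY, created_at TEXT NOT NULL, payload TEXT NOT NULL)"
    )
source.executemany(
    "INSERT INTO user_a VALUES (?, ?, ?)",
    [(1, "2018-03-01T10:00:00", "registered"), (2, "2018-03-03T09:00:00", "renamed")],
)
source.executemany(
    "INSERT INTO user_b VALUES (?, ?, ?)",
    [(1, "2018-03-02T12:00:00", "registered")],
)

# Target: hypothetical single-stream layout with a composite unique key.
target.execute("""
CREATE TABLE event_stream (
    seq_no INTEGER PRIMARY KEY AUTOINCREMENT,
    aggregate_id TEXT NOT NULL,
    version INTEGER NOT NULL,
    created_at TEXT NOT NULL,
    payload TEXT NOT NULL,
    UNIQUE (aggregate_id, version)
)
""")


def migrate(source, target, tables):
    """Copy all events from per-aggregate tables into one stream."""
    events = []
    for table in tables:
        rows = source.execute(f"SELECT version, created_at, payload FROM {table}")
        events.extend(
            (created_at, table, version, payload)
            for version, created_at, payload in rows
        )
    events.sort()  # by created_at, then table name, then version
    with target:  # write everything in one transaction
        target.executemany(
            "INSERT INTO event_stream (aggregate_id, version, created_at, payload)"
            " VALUES (?, ?, ?, ?)",
            [(table, version, created_at, payload)
             for created_at, table, version, payload in events],
        )
    return len(events)


migrated = migrate(source, target, ["user_a", "user_b"])
order = target.execute(
    "SELECT aggregate_id, version FROM event_stream ORDER BY seq_no"
).fetchall()
```

The sketch loads all events into memory before writing, which is fine for a demonstration but would need batching for a store of realistic size.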

Another side-note: I am silently working on a prototype of event-store v8 (don't ask me for any planned release date yet, it's in very, very early stages of development). The plan here is to remove persistence strategies completely and provide a completely different way of solving some of the issues described above. It's somehow a mix of Single- & Simple Stream Strategy, where you can have optimistic concurrency checks if you want to, or you can go without them (without switching any db-tables).

oqq commented 6 years ago

I almost always use the "Single Stream Strategy" with one stream per aggregate type in my projects. Maybe there are some performance benefits to using only one stream table, but this would be a mess to debug compared with my current workflow. It is also the easiest one for someone coming to event sourcing from plain old db table layouts, so the barrier to entry is much lower for peers.

One stream per aggregate has always been strange to me, and I see only disadvantages in using this strategy, particularly if the storage is MySQL.

We should retain the current flexible model and provide a way to use own strategies, but also we should add some tools to configure all instances and give some best practice defaults. How about a SimpleEventStoreFactory, which only needs some config params like

which database to use (MySQL, Postgres)
connection params (user, password)

and all config jobs are done. If someone needs more options, there is always a way for it.

One goal of next versions should be a smarter and easier way to configure the whole prooph stack, without removing flexibility for "power users".

fritz-gerneth commented 6 years ago

Using the MySQL Single Stream Strategy over here as well (or a slightly modified version of it), pretty much for the same reasons @oqq stated. The only change we made is to allow longer aggregate-ids to be stored, since not every ID is a UUID4.

For me, the event store is only that and projections while easy to use are not the primary use-case for it. Hence, for me two scenarios matter (from a performance perspective):

Load a single aggregate
Append to a single aggregate

If opening a table index is more expensive than searching a large table, this certainly would be a strong point against the aggregate stream strategy (for both read & write). Maybe the SingleStreamStrategy provides the middle ground on this, maybe it performs worst as it has to do both :) For writing: I haven't had a look at insert performance for either strategy. Could index rebalances become an issue here eventually? How would replication settings influence write speed?

Projections do add a third metric for this: cross-stream loading of aggregates. I don't have any performance benchmarks on this. I assume the number of streams & aggregates has a very high influence here.

Either way, I think this is an implementation detail for the store itself, depending on how it is used (e.g. for projections or not). We went for the SingleStreamStrategy with real names (instead of hashed ones) simply for simplicity and have not had a reason to switch yet. Knowing that it is rather simple to move from one strategy to another, this choice is less important at the beginning of a project. I also found this strategy to be the easiest to understand when starting out.

gregoryyoung commented 6 years ago

Stream per aggregate is the default strategy; usually it's just an FK representing the stream.

codeliner commented 6 years ago

thx for this great discussion! We have many different POV here and I'll try to add each to the docs.

Let me summarize:

@Ocramius looks at the question from an RDBMS point of view. He has deep knowledge of databases and his argument about database internal mechanics is something we have to keep in mind. For me this goes hand in hand with the idea that you split a system into bounded contexts and each context becomes an independent service with its own database. If you can use such an architecture, using a single stream for each BC is a very good idea IMHO. In a monolith a single stream might be way too messy.

@prolic and @gregoryyoung favor one stream per aggregate and this view makes a lot of sense, too. If our MySql/Maria/Postgres event store is used w/ the OneStreamPerAggregate strategy it mimics the internal mechanics of Greg's EventStore, but has some disadvantages compared to it:

Our projections are not built-in, so we have a hard time consuming that many different streams with a projector. A single event stream for all aggregates is the exact opposite. The projector can process all events in order, so you don't need to worry about out-of-order events like you would with a message broker in between. Obviously the projector can become a bottleneck depending on the write load of the system.
The second issue is more related to DBAs. It is just way too crazy for most people to look at a database and see hundreds of thousands of tables in it. It's a different story with EventStore because that's a different type of database, so people are not influenced by their past work. The problem here is: many prooph users are new to the concept, and forcing them to use different infrastructure would probably stop them from diving deeper into the topic. It's cool to use existing infrastructure and spin up a first small event-sourced service to get experience with the topic.

@fritz-gerneth and @oqq use a pragmatic approach and try to balance performance against ease of use and understanding. I'm leaning in the same direction. While I'd love to use OneStreamPerAggregate, it's not my default. If a system needs to handle high throughput I would consider that strategy, but not without proper load tests. So far the performance of the single stream or stream-per-aggregate-type strategy has been more than enough, but I'm mostly working on B2B projects where you can calculate the load of the system upfront. Personally I include the projection side in the pros and cons because that's an important part of the system. prooph's projections are a very simple solution to a not so easy problem. They are not the best choice for high throughput, but I'd only consider another technique if it is really needed.

joshdifabio commented 6 years ago

Regarding the stream-per-aggregate approach with MySQL, has it been considered to change the database schema so that we can rely purely on DB queries to tell us which projections are out of date instead of looping over them all in PHP?

Here is a quick example -- I'm not sure about the scalability of the example queries at the end.

CREATE TABLE events (
  id CHAR(36),
  stream_id CHAR(36),
  sequence_no INTEGER,
  payload TEXT,
  PRIMARY KEY(id),
  KEY (stream_id, sequence_no)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_bin;

CREATE TABLE projection_stream_subscriptions (
  projection_id CHAR(36),
  stream_id CHAR(36),
  current_sequence_no INTEGER,
  PRIMARY KEY (projection_id, stream_id),
  KEY (stream_id, current_sequence_no)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_bin;

# Get events which need to be processed by projections, as well as projection IDs
SELECT s.projection_id, e.stream_id, e.payload
FROM events e
JOIN projection_stream_subscriptions s ON s.stream_id = e.stream_id AND e.sequence_no > s.current_sequence_no
LIMIT ...;

# Just get the projections which need to be run
SELECT DISTINCT s.projection_id
FROM events e
JOIN projection_stream_subscriptions s ON s.stream_id = e.stream_id AND e.sequence_no > s.current_sequence_no
LIMIT ...;

The queries could obviously also be modified to join to the projections table as well if necessary.
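To see what the two queries return, the proposed schema can be tried out on a few rows. The sketch below adapts it to SQLite so that it runs through Python's sqlite3 module: the MySQL-only KEY and ENGINE clauses are replaced by CREATE INDEX statements, the elided LIMIT is left out, and an ORDER BY is added only to make the output deterministic. The sample IDs (e1, stream-a, proj-1, ...) are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE events (
  id CHAR(36) PRIMARY KEY,
  stream_id CHAR(36),
  sequence_no INTEGER,
  payload TEXT
);
CREATE INDEX events_stream ON events (stream_id, sequence_no);

CREATE TABLE projection_stream_subscriptions (
  projection_id CHAR(36),
  stream_id CHAR(36),
  current_sequence_no INTEGER,
  PRIMARY KEY (projection_id, stream_id)
);
CREATE INDEX subscriptions_stream
  ON projection_stream_subscriptions (stream_id, current_sequence_no);
""")

conn.executemany("INSERT INTO events VALUES (?, ?, ?, ?)", [
    ("e1", "stream-a", 1, "created"),
    ("e2", "stream-a", 2, "updated"),
    ("e3", "stream-b", 1, "created"),
])
conn.executemany("INSERT INTO projection_stream_subscriptions VALUES (?, ?, ?)", [
    ("proj-1", "stream-a", 1),  # one event behind on stream-a
    ("proj-2", "stream-a", 2),  # up to date
    ("proj-2", "stream-b", 1),  # up to date
])

# Events that still need to be processed, with the projection they belong to.
pending = conn.execute("""
SELECT s.projection_id, e.stream_id, e.payload
FROM events e
JOIN projection_stream_subscriptions s
  ON s.stream_id = e.stream_id AND e.sequence_no > s.current_sequence_no
ORDER BY s.projection_id, e.stream_id, e.sequence_no
""").fetchall()

# Just the projections that need to be run.
stale = [row[0] for row in conn.execute("""
SELECT DISTINCT s.projection_id
FROM events e
JOIN projection_stream_subscriptions s
  ON s.stream_id = e.stream_id AND e.sequence_no > s.current_sequence_no
ORDER BY s.projection_id
""")]
```

Note that a projection is only found if it already has a subscription row for the stream in question; in this sketch proj-1 has no row for stream-b, so the event on stream-b is never reported for it.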

prolic commented 6 years ago

That's a BC break and will not be done. I'm already working on the next major release, which solves this problem.
