questions - Githubissues

ghost commented 8 years ago

hi

2 questions

1) ordering of events and clock drift

sqlalchemy_stored_events sorts by a Sequence() which seems to be clock-drift-proof
cassandra_stored_events sort by a timeuuid which mean it is vulnerable to a cassandra server clock drift right?

2) concurrency

if multiple processes are working on the same entity at the same time, will there be concurrency problems?

johnbywater commented 8 years ago

hi

2 answers :)

1) correct, but actually it's vulnerable to clock drift on the application machines because the timeuuid value of the domain event id in generated in the web app rather than by the database, so it's whatever time the application process thinks it is. This aspect is something that would very interesting to work on, the code doesn't yet have causality-preserving hybrid logical clocks :). Things getting jumbled up would be a problem if the time in different processes was comparable to the time period of domain events, so the stored events get jumbled up due to time noise. But if the time noise is smaller than the time between domain events, then perhaps it isn't a problem. Also since we effectively shard the application event stream across individual aggregates, it's the time period between events in a single aggregate compared with the time noise between different processes that is important.

2) there most certainly can be concurrency problems. Sometimes i think whenever a hard problem comes up, like multiple processes working on the same entity, then if it can be avoided by reworking the model, i try to do that. But if you must, for example with an event driven collection of items, it can make things easier to ask yourself if you can stop caring about the order in which events within that entity happen, which might mean making the mutators robust against things being jumbled up, or just accepting the way things turn out if it isn't otherwise broken. An example could be an event sourced set. What happens if two processes each add an item to an event sourced set, and the clocks are slightly out, so the events get "jumbled up"? Does it matter when you get the entity you always have both items in the set? Perhaps it does. You have to figure it out in each case. If there's a problem you have to try to rework the model design so it doesn't break in that way. Return to the domain and see if there'a another angle to take. It might sometimes be necessary to serialise entity operations with a lock.

ghost commented 8 years ago

what do you think of the optimistic concurrency control described in the steps 1.2.3.4. at page 12 of this thesis.

imho it would solve both problems:

clock drifting
concurrency

johnbywater commented 8 years ago

i think it's interesting, I didn't know somebody had written an academic paper about event sourcing with nosql dbs :) i'm sure it would be useful to have optimistic concurrency control as an option in this code.

The way it's done in the paper depends on storing version numbers in a separate table, and I wonder how concurrency would be controlled there? I also feel it's important to support a more degenerate situation, where domain events are always involved in a versioned entity. Some of the events aren't really part of a versioned sequence, for example the additions to a log are often just a sequence and you don't need to go from the absolute start to get value from it. So I'd rather not make everything need to work with version numbers. Having said that, I would think it would be equivalent to the scheme in the paper for this code to check before appending an event (from a versioned entity) that the last stored event from that entity has an entity version that is one less than the event to be appended. At the moment, the code can't do that, and inconsistencies are realised when the entity can't be reconstituted from an inconsistent (branched) event stream. So another option is to auto-merge divergent event streams.

Anyway, optimistic concurrency control would add a lot of value to this code, because it would avoid things getting into a mess, allow operations to back off and retry rather than continuing as they would currently do. It should be quite easy. There's already a fast method for getting the last event of a given entity.

I naively thought it would be possible to code this into a domain entity: I guess, in the method body, you'd just check before you publish the event. But checking involves getting things from the event store, and an entity isn't really very well placed to do that. So the check would have to be in the event store, and to make it optional the domain entity class could have an attribute that can be set on subclasses, and used by the event store to decide when to apply the control. It wouldn't involve the extra table, but it would probably work just as well. Do you think that would be a useful feature?

ghost commented 8 years ago

yes the only check for concurrency is in 2) and i think the solution is to reissue the command:

diagram-1

ghost commented 8 years ago

I think it would be great if it would use the exact db scheme with 3 tables shown in the pdf and no timestamps. for the events not versioned like logs, i don't know maybe they could be put elsewhere. the more specific is a table the easiest it will be to have developers grasp the whole thing and add more backends like dynamo, titan/tinkerpop etc.. imho

johnbywater commented 8 years ago

Nice diagram, am going to try with that, the suffix trees I'm working on in a feature branch need it :)

johnbywater commented 8 years ago

The three table scheme could be implemented in a stored event repo? To support implementing concurrency control in different ways, we could have a method e.g. get_last_version() on the StoredEventRepository which could in some implementations get the latest version number from an entity table, and other implementations could get the last event and read the last version number from that? I'm sure the three table scheme has advantages but I don't really want to completely change e.g. the Cassandra stored event repo because it works quite well and optimistic concurrency control could be done with the current scheme quite easily. Some documentation about how to add a new stored event repo might be helpful for developers to grasp the whole thing. Having different schemes would demonstrate the flexibility of the design?

ghost commented 8 years ago

imho one thing you probably want to avoid is this: (left flow is Process A right flow is Process B)

diagram-2

ghost commented 8 years ago

for the separation of tables, I go with the concept of information hiding "Hide x in xxx.." as written in the book Growing Object-Oriented Software, Guided by Tests

screenshot

........ "Hide Snapshots in the 'snapshots' table " is more in line with this imho than "Hide Snapshots in the 'stored_events' table"

johnbywater commented 8 years ago

Hehe, this code doesn't save events in batches. I just added #15 - it simply checks the last stored event version is one less than the event it's about to append, immediately before the new event is appended. There could still be a race condition that leaves an implicit branch (possibly unmergeable) in the event stream, but I guess it's called "optimistic" for that reason. The smaller the interval between the check and the write, the smaller the chances of writing a branch. So I think it would be more of a concern where events get stacked up whilst various operations are going on before being written out, since this code isn't capable of doing that. :-)

johnbywater commented 8 years ago

Thanks also for your last comment about OO. I tend to agree, that's why since a snapshot of an entity is really matter of an entity snapshot being taken, it seemed to me that the object of concern with snapshots is an event that belongs in the stored events table. I'm not really against having a separate table for snapshots, I just tried to keep the infrastructure as simple as possible. I think perhaps I didn't decouple enough my snapshot implementation from the more important responsibility of getting event sourced entities, so that others could switch in something they prefer. What do you think about that?

ghost commented 8 years ago

save in batch is a must imho to be 100% concurrency-proof. like in the point 3. page 12 of the thesis. In fact saving in batch is related to basic ES imho.

ghost commented 8 years ago

yes snapshots could be decoupled. the fact that make_stored_entity_id() exists means the concept of entity could include many things. Maybe use more precise ubiquitous language like 'aggregate', 'snapshot' instead of 'entity' in this case would help decouple...

ghost commented 8 years ago

with #15 is there a guarantee the commands issued in parallel from different processes are executed in serial when working on the same aggregate?

johnbywater commented 8 years ago

why do you think saving events in batches in a must to be 100% concurrency proof?

regarding decoupling snapshots, i made event player configurable on the event sourced repository class (#16), but think perhaps it would be better coded as a strategy, so might change it to do that

with #15 there are no guarantees because there isn't any locking, so that if a check is made and then an event is written, but another process writes an event after the check is made and before the event is written, then the event stream for that aggregate would be implicitly branched, and effectively inconsistent if the branch couldn't automatically be merged - but your question doesn't describe a situation supported by this code: issuing a command is effectively the same as calling method, so it isn't possible for two processes to issue a command and for those commands to be executed in series because that implies messages crossing the process boundary, and this code doesn't do things like that at the moment. ?

ghost commented 8 years ago

by 'in batch' I mean be sure an aggregate has finished emitting and saving all his events before another process can start saving his owns events for the same aggregate. Example, If an aggregate has 500 events to save during 4 seconds, I want to be sure a second process cannot come and write an event between the 419th and 420th event, thinking no other process is actually looping writing its 500 events.

ghost commented 8 years ago

My perception of the current system vs the thesis is this:

diagram-3

johnbywater commented 8 years ago

Hehe :) but don't you think they are more or less equivalent if in the thesis we assume the number of events involved in a single "save" is always 1? That is the case in this code, and also in DDD a command or an operation on an entity may cause an event to be emitted. That event could be observed, and the observer could run a command that causes another event, but your diagram is about events in the same aggregate. You can code an operation to publish more than one event, but that kinda suggests such a design could be improved by finding a single name for what happened, and putting the pertaining values in that. Which leads to having batch sizes of 1? What is gained by having a save() method and keeping back and then later saving a bunch of events? Overlooking the time it takes to perform the operations that lead to there being several events to save, writing many events would always take longer than writing one event, so having a longer duration, the technique of bunching events seems more likely to cause contention issues?

ghost commented 8 years ago

the batch is not the problem, there is no need to stack events and commit them at the end to get the same problem; A stream represents a whole state of an aggregate, imho one must be sure the stream is saved sequentially and uninterrupted by inserts from other processes so ultimately the events can be replayed to respect possible invariants in the aggregate. So when the next process comes and replay the events, the invariants inside the sourced aggregate are ok and the validation of the sourced object succeeds. Knowing the object is valid, it can work on it from there. Not all versions of aggregates are valid to work on. That's why the last version of a valid aggregate to work on is in the other table imho.

ghost commented 8 years ago

and yes the events can leads the process manager to emit other commands.

johnbywater commented 8 years ago

please could you clarify why having the version number in one place or another in itself makes any difference? It seems to me that if you want to serialise access to an entity then you need locks, and without locks and with more than one thread it may happen that the entity is simultaneously retrieved twice, that the last version is simultaneously checked twice, and then both simultaneously try to write a new event at the same version number, thinking they both have the latest valid version - which is something that would happen with decreasing probability the closer together are the check for the last version and the writing of the next event. But if the domain model is not entirely designed to avoid such a situation, e.g. there isn't any locking, then towards making things more reliable, the situations which could be deemed as causing a "broken" event stream could instead be reconceived as causing a "branched" event stream, and the mutator functions could be coded to support auto-merging that (perhaps that's a kind of "eventual consistency"?), or subscribers could implement a policy to resolve conflicting branches e.g. whenever an item in a shop is sold twice, then refund one of the buyers (that's how Amazon works, apparently)

ghost commented 8 years ago

the amazon problem is related to the Read Model, it's not the same part of the flow as our discussion is about the Write Model imho:

amazon-problem

ghost commented 8 years ago

the valid version to work on can be anywhere. its sole presence in the aggregate table means it is the last valid version number to work on.

ghost commented 8 years ago

there has to be no eventual consistency/merging/conflict resolving/locks in the write model. serialized access to an aggregate and ordered events is all what it is needed

johnbywater commented 8 years ago

hehe i like your diagrams :) do you perhaps have a diagram showing how putting the version number in the aggregate table serializes access to an aggregate?

ghost commented 8 years ago

?

johnbywater commented 8 years ago

it seems that we have isolated the concern to be about serialisation of access to an entity, and your proposal seems to be to have an aggregate table that has the version of the aggregate. Could you suggest how access is serialized, and what constitutes the valid version to work on. Is there always a valid version to work on? Does the valid version have anything to do with serialisation of access to an entity? I'm not trying to disagree with you, just trying to understand what you are proposing :-)

johnbywater commented 8 years ago

ah maybe they don't have anything to do with each other, the last version, and serialising access to the entity? did i get them mixed up?

ghost commented 8 years ago

screenshot-4

johnbywater commented 8 years ago

@pouledodue I reimplemented the optimistic concurrency control so that it doesn't suffer from the "time of check to time of usage" bug. I wanted to thank you for prompting about this and making me work a bit harder on it. The solution is optional and is currently not enabled by default. If you had any feedback about the new implementation, I would appreciate it. Hope all's well with you.

pyeventsourcing / eventsourcing

questions #13