pyeventsourcing / eventsourcing

A library for event sourcing in Python.
https://eventsourcing.readthedocs.io/
BSD 3-Clause "New" or "Revised" License

Event migration #103

Closed. johnbywater closed this issue 5 years ago

johnbywater commented 7 years ago

There are five approaches... it might be useful for the library to support them.

johnbywater commented 7 years ago

From https://news.ycombinator.com/item?id=13339972

"For migration of immutable events, there's a good research paper[1] that outlines five strategies available: multiple versions; upcasting; lazy transformation; in-place transformation; copy and transformation. The last approach even allows you to rewrite events into an entirely new store."

[1] The Dark Side of Event Sourcing: Managing Data Conversion http://files.movereem.nl/2017saner-eventsourcing.pdf
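
For example, the upcasting strategy might look roughly like this (a minimal sketch; the event shape and version numbers are made up, and nothing here is from the library):

```python
# Minimal sketch of "upcasting": old stored event versions are rewritten to the
# current schema as they are read, before being applied to an aggregate.

def upcast_v1_to_v2(state: dict) -> dict:
    # Suppose version 2 split a single 'name' field into first and last names.
    first, _, last = state.pop("name", "").partition(" ")
    state["first_name"] = first
    state["last_name"] = last
    state["class_version"] = 2
    return state

UPCASTERS = {1: upcast_v1_to_v2}

def upcast_to_current(state: dict) -> dict:
    # Apply upcasters in sequence until the stored state reaches the current version.
    version = state.get("class_version", 1)
    while version in UPCASTERS:
        state = UPCASTERS[version](state)
        version = state.get("class_version", 1)
    return state
```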

johnbywater commented 7 years ago

I suppose it would be useful to describe these five strategies in the library documentation?

johnbywater commented 6 years ago

Support for mapping event topics before resolving to a class would allow classes to be renamed and moved to different packages.
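
Something along these lines, perhaps (illustrative only; the topic strings and the mapping are invented, not the library's actual code):

```python
# Remap old event topics to new ones before importing, so event classes can be
# renamed or moved between packages without breaking old stored events.
from importlib import import_module

TOPIC_MAP = {
    # old topic -> new topic (both hypothetical)
    "oldpackage.model#Order.Created": "myapp.domain.orders#Order.Created",
}

def resolve_topic(topic: str):
    topic = TOPIC_MAP.get(topic, topic)           # apply any rename/move first
    module_name, _, qualname = topic.partition("#")
    obj = import_module(module_name)
    for name in qualname.split("."):
        obj = getattr(obj, name)                  # walk e.g. Order.Created
    return obj
```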

julianpistorius commented 6 years ago

This might be useful: https://leanpub.com/esversioning

I haven't read it yet, but it looks promising.

johnbywater commented 6 years ago

Thanks for the link. I'll try to read the book.

okeyokoro commented 5 years ago

@johnbywater does this library support change data capture (CDC)? I think it might be an event propagation strategy you would be happy with, perhaps integrating with something like bottledwater-pg or Debezium.

johnbywater commented 5 years ago

Thanks for the comment @okeyokoro. I don't know what CDC is, but we can talk about it :-)

okeyokoro commented 5 years ago

First things first, @johnbywater thanks for making this awesome library 🎯! I listened to your talk on Podcastinit and even took notes. You've really helped me make sense of event sourcing and DDD.

Now, on the topic of CDC;

Change Data Capture (CDC) effectively means replicating data from one storage technology to another. To make it work, we need to extract two things from the source database, in an application readable data format:

  • A consistent snapshot of the entire database contents at one point in time
  • A real-time stream of changes from that point onward — every insert, update, or delete needs to be represented in a way that we can apply it to a copy of the data and ensure a consistent outcome.

At some companies, CDC has become a key building block for applications — for example, LinkedIn built Databus and Facebook built Wormhole for this purpose.

Kafka 0.9 includes an API called Kafka Connect, designed to connect Kafka to other systems, such as databases. A Kafka connector can use CDC to bring a snapshot and stream of changes from a database into Kafka, from where it can be used for various applications (in our use case, acting as a notification log that other downstream services and datastores can poll to get new data, just as Vaughn Vernon prescribed).

— the above is an excerpt from Martin Kleppmann's "Making Sense of Stream Processing", O'Reilly
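
To make the second point above concrete, applying such a change stream to a copy of the data could look roughly like this (a toy sketch; the change-record shape is invented, and real CDC tools such as Debezium use richer envelopes):

```python
def apply_change(copy: dict, change: dict) -> None:
    # Apply one row-change event to a local copy of a table.
    op, key = change["op"], change["key"]
    if op in ("insert", "update"):
        copy[key] = change["row"]
    elif op == "delete":
        copy.pop(key, None)

replica = {}  # in practice, start from a consistent snapshot of the source
for change in [
    {"op": "insert", "key": 1, "row": {"name": "Ada"}},
    {"op": "update", "key": 1, "row": {"name": "Ada Lovelace"}},
    {"op": "delete", "key": 1},
]:
    apply_change(replica, change)
```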

You can get a free copy of the book here

I highly recommend the book; the author lays things out in a very clear, simple, and straightforward manner, making it a pleasure to read. I'm actually on my second reading 😊

johnbywater commented 5 years ago

Thanks for the compliments! I'm sorry for any misperceptions I am responsible for :-). If you listened to that podcast, then you'll know at the end I said I wasn't sure if the library was finished, because I was still not settled on the distributed system thing, on propagating and projecting application state in a reliable way. Well, since then I did quite a lot of development on that topic. It's all written up under the "Distributed system" page of the documentation. https://eventsourcing.readthedocs.io/en/stable/topics/process.html

The summary conclusion is the "process event" pattern. I didn't see that pattern anywhere else, and after asking around (especially after explaining literally everything about this to Eric Evans over a couple of days following the 2019 DDD EU conference in Amsterdam) I think it's genuinely a new pattern. https://eventsourcing.readthedocs.io/en/stable/topics/process.html#process-event-pattern

Furthermore, I was looking around for antecedents for this topic and, after looking at a lot of things, I found the book Process and Reality by Alfred North Whitehead provides a metaphysics in which the actual entities are occasions of experience. That means we can confidently say the actual world is built up from events (not instances of substance-quality categories). We can also now look back and see how much the conventional analysis that we use in software is affected by the Greek substance-quality categories. Virtually all the old literature about OO obeys that mistaken form. In fact, since Process and Reality was published in 1929, and even people like Turing read Whitehead in their youth, the entire OO genre could have been started on the more real understanding of the world that Whitehead describes in his book (that the actual entities are occasions of experience, or events). But it remains that without being explicit about "what happens", about the events, there will always be a shortfall or an undersupply of definition of the object of consideration. That's because all that can happen is an event, so if that reality is ignored by a proposition, that proposition will probably be inadequate in one way or another. I think that's why distributed systems are said by everybody to be so hard: having misunderstood the world, it's been quite hard to make things work. I'm also increasingly feeling that's why we're still talking about the "agile" approach: the word "agile" is an adjective and so can be recognised as a quality that a substance-quality category can have, which is basically an inadequate concept, and therefore a weak proposition (weak relative to talking directly about whatever events are proposed, which is the only thing anybody can actually do).

The next thing I did after reading Whitehead was to try to identify the events of event processing. I don't mean the domain events that may be consumed or produced by the event processing, but rather the process of event processing itself, understood as a sequence of processing events. What is perceived after the process event has passed is whatever records it leaves behind, and so to make a reliable system, all that is needed is for the facts resulting from the processing event to be recorded atomically (so they are either all recorded and the processing is perceived to have happened once and correctly, or alternatively nothing is recorded and there is no perception that anything was processed). And counting, you also need counting. Anyway, I'd got to the same conclusion by using Deleuze and Guattari's formula in Anti-Oedipus: "consumption with recording determines production". Deleuze read Whitehead's book, and said it was a masterpiece, and I think his formula amounts to the same thing as Whitehead's conception: Deleuze and Guattari's formula is really a reformulation of what Whitehead said.

Anyway, that's now my starting point for considering any distributed system, or component of such a thing, since you can get to any other design for propagating and projecting application state by deviating to a greater or lesser degree from the "process event" pattern. The CDC thing seems like a way to generate database row change events. That might be all you can do with traditional CRUD ORM database tables, and with the database you would have after following the Patterns of Enterprise Application Architecture. But those row events aren't atomic domain events and, worse still, they aren't atomic process events, and it's atomic process events that we need to make a distributed system reliable. So I can imagine consuming a CDC stream in an eventsourcing application, writing database row events into an event sourced application that effectively replicates the state of an application into an event sourced domain model that has one event called RowEvent. That could in turn be projected into the original database schema, so that the database is effectively replicated (but unless there was some other benefit, it would perhaps be better to do that directly, rather than using a domain model coded with this eventsourcing library). However, such event-sourced domain model events could be processed to detect fraud or generate reports or something. So I would think there would be some value in supporting that. I guess the table row events could be used to inform an event sourced model of the same domain, but it seems like a difficult position to escape from, where simple things will be straightforward, but more complicated operations will make things quite difficult, and some things will perhaps be completely intractable.
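
As a very rough illustration of that RowEvent idea (plain Python, not this library's API, and all the names are invented):

```python
from dataclasses import dataclass, field
from typing import Any, Optional

@dataclass(frozen=True)
class RowEvent:
    table: str
    key: Any
    op: str               # "insert", "update" or "delete"
    row: Optional[dict]   # new row values, or None for deletes

@dataclass
class TableReplica:
    events: list = field(default_factory=list)  # append-only sequence of RowEvents
    rows: dict = field(default_factory=dict)    # projection back into table form

    def record(self, event: RowEvent) -> None:
        self.events.append(event)                # keep the event
        if event.op == "delete":
            self.rows.pop(event.key, None)       # and project the current state
        else:
            self.rows[event.key] = event.row
```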

In general, without having been explicit about "what happens", trying after the fact to identify what happened will involve time and energy to eliminate disorder that was probably generated unnecessarily. We can understand that happens now simply because the modelling analysis ("thinking") was conditioned by the inadequate substance-quality categories of the Greeks, rather than by Whitehead's monadology in which the actual entities are occasions of experience (events). Event sourcing and event storming, and actually the original conception of pattern language, were all indirectly (but definitely) conditioned by Whitehead's concept of the event. It just seems to be a fact that most of us simply don't happen to know that. That's why we hear that event sourcing only applies in some cases. It's false timidity. In fact, CRUD simply does not make for state that can be readily propagated (hence CDC?).

That's more or less how I feel now when I read Fowler or Booch etc. It would have been better if they had read Whitehead when they were inventing OO and publishing their questionable patterns in copyrighted books. Alan Kay seems to intuit some of Whitehead's concept, for example when he talks about messages that are not sent to a particular recipient. And the software object patterns are generally good. But the whole subject is conditioned by a classical analysis that is mistaken, and which was repaired by Whitehead in the 1920s, so I think we can do better. We can be polite and entertain their confident propositions as receptive students, but it seems to me they are in fact both lacking and mistaken in their metaphysics. I've started putting some of this to some of them on Twitter, and I was feeling bad that none of them had anything to say about it, but what's striking is that all have been unable to respond. Happily enough, there are a small number of people who do seem to understand this stuff, so I don't think it's a big risk to pursue this approach for a bit longer, and see where it goes. So that's what I'm going to do.

My latest thought is that we should write a new approach to process, that's called The Eventual Approach, and just make everything be about events (and then abstract from that to get everything we are familiar with currently). Just like Google search results for "process event" refer only to functions, the Google search results for "eventual approach" refer to "the process that you end up with", which is perfect, but it seems nobody has taken this word-phrase as a name, as a noun. It's informal, but I think we can be rigorous about what events we want to happen, and describe them. The pattern language form is actually perfect for this, which was a surprise to me (Alexander was influenced by Whitehead a lot, and even quotes him in at least one of his books).

Anyway, that's just an update after the podcast. :-) I was thinking about messaging the podcast guy, to tell him I figured out the library wasn't finished and added some important things.

I didn't look at this Kafka Connect thing before, so it would be good to find out more about it. Kafka Connect seems to be about reading data from Kafka? The impression I had is that Kafka can't function as a record manager in this library, because you can't do atomic transactions across different sequences. I'm sure that is true for Cassandra. I think it's also true for EventStore, but it's been hard to find out. ACID databases aren't distributed. I think VaultDB is both distributed and offers ACID transactions across partitions, but its documentation says that feature is slow. So basically, we don't have a distributed database that can atomically store process events that involve all three of: tracking upstream position, zero-many new domain events in aggregate sequence, and zero-many new notifications in application sequence. I didn't try storing the process event directly, but that could be a technique more suited to distributed persistence.
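
With a single ACID database it's straightforward; here's a minimal sketch using SQLite, with invented table and column names rather than the library's record managers:

```python
import sqlite3

def record_process_event(db: sqlite3.Connection, upstream_name: str, position: int,
                         domain_events: list, notifications: list) -> None:
    # The tracking position, the new domain events, and the new notifications
    # either all commit together, or none of them do.
    with db:  # one transaction: commits on success, rolls back on any error
        # A unique constraint on (upstream_name, position) would also stop the
        # same processing being recorded twice.
        db.execute(
            "INSERT INTO tracking (upstream_name, position) VALUES (?, ?)",
            (upstream_name, position),
        )
        db.executemany(
            "INSERT INTO events (originator_id, originator_version, state) VALUES (?, ?, ?)",
            domain_events,   # zero or many new domain events
        )
        db.executemany(
            "INSERT INTO notifications (notification_id, state) VALUES (?, ?)",
            notifications,   # zero or many new notifications
        )
```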

Sorry to write so much. More than happy to discuss this further. Sorry if I overlooked anything! Thanks again for making contact. If you need help actually using this library to write Python code, just let me know.

okeyokoro commented 5 years ago

That was a lot to unpack, and the metaphysics caught me by surprise...I wasn't expecting the line demarcating philosophy and computer science to blur in a GitHub issue 😅. So a lot of it went over my head.

I would appreciate it if you could describe how you currently build event-driven systems (with a microservices architecture) using this library. Is Cassandra still your choice for an eventstore? If you needed to support stream processing how would you approach it?

johnbywater commented 5 years ago

I have used Cassandra, but the LWTs mean you can't do cross-table ACID transactions, so you can store the events of an aggregate, but it's hard (not really possible in general) to propagate the state of such an application in a "reliable" manner. So now I tend to use a relational ACID database such as SQLite, MySQL, or PostgreSQL.

I tend to use them through SQLAlchemy, but sometimes with Django: the library has record managers for both ORMs. I also use the "plain old python objects" record manager, for example when running a simulation of a system that will run at normal speed against a normal database, because it's very fast, allows the model to behave in exactly the same way, and if a simulation is affected by "infrastructure failure" then it can be run again from the start.

With this library you can make a cluster of cooperating microservices using the System stuff, described in the "Distributed system" section of the library docs: https://eventsourcing.readthedocs.io/en/stable/topics/process.html

If you had in mind the RPC PUSH style of doing microservices, where services call each other to invoke functionality, then it's probably worth reading the "Distributed system" section to see why that isn't a very good way to make a reliable distributed system. But anyway if you wanted to use event sourcing within such a system, then you can just make a normal event sourced application and use it behind an API, and call into that when clients make requests.

The System stuff works by each application processing a stream of events (notification log) from upstream applications. An external sequence of events can be processed by reading that sequence with a reader of some kind, and then calling a process application policy for each event. If you resume reading the stream from the position saved in local tracking records, assuming they are stored atomically with whatever projection is updated by processing the events, then the processing will be reliable.
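
In outline, the loop looks something like this (the names are placeholders, not the library's API):

```python
def process_upstream(reader, policy, recorder):
    # Resume from the position saved in the local tracking records.
    position = recorder.max_tracked_position()
    for notification in reader.read(start=position + 1):
        new_events = policy(notification)  # decide what follows from the event
        # Reliability depends on writing these in one atomic transaction:
        # the new tracking position and whatever the policy produced.
        recorder.record_atomically(notification.position, new_events)
```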

okeyokoro commented 5 years ago

This is great @johnbywater, you've given me all I need to get my project up and running (and some great advice on how to keep it running reliably). I'm heading over to the documentation. Thanks 😁.