ryan-mars / stochastic

TypeScript framework for building event-driven services. Easily go from Event Storming → Code.
MIT License

Event replay #9

Open ryan-mars opened 3 years ago

ryan-mars commented 3 years ago

Should this be done by cdk-triggers during deploy or should we allow this to be initiated ad-hoc via some sort of API?

sam-goodwin commented 3 years ago

I think cdk-triggers is better because it is auditable in versioned code. We know exactly who did what to production, and when, as opposed to ad-hoc scripts.

Proposed tenet: all operations are executed as declarative code deployed by a provisioning engine such as CloudFormation/Terraform/Pulumi.

This approach would also then enable us to vend operational constructs (such as Event Replay).

ryan-mars commented 3 years ago

Sounds great

ryan-mars commented 3 years ago

Newly developed read-models in an existing system may need to be historically accurate. Therefore, they will need to be "back-filled" with old events and "brought up to speed" when deployed for the first time. In order to protect the read-model from receiving events in an unexpected order, or live events interleaved with historical events, we cannot just pump all historical events into the read-model handler's queue. Here's a guess at how we could bring up a new read model without much trouble.

Assuming events originating within a bounded context are ordered (SNS/SQS FIFO or Kinesis) and have a lexicographically sortable unique identifier (e.g. KSUID), a new read model could be deployed in two phases.

Backfill Phase

  1. Deploy the read-model with a concurrency of 1 (unless overridden)
  2. The lambda event source is NOT connected to its "live" queue
  3. Allow new events to begin filling the "live" queue; they will not be processed in this phase
  4. Fill the read-model's "replay" queue (new feature) with "historical" events from S3
  5. When the latest "historical" event is newer than the oldest "live" event in the "live" queue, we're ready to go live

Live Phase

  1. Deploy the read model with its normal config
  2. Skip events you've already seen
  3. Profit!

You could think of this as a sequential lambda architecture.
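As a rough sketch of the cutover check in step 5 of the Backfill Phase (the event shape and helper below are hypothetical, not existing stochastic APIs), assuming KSUIDs sort lexicographically in creation order:

interface ReplayEvent {
  id: string;      // KSUID: lexicographically sortable by creation time
  type: string;
  payload: unknown;
}

// Sketch only: a plain string comparison is enough to decide when the backfill
// has caught up with the oldest event waiting in the "live" queue.
function readyToGoLive(latestHistorical: ReplayEvent, oldestLive: ReplayEvent): boolean {
  return latestHistorical.id >= oldestLive.id;
}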

sam-goodwin commented 3 years ago

Does this require two CloudFormation deployments?

If all events have a KSUID, why does it matter if we interleave live with historical events? The events in the live queue won't be ordered by time, only ordered as they arrived in the FIFO queue or Kinesis stream. The difference would be the amount of time separating the events. Does that matter? I hope not.

It depends on the aggregation strategy of the read model. If it's commutative and associative, then we can just do it all in one go and the model will eventually be consistent.

This strategy would work with any monoid data structure since a monoid (by definition) obeys all of these laws. We can provide a bunch of primitives (like algebird) or depend on some library (like automerge). Developers can then compose these primitives to build up a complex data structure specific to their business.

I think this is the simplest and most scalable strategy, and we should pursue it until it fails. It may fall down in some cases, but perhaps those are rare and maybe they can be avoided in clever ways? This friction may end up forcing the developers to build more scalable and reliable architectures instead of relying on slow linear processes that require absolute time order?

We may see an ecosystem and community develop around this problem, which is why I'm particularly interested in a composable approach using monoids.

sam-goodwin commented 3 years ago

Correction: a monoid doesn't require commutativity. We would want commutative monoids.

http://www.cs.umd.edu/class/spring2019/cmsc388F/lectures/monoids.html

It's just a fancy word for a data structure that has an initial value and a combine method.

interface Monoid<T> {
  zero(): T;              // the identity ("initial") value
  combine(a: T, b: T): T; // associative binary operation
}

If we require developers to implement combine in a commutative way, we can get a very long way.
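For illustration only (not a proposed stochastic API), two commutative instances of that interface and a fold over them:

// A counter: zero is 0, combine is addition. Addition is associative and
// commutative, so events can be folded in any order and any grouping.
const Counter: Monoid<number> = {
  zero: () => 0,
  combine: (a, b) => a + b,
};

// A set-union monoid, e.g. for an "all ids ever seen" style read model.
const IdSet: Monoid<Set<string>> = {
  zero: () => new Set<string>(),
  combine: (a, b) => new Set([...a, ...b]),
};

// Folding a batch with a commutative monoid gives the same result regardless
// of arrival order:
const total = [1, 5, 3].reduce((acc, n) => Counter.combine(acc, n), Counter.zero()); // 9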

That said, if it isn't commutative, then I think your strategy would work well. Although I wonder if we can achieve it in a single deployment?

sam-goodwin commented 3 years ago

skip events you've already seen

This could be quite complex. It would require you to keep a record of every ID you've already seen.
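A naive in-memory sketch of that bookkeeping (a real implementation would need a durable store such as a database table with conditional writes; the names here are hypothetical):

const seen = new Set<string>(); // stand-in for a durable record of processed event ids

// Wrap the projection so an event applied during backfill is not applied again
// when it also arrives on the live queue.
function handleOnce(event: { id: string }, project: (e: { id: string }) => void): void {
  if (seen.has(event.id)) return; // already applied, skip it
  seen.add(event.id);
  project(event);
}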

sam-goodwin commented 3 years ago

Sorry, stream of consciousness. Two advantages of the two-phase deployment are that it:

  1. Cleans up the resources used for the backfill
  2. Requires the developer to understand that the read model doesn't immediately take effect and that they should sequence things before using it in production.

There may be a third phase where the developer enables the read model for use in a Policy or Command.

ryan-mars commented 3 years ago

Does this require two CloudFormation deployments?

Yes, so long as we're committed to infra not changing outside the scope of CloudFormation.

If all events have a KSUID, why does it matter if we interleave live with historical events? The events in the live queue won't be ordered by time, only ordered as they arrived in the FIFO queue or Kinesis stream.

It's all about DX. If SQS is feeding out-of-order events to a projection function, the developer will be required to use only commutative operations in/on the read model's data store. Not all read models necessitate the complexity of "commutative/associative only". Alternatively, if stochastic can guarantee a total order of events, and I, the developer, know my domain well, I can write a very simple projection.

Zooming out, different domains will have different concurrency needs.

The simplest way for the developer to write a projection is for stochastic to guarantee a totally ordered set of events and no partial replays. If I understand the concept properly, such a read model would be a left fold over the events. There's no requirement for commutative operations. An example might be a read model for user profiles. Each user can only edit their own profile; it's non-collaborative. There are no race conditions, and it's OK for the last write to win. This is a very simple read model projection to write.
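A sketch of that left fold, assuming a totally ordered stream and a hypothetical ProfileUpdated event shape:

interface ProfileUpdated {
  userId: string;
  displayName?: string;
  bio?: string;
}

type ProfileReadModel = Record<string, { displayName?: string; bio?: string }>;

// With a guaranteed total order and no partial replays, the projection is a
// plain left fold: apply each event on top of the previous state, last write wins.
function project(events: ProfileUpdated[]): ProfileReadModel {
  return events.reduce<ProfileReadModel>((model, e) => {
    const { userId, ...fields } = e;
    return { ...model, [userId]: { ...model[userId], ...fields } };
  }, {});
}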

If total ordering cannot be achieved without unacceptable performance impacts, because the domain is collaborative (multiple editors) or there is competition around physical resources (e.g. seat reservation), then the simple left fold won't work and the developer must use a more creative strategy.

The difference would be the amount of time separating the events. Does that matter? I hope not.

The duration of time between two events should not matter.

It depends on the aggregation strategy of the read model. If it's commutative and associative, then we can just do it all in one go and the model will eventually be consistent.

If we can make the DX really straightforward then yes, that would be ideal and would work for the simple cases too. Keep in mind that with read models the developer will be dealing with a variety of data stores.

This strategy would work with any monoid data structure since a monoid (by definition) obeys all of these laws. We can provide a bunch of primitives (like algebird) or depend on some library (like automerge). Developers can then compose these primitives to build up a complex data structure specific to their business.

Let's prototype this.
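As a starting point for that prototype, here's a hypothetical sketch of composing primitives (reusing the Monoid interface and the Counter/IdSet sketches above; the product helper and order-stats example are illustrative only):

// Combine two monoids into a monoid over pairs, so primitives compose into a
// business-specific structure.
function product<A, B>(ma: Monoid<A>, mb: Monoid<B>): Monoid<[A, B]> {
  return {
    zero: () => [ma.zero(), mb.zero()],
    combine: ([a1, b1], [a2, b2]) => [ma.combine(a1, a2), mb.combine(b1, b2)],
  };
}

// e.g. an order-stats read model = (order count, set of customer ids)
const OrderStats = product(Counter, IdSet);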

I think this is the simplest and most scalable strategy, and we should pursue it until it fails. It may fall down in some cases, but perhaps those are rare and maybe they can be avoided in clever ways? This friction may end up forcing the developers to build more scalable and reliable architectures instead of relying on slow linear processes that require absolute time order?

Let's find out.

One of our design principles is to make simple things easy and hard things (distributed systems) possible.

If we can ensure absolute time order (by relying on vendor guarantees) then it's a nice crutch to have and it keeps the DX simple. However, if the developer misjudges their problem space, choosing the simple option will create problems for them. Perhaps starting from the premise that "events will never be totally ordered" and "partial replays are possible" could guide the design of stochastic in a better direction over the long term.

We may see an ecosystem and community develop around this problem, which is why I'm particularly interested in a composable approach using monoids.

That would be ideal. It's my understanding that it took some time for the Rails community to settle on ActiveRecord as its persistence framework. If the developer has the competency to translate their event processing and read-model persistence strategy into code, then it should fit into the framework in a predictable manner. We should provide at least one default.