AlexGilleran commented 6 years ago

Intro

Currently Magda uses event sourcing, but it’s entirely limited to the registry - to work with event sourcing, data has to live in the registry and be changed by the registry. It’d be good to have this separate so we could use it for data outside the registry - e.g. when something in the Content API changes, we could notify other services that use it to update their cached value.

Rationale

System becomes more supportive of extension (anything can use event sourcing without having to put data in the registry)
Don’t have to have the split between read-only and full registry anymore
Having data and events in the same database in the registry has allowed us to do some complicated, cool stuff (like being able to subscribe to changes on aspect A that links to aspect B that links to a collection of aspect C, and still get events when aspect C changes). However this has led to us desperately debugging performance problems in the database and doing some pretty dodgy index creation in order to keep it working. Having events and data in separate databases would enforce simplicity.
Event processing in the registry API is done by a lot of very complex code in a very resource-intensive way. It’d be good to switch it to a less resource-intensive runtime (NodeJS) and to non-blocking DB access
Event processing can’t be split across multiple pods, there has to be exactly one full registry api
Event listening can’t be split across multiple pods, webhooks are sent serially
In general we want to move away from Scala, and this is a step towards standardising on TypeScript

Design

Event Broker

Event sourcing is now handled by central event broker pod(s) with their own database (separate from the registry)
Services can register new event topics
Other services can register subscriptions to a topic
Once a subscription has been created for a topic, each event posted to that topic also gets a row created for each subscription. Each of these rows has a status and a last modified field
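
A minimal in-memory sketch of the fan-out described above, assuming one stored event plus one status row per subscription. `BrokerStore`, its methods and the field names are all hypothetical, and arrays stand in for what would be Postgres tables.

```typescript
// Hypothetical model of the broker's tables; every name here is illustrative.
type EventStatus = "NEW" | "PENDING" | "DONE";

interface BrokerEvent { id: number; topic: string; payload: string; }
interface Subscription { id: string; topic: string; }
interface SubscriptionEvent {
  subscriptionId: string;
  eventId: number;
  status: EventStatus;
  lastModified: number; // epoch milliseconds
}

class BrokerStore {
  readonly topics: string[] = [];
  readonly subscriptions: Subscription[] = [];
  readonly events: BrokerEvent[] = [];
  readonly subscriptionEvents: SubscriptionEvent[] = [];
  private nextEventId = 1;

  registerTopic(topic: string): void {
    if (this.topics.indexOf(topic) === -1) this.topics.push(topic);
  }

  subscribe(id: string, topic: string): void {
    this.requireTopic(topic);
    this.subscriptions.push({ id, topic });
  }

  // Store the event once, then create one NEW row per subscription on the topic.
  postEvent(topic: string, payload: string, now: number): BrokerEvent {
    this.requireTopic(topic);
    const event: BrokerEvent = { id: this.nextEventId++, topic, payload };
    this.events.push(event);
    for (const subscription of this.subscriptions) {
      if (subscription.topic === topic) {
        this.subscriptionEvents.push({
          subscriptionId: subscription.id,
          eventId: event.id,
          status: "NEW",
          lastModified: now,
        });
      }
    }
    return event;
  }

  private requireTopic(topic: string): void {
    if (this.topics.indexOf(topic) === -1) throw new Error("Unknown topic: " + topic);
  }
}
```

In this sketch, events posted before a subscription exists get no row for it, so a new subscriber only sees events from the point it subscribed.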

Getting Events from the Broker

Rather than have the broker send events directly to the subscribers as it does now, subscribers will pull events
- This isn’t as nice, but it has a lot of advantages especially enabling having more than one pod processing a stream of events concurrently
- It’s also how Apache Kafka works
A subscribing service pulls events from its subscription - when it does so, the event broker retrieves the next x events for that subscription and in the same transaction changes their status to “PENDING”.
- The next x events are ones that have status “NEW”, or status “PENDING” with a last modified time older than some retry threshold
When the subscribing service has finished processing those events, it posts back to the event broker to set the status to “DONE”
- It could potentially do this one event at a time if it’s slow
- We could also bundle marking old events as “DONE” and getting new events into one call
This means that you can have multiple instances of a service subscribed to a subscription, processing events in parallel - one gets a page of events and immediately those events are set to PENDING so when the other instance gets a page of events, it’ll serve the next page.
- What if they request data at the same time? The locking model of Postgres should handle blocking the second request until the first one completes, although I’m not sure if that’s the default.
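
The pull and acknowledge steps above can be sketched in memory as follows. This is an illustration under assumptions, not the broker's actual implementation: `SubscriptionQueue`, `pull`, `markDone` and the millisecond threshold are hypothetical names, and the array stands in for the per-subscription rows.

```typescript
// Hypothetical sketch of claiming and completing events for one subscription.
type EventStatus = "NEW" | "PENDING" | "DONE";

interface SubscriptionEventRow {
  eventId: number;
  status: EventStatus;
  lastModified: number; // epoch milliseconds
}

class SubscriptionQueue {
  constructor(
    private rows: SubscriptionEventRow[],
    private retryThresholdMs: number
  ) {}

  // Claim up to `limit` rows that are NEW, or PENDING and stale, lowest id first.
  // In a real database the select and the status update would share one transaction.
  pull(limit: number, now: number): number[] {
    const claimable = this.rows
      .filter(
        (row) =>
          row.status === "NEW" ||
          (row.status === "PENDING" && now - row.lastModified > this.retryThresholdMs)
      )
      .sort((a, b) => a.eventId - b.eventId)
      .slice(0, limit);
    for (const row of claimable) {
      row.status = "PENDING";
      row.lastModified = now;
    }
    return claimable.map((row) => row.eventId);
  }

  // Mark processed events DONE; rows that are not PENDING are left untouched.
  markDone(eventIds: number[], now: number): void {
    for (const row of this.rows) {
      if (row.status === "PENDING" && eventIds.indexOf(row.eventId) !== -1) {
        row.status = "DONE";
        row.lastModified = now;
      }
    }
  }

  statusOf(eventId: number): EventStatus | undefined {
    for (const row of this.rows) {
      if (row.eventId === eventId) return row.status;
    }
    return undefined;
  }
}
```

On the concurrency question: Postgres supports `SELECT ... FOR UPDATE SKIP LOCKED`, which lets a second concurrent pull skip rows that the first transaction has locked rather than wait on them. Whether that fits this design is an open choice, not something the proposal specifies.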
We should probably start by having the subscribers poll the event broker at an interval, but potentially we could get more clever - have the broker API hold the connection open until there’s a new event (long-polling) or maybe even use websockets. We could also use Postgres LISTEN/NOTIFY to reduce the need for polling the database
If we want to send an event to multiple recipients (e.g. if we want to notify all pods in a deployment that they should bust their cache), they can listen to the topic without creating a subscription, and keep track of where they are in the event queue themselves, Kafka-style
- This works for the cache scenario, I’m not sure if there’s another one where we need to make sure that every event gets sent to multiple recipients - if so they’ll probably have to create one subscription each
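
A rough sketch of the subscription-less listener described above, where each pod remembers its own position in the topic. `readAfter` and `CacheBustingListener` are hypothetical names, and the log is assumed to be ordered by ascending event id.

```typescript
interface TopicEvent { id: number; payload: string; }

// Hypothetical read-only broker call: events with an id greater than `afterId`.
// Nothing is marked PENDING or DONE, so any number of pods can read the same events.
function readAfter(log: TopicEvent[], afterId: number, limit: number): TopicEvent[] {
  return log.filter((event) => event.id > afterId).slice(0, limit);
}

class CacheBustingListener {
  private lastSeenId = 0; // held in the pod's memory only, so it resets on restart
  readonly busted: string[] = [];

  // Read the next batch, bust the matching cache entries, advance the local offset.
  poll(log: TopicEvent[], limit: number): number {
    const batch = readAfter(log, this.lastSeenId, limit);
    for (const event of batch) {
      this.busted.push(event.payload);
      this.lastSeenId = event.id;
    }
    return batch.length;
  }
}
```

Losing the offset on restart is acceptable for the cache scenario, since a restarted pod starts with an empty cache anyway; it could begin from the latest event id instead of 0.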
Why not just use Kafka?
- Kafka requires subscribers to keep track of where they are up to in the event stream, which clashes with us using stateless services
- It’s also a pretty complex deployment (needs multiple Kafka pods and ZooKeeper), it’d be better if we can just use Postgres until we need Kafka’s speed/availability

Sending Events to the Broker

This creates a problem where a service has to do two things at once - both mutating its own database and sending an event to the broker. If it fails to do one, it shouldn’t do the other, but because they’re happening in different databases across different services it’s impossible to put it in one transaction
I think we should use the Transactional Outbox Pattern for this. Essentially each database-backed service has a table for outgoing events. When something is changed in the database, an event is created in the same transaction. If either part of the transaction fails, both are rolled back. When an event is successfully received by the event broker, it’s deleted on the sending system.
How do we get events from this table to the broker? Potentially we could make every service do it manually, but also we could create a sidecar container for every pod that runs the same docker image.
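
The outbox idea can be illustrated with a small in-memory sketch. Everything here is an assumption made for illustration: `ServiceDb` fakes a transaction by committing a copy of its state only if the work succeeds, and `relayOutbox` plays the role of whatever moves events to the broker (the service itself or a sidecar).

```typescript
interface OutboxEvent { id: number; topic: string; payload: string; }

interface DbState {
  records: { [key: string]: string };
  outbox: OutboxEvent[];
  nextId: number;
}

class ServiceDb {
  private state: DbState = { records: {}, outbox: [], nextId: 1 };

  // Run `work` against a copy of the state and commit it only if `work` does not throw.
  transaction(work: (tx: DbState) => void): void {
    const draft: DbState = {
      records: { ...this.state.records },
      outbox: this.state.outbox.slice(),
      nextId: this.state.nextId,
    };
    work(draft); // a throw here discards the draft, i.e. rolls back both writes
    this.state = draft;
  }

  get(key: string): string | undefined { return this.state.records[key]; }
  pendingEvents(): OutboxEvent[] { return this.state.outbox.slice(); }
  deleteEvent(id: number): void {
    this.state.outbox = this.state.outbox.filter((event) => event.id !== id);
  }
}

// The data change and its outgoing event are written in the same transaction.
function updateRecord(db: ServiceDb, key: string, value: string): void {
  db.transaction((tx) => {
    tx.records[key] = value;
    tx.outbox.push({ id: tx.nextId++, topic: "content-changed", payload: key });
  });
}

// Send pending events in order; delete each one only after the broker accepts it.
function relayOutbox(db: ServiceDb, send: (event: OutboxEvent) => boolean): number {
  let sent = 0;
  for (const event of db.pendingEvents()) {
    if (!send(event)) break; // stop at the first failure and retry on the next run
    db.deleteEvent(event.id);
    sent++;
  }
  return sent;
}
```

If the relay dies after the broker accepts an event but before the row is deleted, the event is sent again on the next run, so delivery is at-least-once and the receiving side needs to tolerate duplicates.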

Catching Up

I foresee this could work similarly to the way events work
The event broker will allow a “catchup” to be created. This “catchup” is linked to a number of “pages”, each of which has a token and a status
When a subscriber creates a new “catchup”, it’ll create a page with a null token and status “PENDING”
The subscriber will then retrieve the first page of whatever it’s reading. Immediately it’ll post back to the event broker with the token of the next page, which’ll be created with status “NEW”
Once it’s finished ingesting its page, the subscriber will get the next page from the event broker. If there isn’t one, it’ll wait
Similarly to events, the pages that come back could be “status = new”, or “status = pending” with a modified time older than a threshold
Possibly the service could keep nudging the page as it went along to change the modified date and keep it from going stale
When the subscriber finishes the last page, it sets the catchup to “DONE”
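
The page lifecycle above could look roughly like this in memory. `Catchup` and its method names are hypothetical, as is the rule that the catchup can only be finished once every page is DONE.

```typescript
type PageStatus = "NEW" | "PENDING" | "DONE";

interface CatchupPage {
  token: string | null; // null for the first page
  status: PageStatus;
  lastModified: number; // epoch milliseconds
}

class Catchup {
  status: "PENDING" | "DONE" = "PENDING";
  readonly pages: CatchupPage[];

  // Creating the catchup creates the first page: null token, already PENDING.
  constructor(private staleAfterMs: number, now: number) {
    this.pages = [{ token: null, status: "PENDING", lastModified: now }];
  }

  // Posted back by the subscriber as soon as it knows the next page's token.
  addPage(token: string, now: number): void {
    this.pages.push({ token, status: "NEW", lastModified: now });
  }

  // Hand out the first page that is NEW, or PENDING and stale; undefined means wait.
  claimNextPage(now: number): CatchupPage | undefined {
    for (const page of this.pages) {
      const stale = page.status === "PENDING" && now - page.lastModified > this.staleAfterMs;
      if (page.status === "NEW" || stale) {
        page.status = "PENDING";
        page.lastModified = now;
        return page;
      }
    }
    return undefined;
  }

  // "Nudge" a page that is still being ingested so it does not go stale.
  touchPage(token: string | null, now: number): void {
    for (const page of this.pages) {
      if (page.token === token && page.status === "PENDING") page.lastModified = now;
    }
  }

  completePage(token: string | null, now: number): void {
    for (const page of this.pages) {
      if (page.token === token) {
        page.status = "DONE";
        page.lastModified = now;
      }
    }
  }

  // Mark the whole catchup DONE, but only if no page is still outstanding.
  finish(): boolean {
    if (this.pages.every((page) => page.status === "DONE")) this.status = "DONE";
    return this.status === "DONE";
  }
}
```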

[Image: catchup ERD]

External Webhooks

… will have to be done by some other service that listens to the event broker.
BUT at least we can distribute these services (i.e. have multiple webhook processors). Right now if we had external webhooks they’d all have to be handled by a single instance of the registry, and it already runs pretty hot

Diagram

[Image: new event sourcing 1]

t83714 commented 6 years ago

I think it's a very good idea +1

aneesha09 commented 6 years ago

SPIKE to figure out the best architecture approach

kring commented 5 years ago

Makes sense to me. I was wondering, though: why create a separate row for each event/subscription pair? Instead of the registry's current approach of associating a "last event ID seen" with each subscription? There's a good chance you explained why and I just missed it.

AlexGilleran commented 5 years ago

@kring So you can have retries (although I might not have thought this through enough)

Say you have n events and two instances of the same minion consuming the event stream.

Minion 1 starts up and grabs events 1-10, they're set to pending for that subscription
Minion 2 starts up and grabs events 11-20, they're set to pending for that subscription
Minion 1 goes down (maybe the node failed or something)
Minion 1 starts up again, grabs events 21-30 because those are the next non-pending non-completed events. Events 1-10 are still in state pending
Minion 2 completes its page, sets 11-20 to done, grabs 31-40.
Some time passes, etc., etc.
Minion 1 grabs the next page of events, but enough time has passed that the lastmodified timestamp on events 1-10 is older than the retry threshold. So rather than get the next page of events (41-50 or greater), it gets 1-10, their lastmodified (maybe this should be called something different) is set to now() and they remain in state pending
Minion 1 doesn't crash this time and completes events 1-10, so they're set to done

Just keeping track of the last event id doesn't allow for that retry logic... but doing it this way means that events won't necessarily be processed in order. But I think that's OK - it just becomes incumbent upon the consumers of the events not to trust that the data in the event represents the most up-to-date state of the system... which they shouldn't be doing anyway because the event might be 3 months old for all they know.

kring commented 5 years ago

Just keeping track of the last event id doesn't allow for that retry logic

Ah ok, that makes sense.

I think you could achieve that retry logic without duplicating every event for every subscription, though. I'm sure I haven't thought this through enough either, but I think you could store the ranges of pending events in a separate table.

So each subscription would have:

A "high water mark" of the first event that is not totally done.
A list of event ranges (past the high water mark) that are pending or done, and the lastmodified timestamp for the ones that are pending.

So when a minion asks for the next page:

check for event ranges that are pending and expired for that subscription, and return those if there are any.
if not, return the next range after the high water mark that is not already in the subscription's table.

When events are done being processed:

mark them as such in the subscription's table.
if possible, advance the high water mark and remove event ranges that are below it.

I guess that's a lot more complicated, so maybe it's not worth it. Storing multiple copies of every event makes me nervous, though. There are lots of events.
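
A small sketch of the high-water-mark scheme outlined above, to make the bookkeeping concrete. `RangeSubscription` and its method names are hypothetical, and it assumes ranges are always handed out contiguously from the high water mark, so the list stays sorted by start.

```typescript
interface EventRange {
  start: number;
  end: number; // inclusive
  status: "PENDING" | "DONE";
  lastModified: number; // epoch milliseconds
}

class RangeSubscription {
  highWaterMark = 1; // id of the first event that is not totally done
  ranges: EventRange[] = []; // ranges at or past the high water mark, sorted by start

  constructor(private retryThresholdMs: number) {}

  nextPage(
    limit: number,
    latestEventId: number,
    now: number
  ): { start: number; end: number } | undefined {
    // 1. Prefer a range that is pending and has expired.
    for (const range of this.ranges) {
      if (range.status === "PENDING" && now - range.lastModified > this.retryThresholdMs) {
        range.lastModified = now;
        return { start: range.start, end: range.end };
      }
    }
    // 2. Otherwise hand out the next range that is not in the table yet.
    const start =
      this.ranges.length > 0
        ? this.ranges[this.ranges.length - 1].end + 1
        : this.highWaterMark;
    if (start > latestEventId) return undefined; // nothing new to process
    const end = Math.min(start + limit - 1, latestEventId);
    this.ranges.push({ start, end, status: "PENDING", lastModified: now });
    return { start, end };
  }

  markDone(start: number, end: number, now: number): void {
    for (const range of this.ranges) {
      if (range.start === start && range.end === end) {
        range.status = "DONE";
        range.lastModified = now;
      }
    }
    // Advance the high water mark over leading DONE ranges and drop them.
    while (
      this.ranges.length > 0 &&
      this.ranges[0].status === "DONE" &&
      this.ranges[0].start === this.highWaterMark
    ) {
      this.highWaterMark = this.ranges[0].end + 1;
      this.ranges.shift();
    }
  }
}
```

Posting a new event never touches this table, so the amount of state per subscription grows with the number of outstanding pages, not with the number of events.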

AlexGilleran commented 5 years ago

Storing multiple copies of every event makes me nervous, though. There are lots of events.

🤔 yeah. Although hopefully either way (whether keeping a row per event or row per range) it's safe to delete ones that are done. Although that poses a question as to how you can be absolutely sure that you're creating a subscriptionevent (or subscriptioneventrange) for each subscription every time there's a new event 😬

kring commented 5 years ago

True, it's safe to delete the ones that are done. But still, if there is an event for every subscription, and there are 10 subscriptions, handling a new event (which connectors can generate, I dunno, tens to hundreds of per second, maybe more) requires writing 10 times as much stuff.

In the scheme I outlined above, I don't think it's necessary to change the subscriptioneventrange table for new events. It only needs to be modified when a minion asks for events to process or when it marks them done. This is good because it makes inactive subscriptions (where no one is processing the events) basically free. If a new event is created per subscription, inactive subscriptions are super expensive because their events never get deleted.

t83714 commented 5 years ago

Good job! @AlexGilleran Comprehensive & Very impressive design 👍 I like it -- especially the sending events to broker part.

just a few questions regarding the event db structure design:

Considering most event-related use cases are expected to process events in order, would it make sense to leave out the parallel event processing feature, so that we can remove the status from the event records and convert this to a much simpler first-in-first-out event queue in the database (which could potentially run faster)? i.e.:
- only need a createTime on events for event expiry & no other management fields
- only one copy of the event record, with a separate table to maintain the event-to-subscription mapping
- When a webhook tells the registry that a list of events is done, those events are no longer associated with the subscription (they are removed from it).
- Once an event is not associated with any subscription, it will simply be deleted from the event table.
  - reducing the number of records in the event table will probably help the DB run faster
- Any events older than a certain threshold will also be deleted by a registry routine
We could probably maintain a separate event log facility for the catch-up feature, considering it might not be a frequent operation
- this log could keep a copy of all events created so far and never delete them
- maybe not in the database but in something that handles large amounts of data more easily, e.g. Elasticsearch
- We never access the log unless a webhook wants to catch up on expired events
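
A minimal sketch of the single-copy FIFO variant suggested above, assuming a fixed set of subscriptions. `FifoEventStore` and its methods are hypothetical names, and the separate catch-up log is left out.

```typescript
interface QueuedEvent { id: number; createTime: number; payload: string; }
interface EventSubscriptionLink { eventId: number; subscriptionId: string; }

class FifoEventStore {
  private events: QueuedEvent[] = []; // one copy of each event, in creation order
  private links: EventSubscriptionLink[] = []; // event-to-subscription mapping
  private nextId = 1;

  constructor(private subscriptionIds: string[]) {}

  publish(payload: string, now: number): number {
    const id = this.nextId++;
    this.events.push({ id, createTime: now, payload });
    for (const subscriptionId of this.subscriptionIds) {
      this.links.push({ eventId: id, subscriptionId });
    }
    return id;
  }

  // Oldest events still linked to this subscription, first in first out.
  peek(subscriptionId: string, limit: number): QueuedEvent[] {
    return this.events
      .filter((event) =>
        this.links.some(
          (link) => link.eventId === event.id && link.subscriptionId === subscriptionId
        )
      )
      .slice(0, limit);
  }

  // Unlink finished events, then delete events that no subscription still needs.
  acknowledge(subscriptionId: string, eventIds: number[]): void {
    this.links = this.links.filter(
      (link) =>
        !(link.subscriptionId === subscriptionId && eventIds.indexOf(link.eventId) !== -1)
    );
    this.events = this.events.filter((event) =>
      this.links.some((link) => link.eventId === event.id)
    );
  }

  // Routine cleanup: drop events older than `maxAgeMs` along with their links.
  expire(now: number, maxAgeMs: number): void {
    this.events = this.events.filter((event) => now - event.createTime <= maxAgeMs);
    this.links = this.links.filter((link) =>
      this.events.some((event) => event.id === link.eventId)
    );
  }

  eventCount(): number { return this.events.length; }
}
```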

magda-io / magda

Split Registry API from Webhook Processor #796
