magda-io / magda

A federated, open-source data catalog for all your big data and small data
https://magda.io
Apache License 2.0

Split Registry API from Webhook Processor #796

Open AlexGilleran opened 6 years ago

AlexGilleran commented 6 years ago

Intro

Currently Magda uses event sourcing, but it’s entirely limited to the registry - to work with event sourcing, data has to live in the registry and be changed by the registry. It’d be good to have this separate so we could use it for data outside the registry - e.g. when something in the Content API changes, we could notify other services that use it to update their cached value.
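As a rough illustration of the Content API example, here is a minimal in-process sketch. The `EventBroker` class, the `content.changed` topic name, and the event shape are all assumptions made for illustration; none of them come from Magda itself, and a real broker would be a separate service with persistence.

```python
from collections import defaultdict


class EventBroker:
    """Hypothetical in-memory broker: topics map to lists of handlers."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, handler):
        # Register a callable to be invoked for every event on `topic`.
        self._subscribers[topic].append(handler)

    def publish(self, topic, event):
        # Deliver the event to every handler subscribed to `topic`.
        for handler in self._subscribers[topic]:
            handler(event)


broker = EventBroker()

# A service outside the registry that caches a Content API value.
cache = {"header-logo": "old.png"}


def invalidate(event):
    # Drop the cached value so it is re-fetched on next use.
    cache.pop(event["id"], None)


broker.subscribe("content.changed", invalidate)

# The Content API announces a change; the subscriber's cache entry is dropped.
broker.publish("content.changed", {"id": "header-logo"})
```

The point of the sketch is only that the publisher does not need to know who is listening, which is what decoupling the event mechanism from the registry would allow.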

Rationale

Design

Event Broker

Getting Events from the Broker

Sending Events to the Broker

Catching Up

[Image: catchup ERD]

External Webhooks

Diagram

[Image: new event sourcing 1]

t83714 commented 6 years ago

I think it's a very good idea +1

aneesha09 commented 6 years ago

SPIKE to figure out the best architecture approach

kring commented 5 years ago

Makes sense to me. I was wondering, though: why create a separate row for each event/subscription pair, instead of using the registry's current approach of associating a "last event ID seen" with each subscription? There's a good chance you explained why and I just missed it.

AlexGilleran commented 5 years ago

@kring So you can have retries (although I might not have thought this through enough)

Say you have n events and two instances of the same minion consuming the event stream.

Just keeping track of the last event id doesn't allow for that retry logic... but doing it this way means that events won't necessarily be processed in order. But I think that's OK - it just becomes incumbent upon the consumers of the events not to trust that the data in the event represents the most up-to-date state of the system... which they shouldn't be doing anyway because the event might be 3 months old for all they know.
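To make the trade-off concrete, here is a minimal sketch of the row-per-pair idea. The `SubscriptionEvent` row, its `status` values, and the `PerPairQueue` helper are hypothetical names chosen for illustration, not the schema proposed in this issue. It shows two instances claiming events, one event failing and being retried, and the resulting out-of-order completion.

```python
from dataclasses import dataclass


@dataclass
class SubscriptionEvent:
    """Hypothetical row: one per (event, subscription) pair."""
    event_id: int
    subscription_id: str
    status: str = "pending"  # pending -> in_progress -> done
    attempts: int = 0


class PerPairQueue:
    def __init__(self, subscription_id, event_ids):
        self.rows = [SubscriptionEvent(e, subscription_id) for e in event_ids]
        self.completed = []  # event ids in the order they were acknowledged

    def claim(self, limit):
        # Hand up to `limit` pending rows to one consumer instance.
        claimed = [r for r in self.rows if r.status == "pending"][:limit]
        for row in claimed:
            row.status = "in_progress"
            row.attempts += 1
        return claimed

    def ack(self, row):
        row.status = "done"
        self.completed.append(row.event_id)

    def fail(self, row):
        # Back to pending so that any instance can pick it up again.
        row.status = "pending"


queue = PerPairQueue("minion-a", [1, 2, 3, 4])
first = queue.claim(2)    # instance 1 takes events 1 and 2
second = queue.claim(2)   # instance 2 takes events 3 and 4
queue.fail(first[0])      # event 1 fails on instance 1
queue.ack(first[1])
for row in second:
    queue.ack(row)
retry = queue.claim(2)    # only event 1 is still pending
queue.ack(retry[0])
print(queue.completed)    # event 1 finishes last, after 2, 3 and 4
```

With a single "last event ID seen" cursor there is no place to record that event 1 failed while 2 to 4 succeeded, which is the retry gap described above.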

kring commented 5 years ago

Just keeping track of the last event id doesn't allow for that retry logic

Ah ok, that makes sense.

I think you could achieve that retry logic without duplicating every event for every subscription, though. I'm sure I haven't thought this through enough either, but I think you could store the ranges of pending events in a separate table.

So each subscription would have:

So when a minion asks for the next page:

When events are done being processed:

I guess that's a lot more complicated, so maybe it's not worth it. Storing multiple copies of every event makes me nervous, though. There are lots of events.

AlexGilleran commented 5 years ago

Storing multiple copies of every event makes me nervous, though. There are lots of events.

🤔 yeah. Although hopefully either way (whether keeping a row per event or a row per range) it's safe to delete the ones that are done. That does raise the question of how you can be absolutely sure that you're creating a subscriptionevent (or subscriptioneventrange) for each subscription every time there's a new event 😬

kring commented 5 years ago

True, it's safe to delete the ones that are done. But still, if there is a row per event for every subscription, and there are 10 subscriptions, handling a new event (and connectors can generate, I dunno, tens to hundreds of them per second, maybe more) requires writing 10 times as much stuff.

In the scheme I outlined above, I don't think it's necessary to change the subscriptioneventrange table for new events. It only needs to be modified when a minion asks for events to process or when it marks them done. This is good because it makes inactive subscriptions (where no one is processing the events) basically free. If a new event is created per subscription, inactive subscriptions are super expensive because their events never get deleted.
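One possible reading of the range-based scheme, as a minimal sketch. The details of the original outline are not reproduced in this thread, so the `RangeSubscription` class, its fields, and the `claim`/`done`/`fail` methods are assumptions made for illustration. It assumes event ids are consecutive integers and that the latest id can be read from the events table.

```python
from dataclasses import dataclass, field


@dataclass
class RangeSubscription:
    """Hypothetical per-subscription state; nothing here changes on new events."""
    next_event_id: int = 1                         # first id never handed out
    in_flight: list = field(default_factory=list)  # (first, last) ranges claimed
    retry: list = field(default_factory=list)      # ranges released after failure

    def claim(self, latest_event_id, page_size):
        # Prefer ranges that failed earlier, otherwise cut a fresh page.
        if self.retry:
            rng = self.retry.pop(0)
        else:
            if self.next_event_id > latest_event_id:
                return None  # nothing new to process
            last = min(self.next_event_id + page_size - 1, latest_event_id)
            rng = (self.next_event_id, last)
            self.next_event_id = last + 1
        self.in_flight.append(rng)
        return rng

    def done(self, rng):
        # A finished range is simply deleted.
        self.in_flight.remove(rng)

    def fail(self, rng):
        # A failed range becomes available to the next claimer.
        self.in_flight.remove(rng)
        self.retry.append(rng)


sub = RangeSubscription()
latest = 1000                 # 1000 events were appended; `sub` was never touched

a = sub.claim(latest, 100)    # (1, 100)
b = sub.claim(latest, 100)    # (101, 200)
sub.fail(a)
sub.done(b)
c = sub.claim(latest, 100)    # the failed range (1, 100) is handed out again
sub.done(c)
```

In this sketch, appending events only moves `latest`, so a subscription that nobody is consuming holds a single cursor and no range rows, however many events accumulate.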

t83714 commented 5 years ago

Good job, @AlexGilleran! A comprehensive and very impressive design 👍 I like it, especially the part about sending events to the broker.

Just a few questions regarding the event DB structure design: