scylladb / argus

Consider moving architecture to Event Sourcing #199

Open soyacz opened 1 year ago

soyacz commented 1 year ago

I wasn't part of the design of Argus, and it's quite late to bring this up, but I have an idea for an improvement and I can't resist sharing it: let's move the Argus architecture to Event Sourcing. I think the switch is not that hard and can be done gradually, with benefits along the way. Maybe this has already been considered; if so, it would be nice to have an open conversation about the reasons for rejecting it.

Briefly, the idea behind event sourcing in Argus is to have clients (SCT, dtest, others) just send events to the Argus database. Based on these events, Argus can make decisions such as: test started, test ended, test failed, which versions were used, and so on; basically everything. Argus then builds separate tables for reads so the UI stays performant (CQRS). We already have an extensive event system in SCT, so the SCT code would not need much adaptation.
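To make the idea concrete, here is a minimal sketch of replaying an event log into a read model for the UI. The event names (`TestStarted`, `ScyllaVersionDetected`, `TestFailed`, `TestEnded`), the `RunView` fields, and the in-memory list standing in for the Argus database are all assumptions for illustration, not existing SCT or Argus names.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass(frozen=True)
class Event:
    """One immutable entry in the append-only event log (hypothetical shape)."""
    run_id: str
    type: str
    data: Dict[str, Any] = field(default_factory=dict)


@dataclass
class RunView:
    """Read-side row the UI would query (the CQRS read model)."""
    run_id: str
    status: str = "created"
    scylla_version: Optional[str] = None


def project(events: List[Event]) -> Dict[str, RunView]:
    """Rebuild the read models by replaying the event log in order."""
    views: Dict[str, RunView] = {}
    for event in events:
        view = views.setdefault(event.run_id, RunView(event.run_id))
        if event.type == "TestStarted":
            view.status = "running"
        elif event.type == "ScyllaVersionDetected":
            view.scylla_version = event.data["version"]
        elif event.type == "TestFailed":
            view.status = "failed"
        elif event.type == "TestEnded" and view.status != "failed":
            # A run that ended without a failure event counts as passed.
            view.status = "passed"
    return views


# In-memory stand-in for the events table in the Argus database.
log = [
    Event("run-1", "TestStarted"),
    Event("run-1", "ScyllaVersionDetected", {"version": "5.4.0"}),
    Event("run-1", "TestEnded"),
    Event("run-2", "TestStarted"),
    Event("run-2", "TestFailed", {"reason": "node timeout"}),
    Event("run-2", "TestEnded"),
]
views = project(log)
```

Because the log is the source of truth, a read table like `RunView` can be dropped and rebuilt whenever the UI needs a different shape, without asking clients to resend anything.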

What we gain:

  1. Full separation between clients and Argus. We don't need to add much code to SCT/dtest/others: just one library for sending events, with all events going to the Argus database (this can be implemented as a single, stable Argus REST API endpoint if we don't want to deal with sharing DB credentials).
  2. Much more resilience to change (no need to specify any API/schema versions in API calls, as there will be just one stable endpoint).
  3. Changes to Argus code don't have to be coordinated with clients.
  4. Real-time event lookup in Argus, so we can check test status in real time (nice for long-running jobs) and have all the events available.

Problems:

  1. Future changes to event structure may affect event processing in Argus. This can be tackled by creating new event types instead of modifying existing ones, and teaching Argus to process the new event type.
  2. It will take some dev time, since we need to add processors for each event type, but it will pay off in the future with much less code across clients and less adaptation and schema tracking.
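The first problem above could be handled with a small handler registry keyed by event type. This is only a sketch under assumptions: the `handles` decorator, the `NodeReady`/`NodeReadyV2` event types, and their payload shapes are made up for illustration. Events with no registered processor are kept aside rather than rejected, so they can be replayed once a processor exists.

```python
from typing import Any, Callable, Dict, List

# A handler mutates the run state based on one event payload.
Handler = Callable[[Dict[str, Any], Dict[str, Any]], None]
HANDLERS: Dict[str, Handler] = {}


def handles(event_type: str) -> Callable[[Handler], Handler]:
    """Register a processor for one event type (hypothetical helper)."""
    def register(func: Handler) -> Handler:
        HANDLERS[event_type] = func
        return func
    return register


@handles("NodeReady")
def node_ready_v1(state: Dict[str, Any], data: Dict[str, Any]) -> None:
    state.setdefault("nodes", []).append(data["name"])


@handles("NodeReadyV2")
def node_ready_v2(state: Dict[str, Any], data: Dict[str, Any]) -> None:
    # The new type nests node details; the old type and its handler stay untouched.
    state.setdefault("nodes", []).append(data["node"]["name"])


def process(state: Dict[str, Any], event_type: str, data: Dict[str, Any],
            unprocessed: List[str]) -> None:
    """Dispatch one event; remember types that have no processor yet."""
    handler = HANDLERS.get(event_type)
    if handler is None:
        unprocessed.append(event_type)
        return
    handler(state, data)


state: Dict[str, Any] = {}
skipped: List[str] = []
process(state, "NodeReady", {"name": "node-1"}, skipped)
process(state, "NodeReadyV2", {"node": {"name": "node-2", "shards": 8}}, skipped)
process(state, "DiskUsage", {"percent": 71}, skipped)  # no processor registered yet
```

Old clients keep sending `NodeReady` while newer ones send `NodeReadyV2`, and neither has to know which Argus version is deployed.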

How to switch:

  1. Keep Argus as is; just make clients send events to the Argus DB in real time.
  2. Add basic event processing, e.g. make events visible in real time in runs.
  3. Add detailed processing for each event class and remove Argus API calls from clients one by one (e.g. get the Scylla version from an event and remove the corresponding API call from SCT).

k0machi commented 1 year ago

> Briefly, the idea behind event sourcing in Argus is to have clients (SCT, dtest, others) just send events to the Argus database. Based on these events, Argus can make decisions such as: test started, test ended, test failed, which versions were used, and so on; basically everything. Argus then builds separate tables for reads so the UI stays performant (CQRS). We already have an extensive event system in SCT, so the SCT code would not need much adaptation.

I see what you mean. That could work really well with SCT, provided the events also contain all the required information, but other frameworks are not as prepared for event sourcing, particularly driver matrix (I feel it would require touching a lot of places, starting from the driver repo and ending with the runner itself, to implement a proper event system there). As for SCT, there are a couple of missing events, and some events are not as descriptive as others (I don't think there are many events for getting a node ready, for example), so work will have to be done.

So far I've wanted to implement something like this separately for SCT: real-time event processing in addition to what we have right now. It could really help with tracking long-running jobs, as you've said, and reduce the amount of navigating around you need to do to truly understand what's going on right now, since the "created" and "running" statuses currently have a lot of dead air where you can't tell what's happening.

soyacz commented 1 year ago

> I see what you mean. That could work really well with SCT, provided the events also contain all the required information, but other frameworks are not as prepared for event sourcing, particularly driver matrix (I feel it would require touching a lot of places, starting from the driver repo and ending with the runner itself, to implement a proper event system there).

I don't think we need to migrate other frameworks to event sourcing. Instead of using different APIs, there will be one way of sending a semi-structured event: basically an event type, a test id, and data stored as JSON, pushed to the Argus DB. Argus will take care of what to do with it: which schema to use for the UI, which decisions to make (e.g. when to mark a test as finished), and so on.
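As a rough sketch of such a semi-structured event, the envelope below carries only the three fields mentioned above (event type, test id, JSON data). The `event_id` and `timestamp` fields, the `EventEnvelope` name, and the `DriverTestResult` example payload are my assumptions, added so that duplicates and ordering could be handled on the Argus side.

```python
import json
import time
import uuid
from dataclasses import asdict, dataclass, field
from typing import Any, Dict


@dataclass
class EventEnvelope:
    """Fixed outer shape; everything framework-specific lives in `data`."""
    event_type: str
    test_id: str
    data: Dict[str, Any]
    # Assumed extras, not part of the proposal: unique id and send time.
    event_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    timestamp: float = field(default_factory=time.time)

    def to_json(self) -> str:
        """Serialize the envelope for the single ingest endpoint or DB write."""
        return json.dumps(asdict(self), sort_keys=True)

    @classmethod
    def from_json(cls, payload: str) -> "EventEnvelope":
        """Rebuild the envelope on the Argus side."""
        return cls(**json.loads(payload))


# A framework without its own event system can still emit one event per result.
payload = EventEnvelope(
    event_type="DriverTestResult",
    test_id="run-42",
    data={"driver": "python", "passed": 118, "failed": 2},
).to_json()
restored = EventEnvelope.from_json(payload)
```

The client only needs to know the envelope; what goes into `data` can differ per framework and evolve without changing the sending code.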

> As for SCT, there are a couple of missing events, and some events are not as descriptive as others (I don't think there are many events for getting a node ready, for example), so work will have to be done.

Yes, we may be missing some data in events, but I'd assume anything worth pushing to Argus is worth pushing as an event. We can gradually add more events to SCT.

> So far I've wanted to implement something like this separately for SCT: real-time event processing in addition to what we have right now. It could really help with tracking long-running jobs, as you've said, and reduce the amount of navigating around you need to do to truly understand what's going on right now, since the "created" and "running" statuses currently have a lot of dead air where you can't tell what's happening.

That's true. An added benefit will be less hassle around schema and internal changes in Argus: we'll get better-decoupled systems.

If you see any other problems, let's organize a call so we can discuss in real time.