sid-project / sid

Storage Instantiation Daemon
https://sid-project.github.io

RFC: Possible architecture simplification #119

Closed · tasleson closed this issue 2 years ago

tasleson commented 2 years ago

After looking at https://github.com/sid-project/sid/pull/114 and reading comments on that thread, I'm wondering if something like this would work?

  1. SID gets executed
  2. SID immediately forks (independent of the double fork needed to become a daemon) before resources etc. are brought in/created
  3. Parent is the process that contains the in-memory DB and the main systemd event loop
  4. Child process handles udev events, forks and execs the modules which act as clients against the DB using a client/server RPC API.

This avoids multiple instances of the DB, the need to merge updates between database instances, duplication of the entire systemd event loop, and the problem of cleaning up the inherited call stack. Concurrency on the DB is managed by the parent process through its handling of the RPC API. Also, depending on how the API is designed/written, you could possibly even get away without requiring a fully ACID-compliant DB. If the persistence part is only for hints, I think this could be workable.
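
To make the shape of this concrete, here's a minimal sketch of the proposed split (the two stub functions are hypothetical placeholders, not anything existing in SID):

```c
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical stand-ins for the real work. */
static void run_db_and_event_loop(void) { pause(); } /* parent: in-memory DB + main event loop + RPC server */
static void handle_udev_events(void)    { pause(); } /* child: udev listener; forks+execs modules as RPC clients */

int main(void)
{
	/* Fork early, before the event loop, DB and other resources exist,
	 * so the child inherits (almost) nothing that needs cleaning up. */
	pid_t pid = fork();

	if (pid < 0) {
		perror("fork");
		return EXIT_FAILURE;
	}

	if (pid > 0)
		run_db_and_event_loop();
	else
		handle_udev_events();

	return EXIT_SUCCESS;
}
```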

prajnoha commented 2 years ago

Well, the primary reason for pursuing the current logic where we have an instance of the DB ("snapshot") in each worker process besides the main instance of the DB is that:

* We want to have a consistent view of the whole database.

* We don't know in advance which keys/records we are going to use during processing, hence we don't know in advance which locks to take even if db locking could be per-record.

* We can't take a global lock because that way we'd practically completely serialize the processing of the events - this is something we have to avoid.

* This includes exposing udev environment as records within udev namespace (SID db has 4 namespaces - `U`dev, `D`evice, `M`odule, `G`lobal - these are simply prefixes for db keys).

The issue with the left-over call stack in the forked process - that's something I haven't yet looked into in much detail. But I still believe there must be a way to get rid of it. Of course, this requires some support from the event loop library too - at least the possibility to clean up/close all the inherited resources.


(If we wanted to have only one DB, without real snapshots, we might as well put a DB file inside a ramfs like /run and then open/map it in each worker process, taking proper locks when reading/updating the DB. So we could probably even avoid the RPC. But there are those requirements noted above which led me to follow the "separate snapshot DB for each worker" way.)
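
For illustration only, a rough sketch of that ramfs-file alternative (the path and fixed size are hypothetical; record layout and most error handling are omitted):

```c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/file.h>
#include <sys/mman.h>
#include <unistd.h>

#define DB_PATH "/run/sid/db"   /* hypothetical location on a ramfs/tmpfs */
#define DB_SIZE 4096            /* fixed size, just for the sketch */

int main(void)
{
	int fd = open(DB_PATH, O_RDWR | O_CREAT, 0600);

	if (fd < 0 || ftruncate(fd, DB_SIZE) < 0) {
		perror(DB_PATH);
		return 1;
	}

	/* Each worker maps the same shared file... */
	char *db = mmap(NULL, DB_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (db == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	/* ...and takes a proper lock around any read/update. */
	flock(fd, LOCK_EX);
	strcpy(db, "some record");
	flock(fd, LOCK_UN);

	munmap(db, DB_SIZE);
	close(fd);
	return 0;
}
```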

tasleson commented 2 years ago

I think the disparity here is that we have differing views about what a "DB" is. When I see DB, I think of an actual transactional database management system (MySQL, PostgreSQL, Oracle, MS SQL etc.). When I read: https://github.com/sid-project/sid/issues/118#issue-1121825751 I see that "DB" does not fit what I consider an actual database. I now understand the problems you're trying to work around by using something like tkrzw.

Well, the primary reason for pursuing the current logic where we have an instance of the DB ("snapshot") in each worker process besides the main instance of the DB is that:

* We want to have a consistent view of the whole database.

An actual database provides "Read Consistency"; in fact, many support a number of different levels/variations of this.

* We don't know in advance which keys/records we are going to use during processing, hence we don't know in advance which locks to take even if db locking could be per-record.

With an actual database you don't specify locks. That's handled for you, with whatever is needed to fulfill the constraints of the DB. Better DB implementations provide row-level lock granularity, but you need not worry about it.

* We can't take a global lock because that way we'd practically completely serialize the processing of the events - this is something we have to avoid.

Again, not needed with an actual database.

* This includes exposing udev environment as records within udev namespace (SID db has 4 namespaces - `U`dev, `D`evice, `M`odule, `G`lobal - these are simply prefixes for db keys).

In database design, this would violate 1st normal form.

The issue with left-over call stack in forked process - that's something I haven't yet looked into in much detail. But I still believe there must be a way how to get rid of it. Of course, this requires some support from the event loop library too - at least having a possibility to clean up/close all the inherited resources.

IMHO it's still better to avoid this situation to begin with.

There are small embeddable DBs that can give you a lot of functionality without a lot of overhead, which I believe would solve all of the things you're trying to work around by using tkrzw. For example, sqlite. Perhaps something to consider?

prajnoha commented 2 years ago

I think the disparity here is that we have differing views about what a "DB" is. When I see DB, I think of an actual transactional database management system (MySQL, PostgreSQL, Oracle, MS SQL etc.). When I read: #118 (comment) I see that "DB" does not fit what I consider an actual database. I now understand the problems you're trying to work around by using something like tkrzw.

OK, we can call it a "store" if you prefer and if that is less confusing. Actually, the wrapping resource in SID is called that way.

Well, the primary reason for pursuing the current logic where we have an instance of the DB ("snapshot") in each worker process besides the main instance of the DB is that:

* We want to have a consistent view of the whole database.

An actual database provides "Read Consistency"; in fact, many support a number of different levels/variations of this.

Sure, that is clear from theory... But I need to consider "write" as well. And here, inside usual databases, either the whole database is locked or more granular locks are held, but at the very least the record itself is locked. And then anyone else trying to access it would simply need to wait for the lock to be released.

The processing of an event (which is always related to a single device) in SID consists of several phases, and in each phase different modules can process records (add, update, remove). The set of records which can be changed by a certain module is quite limited. If there's a record which a module shares with other modules, it is protected by ownership, or the operation itself just adds/removes the single device that is being processed to/from a shared list. The only exception is future devices that a certain subsystem is waiting on based on scanned metadata - but they're not real devices yet, just something we expect to appear and whose real events we wait for. Events for the same device can never be processed in parallel, they are always queued - this is assured even at a lower level, in udevd itself (and SID is layered on top).

While the processing in SID takes several phases, switching which modules execute, we need to consider it a long-running operation with respect to store access. Now, each module can read and also write. These writes must be visible within the processing of the single concrete event, during all the phases and among all the modules, and still be separated from the processing of events of any other device that may be happening in parallel. I don't want to take a long-running "write" transaction (which in turn, inside the db, would take the form of a lock) that would block others - I really can't do this.

So to resolve this, I either need to serialize the event processing (take a long-running write transaction, i.e. a lock, for the whole processing of an event), which is exactly what I want to avoid, or give each worker its own snapshot of the store, collect the changes it makes and sync them back at the end of the processing.

I decided to take the "snapshot + collecting changes + syncing at the end of processing" approach. I use an in-memory store that is automatically snapshotted after fork. And that is, in its essence, a COW, so quite efficient.

Note that we do have a way to detect whether there was an incoming change between taking the snapshot and syncing the collected changes - there are indicators for this (event sequence numbers, generation numbers I'm wrapping each record with...). So we can still detect that there was a change in between and react to it (in the worst case, repeating the processing of the event). But as I said, the set of records which can be changed by processing one single concrete event for a concrete device is limited, so that situation is minimized.
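
A stripped-down sketch of that flow, assuming a trivial stand-in store and changeset (none of these structures or the pipe protocol are SID's actual code):

```c
#include <stdio.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical in-memory store: a single record plus a generation counter. */
struct store  { unsigned gen; char value[64]; };
struct change { unsigned based_on_gen; char new_value[64]; };

static struct store main_store = { .gen = 1, .value = "old" };

int main(void)
{
	int pipefd[2];

	if (pipe(pipefd) < 0)
		return 1;

	pid_t pid = fork();	/* the child gets a COW snapshot of main_store */
	if (pid < 0)
		return 1;

	if (pid == 0) {
		/* Worker: process the event against the snapshot, collect the
		 * resulting changes and send them back at the very end. */
		close(pipefd[0]);
		struct change c = { .based_on_gen = main_store.gen };
		snprintf(c.new_value, sizeof(c.new_value), "new");
		if (write(pipefd[1], &c, sizeof(c)) != (ssize_t)sizeof(c))
			_exit(1);
		_exit(0);
	}

	/* Main process: receive the collected changes and sync them in,
	 * using the generation number to detect an interleaving change. */
	close(pipefd[1]);
	struct change c;
	if (read(pipefd[0], &c, sizeof(c)) != (ssize_t)sizeof(c))
		return 1;
	waitpid(pid, NULL, 0);

	if (c.based_on_gen == main_store.gen) {
		strcpy(main_store.value, c.new_value);
		main_store.gen++;
	} else {
		fprintf(stderr, "store changed in between, re-process the event\n");
	}

	printf("value=%s gen=%u\n", main_store.value, main_store.gen);
	return 0;
}
```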

* We don't know in advance which keys/records we are going to use during processing, hence we don't know in advance which locks to take even if db locking could be per-record.

With an actual database you don't specify locks. That's handled for you, with whatever is needed to fulfill the constraints of the DB. Better DB implementations provide row-level lock granularity, but you need not worry about it.

Of course. I meant the locks the DB itself would take internally to perform the transaction. I do need to worry about how the DB does this and how it would lock, because I have events coming in/being processed in parallel and I really can't afford to wait for one transaction to complete before starting another one - I can't serialize this way - that goes against the nature of the event processing. The worst case is with databases which simply lock the complete DB while a write transaction is held. Simply, we need to look at this from a different perspective than the usual rigid transactional and relational DB way. Sure, some databases support more granular locks, but the locks are still there and usually such DBs with fine-grained locking are the more advanced ones, which are not really embeddable with a tiny footprint.

* We can't take a global lock because that way we'd practically completely serialize the processing of the events - this is something we have to avoid.

Again, not needed with an actual database.

...yeah, as noted above, I didn't mean taking the lock myself, I meant the database itself taking a global lock...

* This includes exposing udev environment as records within udev namespace (SID db has 4 namespaces - `U`dev, `D`evice, `M`odule, `G`lobal - these are simply prefixes for db keys).

In database design, this would violate 1st normal form.

I'm not considering use of a relational database. Also, the values which I store can be either single values or vectors. There's an API in SID's kv-store.h for working with vector values. The layer above that which uses this right now (the code in ubridge.c handling the processing of incoming events) adds logic to detect changes even for the vector values and to collect differences while the snapshot of the store is taken.
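
Just to make the namespace idea concrete, a hypothetical illustration of how such prefixed keys and single/vector values could look (this is not the actual kv-store.h API, and the `:` separator is only an assumption):

```c
#include <stdio.h>

/* Namespaces are just single-character key prefixes:
 * U = udev, D = device, M = module, G = global. */
static void compose_key(char *buf, size_t size, char ns, const char *name)
{
	snprintf(buf, size, "%c:%s", ns, name);
}

int main(void)
{
	char key[64];

	/* A single value under the udev namespace. */
	compose_key(key, sizeof(key), 'U', "ID_FS_TYPE");
	printf("%s = %s\n", key, "xfs");

	/* A vector value under the device namespace, e.g. a list of holders. */
	const char *holders[] = { "dm-0", "dm-1" };
	compose_key(key, sizeof(key), 'D', "holders");
	printf("%s = [", key);
	for (size_t i = 0; i < sizeof(holders) / sizeof(holders[0]); i++)
		printf("%s%s", i ? ", " : "", holders[i]);
	printf("]\n");

	return 0;
}
```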

The issue with the left-over call stack in the forked process - that's something I haven't yet looked into in much detail. But I still believe there must be a way to get rid of it. Of course, this requires some support from the event loop library too - at least the possibility to clean up/close all the inherited resources.

IMHO it's still better to avoid this situation to begin with.

Yeah, that would be ideal. Unfortunately, I wasn't aware that the systemd event loop library can't do at least the cleanup after fork. I started using it because I'm already linking against udev (which is in systemd's hands too), so it seemed natural to me to use that event loop library. And it had all the hooks I needed. Then I noticed there was some quirk when trying to call these cleanups, but I had to continue with other things and couldn't spend much time inspecting and delving into this. When I had time for that, I noticed that it's refusing to clean up after fork because all the functions compare the PID under which the structure got created with the current PID.

For me, just closing the inherited FDs after fork, resetting the state and exiting the call stack properly would be enough. Which is, I think, something it should be able to do...

(Then, just out of curiosity, I looked at how udevd does that, because it spawns processes too. Here, the main loop uses the systemd event loop and then, at a certain point with a considerable call stack, it forks. It keeps the inherited call stack in the child process and creates its own new simple epoll (or whatever else it was) event loop there. So even a project like that doesn't care about the inherited call stack - which I'm not saying I like, it's just that they hit their own restrictions, I think...)
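
A rough outline of that udevd-style pattern as described above (heavily simplified, not the actual udevd code):

```c
#include <sys/epoll.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
	/* Imagine we are deep inside a callback of the main (sd-event based)
	 * loop, with a considerable call stack, and we fork a worker here. */
	pid_t pid = fork();
	if (pid < 0)
		return 1;

	if (pid == 0) {
		/* Child: the inherited call stack and event loop are left as
		 * they are; the worker just creates its own small epoll loop
		 * and works only with that. */
		int ep = epoll_create1(EPOLL_CLOEXEC);
		if (ep < 0)
			_exit(1);

		struct epoll_event ev = { .events = EPOLLIN, .data.fd = STDIN_FILENO };
		epoll_ctl(ep, EPOLL_CTL_ADD, STDIN_FILENO, &ev);

		struct epoll_event out;
		epoll_wait(ep, &out, 1, 100 /* ms */);	/* the worker's own tiny loop */

		close(ep);
		_exit(0);
	}

	/* Parent: carries on with its original event loop. */
	waitpid(pid, NULL, 0);
	return 0;
}
```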

There are small embeddable DBs that can give you a lot of functionality without a lot of overhead, which I believe would solve all of the things you're trying to work around by using tkrzw. For example, sqlite. Perhaps something to consider?

I considered sqlite at the very beginning, well, because I was raised with relational transactional databases like that and it was just the first thing I could think of. But as I realized later, the rigid form of a relational DB doesn't quite suit my needs. The records I need to store don't have rigid pre-defined schemas. For example, each module can store any information, so the design would need to count on the modules providing their own db schemas. Also, values can be vectors (sets/lists...) of different sizes. And that would get really complicated. A key-value store is something that fits this much better (...well, a graph database would probably fit even more, but when trying to look for some small embeddable one, I wasn't really successful). Also, you have a much better chance of really finding an embeddable small-footprint db in the area of KV "stores".

Also, as for sqlite, not sure if I recall it correctly now, but isn't it one of those that take a simple global lock internally for any write transaction?

The tkrzw - well, not a workaround at all. I'm just looking for a KV store that can provide me a clean API for storing and retrieving values, with the store being in-memory (because of the snapshot I'd like to take, where the easiest approach is to simply reuse the COW feature of fork) and being able to do range selections and atomic updates (for me, "updates" means "synchronizing the changes sent from snapshots into the main store"). The last two things are something I don't currently have with the simple straightforward hash backend where the keys are shuffled. Tkrzw provides a B-tree backend that keeps the keys ordered and hence allows for range selections and basic atomicity (...and "tokyo cabinet" and "kyoto cabinet", to which it is the successor, are quite well known in the embeddable KV store world, just like sqlite in the relational database world).
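
To illustrate why the ordered backend matters for those range selections, a small standalone example (not tied to tkrzw's API) showing a prefix selection over sorted keys:

```c
#include <stdio.h>
#include <string.h>

int main(void)
{
	/* Keys as an ordered (B-tree-like) backend would keep them: sorted. */
	const char *keys[] = {
		"D:sda:size", "D:sda:type", "D:sdb:size",
		"G:version",  "M:dm:foo",   "U:sda:ID_FS_TYPE",
	};
	const size_t n = sizeof(keys) / sizeof(keys[0]);
	const char *prefix = "D:sda:";
	const size_t plen = strlen(prefix);

	/* With sorted keys, a prefix/range selection is one contiguous run:
	 * skip to where the run starts, then stop at the first key past it.
	 * With a hash backend the keys are shuffled, so the only option is
	 * a full scan of everything. */
	size_t i = 0;
	while (i < n && strncmp(keys[i], prefix, plen) < 0)
		i++;
	for (; i < n && strncmp(keys[i], prefix, plen) == 0; i++)
		printf("%s\n", keys[i]);

	return 0;
}
```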

prajnoha commented 2 years ago

dbm-like KV stores

prajnoha commented 2 years ago

The hash was used as a KV store backend for the prototype, with the intention for it to be replaced/extended by a backend providing the other features needed (like the range selections now). The project is evolving in this step-by-step manner because the development is done by me and just a few (1-2) volunteers who are coming (...and going :)).

tasleson commented 2 years ago

For example, each module can store any information, so the design would need to count on the modules providing their own db schemas. Also, values can be vectors (sets/lists...) of different sizes. And that would get really complicated.

Sqlite supports a blob type; you could put whatever you wanted in it to give modules the ability to store arbitrary data.

Also, as for sqlite, not sure if I recall it correctly now, but isn't it one of those that take a simple global lock internally for any write transaction?

It's a little more advanced than that, but not as advanced as the larger, more complex implementations which offer finer-grained locking.
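
For reference, a minimal sketch of the blob-backed key-value use of sqlite suggested above, via its C API (the schema and file path are hypothetical; WAL mode lets readers proceed alongside the single writer, but there is still only one writer at a time):

```c
#include <sqlite3.h>

int main(void)
{
	sqlite3 *db;
	sqlite3_stmt *stmt;

	if (sqlite3_open("/tmp/sid-sketch.db", &db) != SQLITE_OK)
		return 1;

	/* WAL mode: readers no longer block on the (single) writer. */
	sqlite3_exec(db, "PRAGMA journal_mode=WAL;", NULL, NULL, NULL);

	/* One generic table; module-specific data goes into the blob. */
	sqlite3_exec(db,
		     "CREATE TABLE IF NOT EXISTS kv ("
		     "  ns    TEXT NOT NULL,"
		     "  key   TEXT NOT NULL,"
		     "  value BLOB,"
		     "  PRIMARY KEY (ns, key));",
		     NULL, NULL, NULL);

	const char payload[] = "arbitrary module data";
	sqlite3_prepare_v2(db,
			   "INSERT OR REPLACE INTO kv (ns, key, value) VALUES (?, ?, ?);",
			   -1, &stmt, NULL);
	sqlite3_bind_text(stmt, 1, "M", -1, SQLITE_STATIC);
	sqlite3_bind_text(stmt, 2, "dm:foo", -1, SQLITE_STATIC);
	sqlite3_bind_blob(stmt, 3, payload, sizeof(payload), SQLITE_STATIC);
	sqlite3_step(stmt);
	sqlite3_finalize(stmt);

	sqlite3_close(db);
	return 0;
}
```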

I'm having a hard time wrapping my head around all the focus on concurrency concerns. This problem seems IO-bound to me, and having anything block for any length of time would only be caused by holding a transaction open while interrogating disk signatures, which I think you could avoid. My initial thought is that this problem would be well suited to an async/await solution.

prajnoha commented 2 years ago

For example, each module can store any information, so the design would need to count on the modules providing their own db schemas. Also, values can be vectors (sets/lists...) of different sizes. And that would get really complicated.

Sqlite supports a blob type; you could put whatever you wanted in it to give modules the ability to store arbitrary data.

...and with that, I'd be losing most of the benefits of a relational database - it would boil down to a key-value store - which is what I'm using right now.

Also, as for sqlite, not sure if I recall it correctly now, but isn't it one of those that take a simple global lock internally for any write transaction?

It's a little more advanced than that, but not as advanced as the larger, more complex implementations which offer finer-grained locking.

Ah, yes, I remember reading this once... But still, the locking is there. That is perfectly fine and understandable, it's just not what I'm looking for.

I'm having a hard time wrapping my head around all the focus on concurrency concerns. This problem seems IO-bound to me, and having anything block for any length of time would only be caused by holding a transaction open while interrogating disk signatures, which I think you could avoid.

It's not just about holding a transaction while the disk is scanned, and actually it's not only about the scanning part (though this is an important one).

The infrastructure in SID aims to provide a consistent view throughout all the processing phases of a single event, where we switch into different modules. Scanning is only one part of that, but there might be logic attached which results in storing/accessing the info and sharing it in between calls out to modules. For example: blkid does a basic device scan and writes the result to the store (to the snapshot only at this moment); based on this info we switch into a module handling the specific subsystem, which in turn can do more scanning of the current device or read associated config and, based on that, write more information (again, to the snapshot) and/or read other properties which can trigger further actions; and at the very end, we take the collected changes and sync them with the main store.

So I don't want the store state to be interleaved among the events being processed. It's one event --> one snapshot of the store --> one set of changes with respect to the state at which the snapshot was taken --> one sync point for the changes collected from processing the event throughout all phases and modules. This makes it easy to follow and to see the change of state in one go for each event.

My initial thought is that this problem would be well suited to an async/await solution.

You mean initiating a scan from a module and then waiting for it while other modules do other things? Then, a bit later, receiving the results and writing them to the store? Well, the thing is, the information could be passed from one module to another, or the records can be a deciding point for the next steps to take (like that blkid info). That would be hard to manage, if manageable at all. If it were just the scanning and storing and nothing else, then OK, but there are other actions associated with the results, chained in a way...

tasleson commented 2 years ago

My initial thought is that this problem would be well suited to an async/await solution.

You mean initiating a scan from a module and then waiting for it while other modules do other things?

I'm referring to: https://en.wikipedia.org/wiki/Async/await . However, that requires language support.