Open trentmc opened 5 months ago
I agree in general that all services (including different agents/bots) would benefit from having a lake that's always up-to-date. And having a process that's solely responsible for doing this is the way forward.
I think there might be other approaches, like "swapping tables" or updating a pointer to the latest table, that might be more productive to implement than locking.
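To make the "pointer to the latest table" idea concrete, here is a minimal sketch using only the standard library. The file layout (`predictions_v{N}.json`, a `LATEST` pointer file) and both function names are assumptions for illustration, not anything that exists in pdr-backend. The writer finishes a new table file first, then swaps the pointer with `os.replace`, so readers only ever see a complete table.

```python
import json
import os
import tempfile
from pathlib import Path


def publish_table(lake_dir: Path, rows: list, version: int) -> Path:
    """Write a new table file, then repoint LATEST at it (hypothetical layout)."""
    lake_dir.mkdir(parents=True, exist_ok=True)
    table_path = lake_dir / f"predictions_v{version}.json"
    table_path.write_text(json.dumps(rows))

    # Write the pointer to a temp file in the same directory, then swap it in.
    # os.replace is atomic on POSIX when both paths are on the same filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=lake_dir, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        f.write(table_path.name)
    os.replace(tmp_name, lake_dir / "LATEST")
    return table_path


def read_latest(lake_dir: Path) -> list:
    """Readers follow the pointer to whichever table was published last."""
    table_name = (lake_dir / "LATEST").read_text()
    return json.loads((lake_dir / table_name).read_text())
```

With this shape, readers never block on the writer. The cost is keeping old table files around until no reader is still using them.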
What I originally considered was building a base `table.py` object that would abstract the schema, return the df, point to a file, etc. The basic structure can be found in `table_pdr_predictions`, `table_pdr_subscriptions`. Anything that reads from the lake would do so through the `Table()` interface, not `DataFactory()`. This way, DataFactories operate on their own, updating the lake, while components/users access it via the interface.
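A rough sketch of what such a read-side interface could look like. Everything here is hypothetical: the class shape, the CSV storage, and the column names are assumptions, and a list of dicts stands in for the dataframe so the example runs on the standard library alone.

```python
import csv
from dataclasses import dataclass
from pathlib import Path


@dataclass
class Table:
    """Hypothetical read-side interface: owns the schema and the file location."""

    name: str
    schema: dict  # column name -> callable that parses the raw string value
    base_dir: Path

    @property
    def path(self) -> Path:
        return self.base_dir / f"{self.name}.csv"

    def rows(self) -> list:
        """Return rows typed per the schema; a missing file reads as empty."""
        if not self.path.exists():
            return []
        with self.path.open(newline="") as f:
            return [
                {col: cast(raw[col]) for col, cast in self.schema.items()}
                for raw in csv.DictReader(f)
            ]
```

Readers would only ever construct a `Table` and call `rows()`. Whatever process fills the file stays out of their way.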
DuckDB only lets you have 1 writer process at a time that holds the db writer connection. Within that process, you can then have multiple threads operating on it. So, for the duckdb "container/process/vm" we should make it as big as possible.
There is now a task for making sure that Lake/ETL has an "update process" that sits there indefinitely looping and updating the lake #1107
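For reference, the shape of such an update process could be as simple as the sketch below. `run_update_loop` and `update_once` are hypothetical names, not the implementation tracked in #1107. The stop event is there so the loop can be shut down cleanly and tested.

```python
import threading
from typing import Callable


def run_update_loop(
    update_once: Callable[[], None],
    stop: threading.Event,
    interval_s: float = 60.0,
) -> int:
    """Call update_once repeatedly until stop is set; return the number of runs."""
    n_runs = 0
    while not stop.is_set():
        update_once()
        n_runs += 1
        # wait() acts as an interruptible sleep: it returns early once stop is set.
        stop.wait(interval_s)
    return n_runs
```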
Forward Looking:
Background / motivation
Approach 0: each app writes lake, in its process. Predictoor bot updates the data lake, then uses the lake, in time for making predictions. Same for other apps in pdr-backend.
Approach 1: separate lake, started separately with `pdr lake`. It's constantly writing to the data lake.
But we can do better yet, leveraging the database concept of "locking", which enables >1 writers without hurting DB safety. Writers must handle contention due to locks, eg by waiting.
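As an illustration of "handle contention by waiting", here is a minimal application-level sketch built on a lock file. This is an assumption about how writers might coordinate, not a DuckDB feature. `write_lock` and its parameters are hypothetical names.

```python
import os
import time
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def write_lock(lock_path: Path, timeout_s: float = 10.0, poll_s: float = 0.05):
    """Hold an exclusive lock file; poll and wait while another writer holds it."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            # O_CREAT | O_EXCL fails if the file already exists, so only one
            # writer can create it at a time.
            fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            break
        except FileExistsError:
            if time.monotonic() >= deadline:
                raise TimeoutError(f"could not acquire {lock_path}")
            time.sleep(poll_s)
    try:
        yield
    finally:
        os.close(fd)
        os.unlink(lock_path)
```

A writer would wrap each lake update in `with write_lock(...)`. One known weakness of this sketch: a writer that crashes while holding the lock leaves a stale lock file behind, so a real version would need some form of expiry or owner check.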
Approach 2: allow >1 writers.
- `pdr lake` inside the app. Eg user starts one `pdr predictoor` process (and nothing else). The predictoor bot will detect whether a lake process is running, and start one if needed.
- `pdr lake` separately. Eg a user starts `pdr lake`, then 20 `pdr predictoor` processes, one for each feed to predict
- `pdr lake` process, and it starts >1 threads. Eg user starts 1 process, then later one, a different one with different goals. Eg >1 users start different processes
Approach 2 is endgame. The benefits compared to 1 are immense, let alone 0.
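One possible way for a bot to "detect whether a lake process is running, and start one if needed" is a PID file. The sketch below is an assumption, not an existing pdr-backend mechanism. `ensure_lake_running`, the PID file, and the `start_lake` launcher are all hypothetical, and the liveness check is POSIX-only.

```python
import os
from pathlib import Path
from typing import Callable


def pid_is_alive(pid: int) -> bool:
    """POSIX-only: signal 0 checks that the process exists without signalling it."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def ensure_lake_running(pid_file: Path, start_lake: Callable[[], int]) -> int:
    """Return the PID of a live lake process, starting one if none is found."""
    if pid_file.exists():
        text = pid_file.read_text().strip()
        if text.isdigit() and int(text) > 0 and pid_is_alive(int(text)):
            return int(text)
    pid = start_lake()  # hypothetical launcher, eg a wrapper around subprocess.Popen
    pid_file.write_text(str(pid))
    return pid
```

Note that two bots starting at the same moment could both decide no lake is running. The check-and-start step would need to be guarded by a lock to be safe.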
Q: Should we go from 0 -> 1 -> 2, or 0->2 directly?
TODOs
- `pdr lake` separately (at the end of the README)