oceanprotocol / pdr-backend

Instructions & code to run predictoors, traders, more.
Apache License 2.0

[Lake, UX] Lake supports >1 writers, incl predictoor bots and >1 pdr lakes #564

Open trentmc opened 5 months ago

trentmc commented 5 months ago

Background / motivation

Approach 0: each app writes the lake, in its own process. The predictoor bot updates the data lake, then uses it, in time for making predictions. The same holds for other apps in pdr-backend.

Approach 1: separate lake, started separately

But we can do better yet, by leveraging the database concept of "locking", which enables >1 writer without hurting DB safety. Writers must handle contention due to locks, e.g. by waiting and retrying.

Approach 2: allow >1 writers.

Approach 2 is the endgame. The benefits compared to Approach 1 are immense, let alone Approach 0.

Q: Should we go from 0 -> 1 -> 2, or 0 -> 2 directly?
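To make the "handle contention by waiting" idea concrete, here's a minimal sketch of an Approach-2 writer. It's hypothetical, not pdr-backend code, and uses stdlib sqlite3 as a stand-in for the lake's database; the helper name and file layout are assumptions:

```python
import sqlite3
import time

def write_with_retry(db_path, sql, params=(), retries=5, wait_s=0.1):
    """Write to the lake DB, retrying if another writer holds the lock.

    Hypothetical helper illustrating Approach 2: >1 writer is allowed,
    and each writer handles lock contention by backing off and retrying.
    """
    for attempt in range(retries):
        try:
            con = sqlite3.connect(db_path, timeout=0.0)  # fail fast on lock
            try:
                con.execute(sql, params)
                con.commit()
                return True
            finally:
                con.close()
        except sqlite3.OperationalError:  # e.g. "database is locked"
            time.sleep(wait_s * (attempt + 1))  # back off, then retry
    return False
```

With this pattern, two bots writing the same lake concurrently don't corrupt it: the loser of a lock race simply waits and retries instead of failing.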

TODOs

idiom-bytes commented 5 months ago

I agree in general that all services (including different agents/bots) would benefit from having a lake that's simply kept up-to-date. Having a process that's solely responsible for doing this is the way forward.

I think there might be other approaches, like "swapping tables" or updating a pointer to the latest table, that might be more productive to implement than locking.
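One way the "pointer to the latest table" idea could work, sketched with stdlib only (all file names and the JSON format are illustrative assumptions, not pdr-backend's actual layout): write each new table version to a fresh file, then atomically swap a small pointer file via `os.replace()`, so readers never observe a half-written table:

```python
import json
import os

def publish_table(data_dir, table_name, version, rows):
    """Write a new table version, then atomically swap the 'latest' pointer.

    Hypothetical layout: rows go to <table>_v<N>.json; a pointer file
    <table>.latest names the current version. os.replace() is atomic,
    so readers see either the old version or the new one, never a mix.
    """
    table_path = os.path.join(data_dir, f"{table_name}_v{version}.json")
    with open(table_path, "w") as f:
        json.dump(rows, f)

    tmp = os.path.join(data_dir, f"{table_name}.latest.tmp")
    with open(tmp, "w") as f:
        f.write(table_path)
    os.replace(tmp, os.path.join(data_dir, f"{table_name}.latest"))

def read_latest(data_dir, table_name):
    """Follow the pointer to the current table version."""
    with open(os.path.join(data_dir, f"{table_name}.latest")) as f:
        table_path = f.read()
    with open(table_path) as f:
        return json.load(f)
```

The trade-off versus locking: readers and the writer never block each other, at the cost of keeping old table versions around until they're garbage-collected.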

What I originally considered was building a base table.py object that would abstract the schema, return the df, point to a file, etc. The basic structure can be found in table_pdr_predictions and table_pdr_subscriptions. Anything that reads from the lake would do so through the Table() interface, not DataFactory(). This way, DataFactories operate on their own, updating the lake, while components/users access it via the interface.
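A minimal sketch of such a Table() interface, under stated assumptions: all names here are illustrative rather than pdr-backend's actual API, and CSV files stand in for the lake's real storage format for brevity:

```python
import csv
import os

class Table:
    """Sketch of a lake-table interface: abstracts the schema, the file
    location, and row access, so readers never touch a DataFactory.
    Names and storage format are illustrative, not pdr-backend's API."""

    def __init__(self, name, schema, lake_dir):
        self.name = name
        self.schema = schema  # column names, e.g. ["timestamp", "pair"]
        self.filepath = os.path.join(lake_dir, f"{name}.csv")

    def load(self):
        """Return all rows as dicts; empty list if the table doesn't exist yet."""
        if not os.path.exists(self.filepath):
            return []
        with open(self.filepath, newline="") as f:
            return list(csv.DictReader(f))

    def append(self, rows):
        """Append rows (dicts keyed by schema). In the design described
        above, only the DataFactory side would ever call this."""
        new_file = not os.path.exists(self.filepath)
        with open(self.filepath, "a", newline="") as f:
            w = csv.DictWriter(f, fieldnames=self.schema)
            if new_file:
                w.writeheader()
            w.writerows(rows)
```

The point of the split is the one described above: DataFactories own writes, while every reader goes through `Table.load()` and stays insulated from how the lake gets updated.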

idiom-bytes commented 1 week ago

DuckDB only allows 1 writer process at a time to hold the DB writer connection. Within that process, you can have multiple threads operating on it. So we should make the DuckDB "container/process/vm" as big as possible.
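The single-writer-process, many-threads pattern can be sketched with the stdlib alone (an in-memory dict stands in for the DuckDB connection; the class and method names are hypothetical): worker threads never write directly, they funnel updates through a queue to the one thread that owns the writer connection:

```python
import queue
import threading

class SingleWriter:
    """Sketch of DuckDB's constraint: many threads produce updates, but
    only one thread holds the write connection. The dict stands in for
    the actual DB connection; names are illustrative."""

    def __init__(self):
        self.q = queue.Queue()
        self.store = {}  # stand-in for the lake DB
        self.thread = threading.Thread(target=self._run, daemon=True)
        self.thread.start()

    def _run(self):
        while True:
            item = self.q.get()
            if item is None:  # shutdown sentinel
                break
            key, value = item
            self.store[key] = value  # only this thread ever writes

    def submit(self, key, value):
        """Callable from any thread; never touches the store directly."""
        self.q.put((key, value))

    def close(self):
        """Drain remaining updates, then stop the writer thread."""
        self.q.put(None)
        self.thread.join()
```

Because the queue serializes all updates into one writer thread, there is never write contention on the connection itself, which matches the "one big DuckDB process" direction described above.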

There is now a task for making sure that Lake/ETL has an "update process" that sits there indefinitely looping and updating the lake #1107

Forward Looking: