project-lux / data-pipeline

Data pipeline to harvest, transform, reconcile, enrich and export Linked Art data for LUX (or other system)
Apache License 2.0

Make architecture more coherent for data storage #28

Open azaroth42 opened 6 months ago

azaroth42 commented 6 months ago

At the moment we have caches of records in Postgres (fine) and big tables of lookups in Redis (fine), but we also have SQLite for label- and URI-based reconciliation, and then some URI reconciliation tables in Redis as well. SQLite is just another SQL implementation ... so it could equally well migrate into Postgres?

Figure out where the different types of data lookups should live, and then fix the implementation to do that.
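To make the current split concrete, here is a minimal sketch of the same kind of lookup going through Redis in one case and SQLite in the other; the key names, file name and table layout are illustrative, not the pipeline's actual schema:

```python
# Illustrative sketch of the current lookup split; names are hypothetical.
import sqlite3
import redis

r = redis.Redis(host="localhost", port=6379)

def lookup_uri_redis(uri: str):
    # URI reconciliation table held in Redis as a plain key -> value mapping.
    return r.get(f"uri:{uri}")

def lookup_label_sqlite(label: str):
    # Label-based reconciliation held in a separate SQLite file; an
    # equivalent table and query would work just as well in Postgres.
    conn = sqlite3.connect("reconciliation.sqlite")
    cur = conn.execute("SELECT uri FROM labels WHERE label = ?", (label,))
    row = cur.fetchone()
    conn.close()
    return row[0] if row else None
```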

azaroth42 commented 5 months ago

For data currently managed in SQLite, we should see significant performance gains from using LMDB (via lmdbm): https://pypi.org/project/lmdbm/

For 1M random reads, LMDB took 10 seconds, whereas SQLite took 447 seconds.
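A minimal sketch of the dict-like lmdbm interface, assuming `pip install lmdbm`; the file name and keys are illustrative, and keys/values are raw bytes unless Lmdb is subclassed to add an encoding:

```python
# Hypothetical lmdbm usage; keys and values are bytes by default.
from lmdbm import Lmdb

with Lmdb.open("reconcile.lmdb", "c") as db:     # "c" = create if missing
    db[b"label:Example Person"] = b"https://example.org/person/123"
    print(db.get(b"label:Example Person"))       # b'https://example.org/person/123'
```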

azaroth42 commented 5 months ago

Storage proposal:

Rationale:

Multiple processes writing to LMDB can cause the file size (and thus memory footprint) to grow constantly, as the different writers don't have access to each other's pages. So we can't easily use it as a Redis replacement. It's designed for values up to 2KB, so it's also not good for documents, as larger values cause paging across single values. It has set_range() (à la BerkeleyDB) and hence can cursor through keys in order, which will let the missing-death-dates processing work again.
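As a sketch of the set_range() point, here is a hedged example using the lower-level py-lmdb binding that lmdbm wraps; the key layout (a `death_date:` prefix with empty values for unknown dates) is an assumption for illustration only:

```python
# Illustrative range scan with py-lmdb's cursor.set_range(); key scheme is hypothetical.
import lmdb

env = lmdb.open("people.lmdb", map_size=2**30)   # 1 GB map size; tune as needed
with env.begin() as txn:
    cursor = txn.cursor()
    prefix = b"death_date:"
    if cursor.set_range(prefix):                 # jump to first key >= prefix
        for key, value in cursor:
            if not key.startswith(prefix):
                break
            if value == b"":                     # record with no death date yet
                print("missing death date:", key.decode())
```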

Redis is very fast and supports multiple writers and readers. It could be distributed over the network if needed, rather than kept local.
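For comparison, a small sketch of the multi-writer pattern in Redis (key names are hypothetical); every worker process can open its own connection and read or write without the single-writer constraint LMDB has:

```python
# Hypothetical Redis-backed reconciliation lookup; safe for many concurrent workers.
import redis

r = redis.Redis(host="localhost", port=6379)

def record_equivalent(src_uri: str, canonical_uri: str) -> None:
    r.set(f"equiv:{src_uri}", canonical_uri)     # any worker may write

def resolve(src_uri: str):
    value = r.get(f"equiv:{src_uri}")
    return value.decode() if value else None
```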

Postgres is good at dealing with large volumes of data (1.5 TB currently) and has native JSON(B) support. It's easy to add metadata fields and indexes. We could swap it for CouchDB, Couchbase, CrateDB, MongoDB or others ... but there's no reason to believe there'd be significant performance gains from doing so. It can be distributed or local as needed.
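A rough sketch, assuming psycopg2 and a `lux` database, of what a JSONB record cache with an extra metadata column and a GIN index might look like; the table and column names are illustrative, not the pipeline's actual schema:

```python
# Hypothetical JSONB cache table in Postgres with a metadata column and index.
import psycopg2
from psycopg2.extras import Json

conn = psycopg2.connect("dbname=lux")
with conn, conn.cursor() as cur:
    cur.execute("""
        CREATE TABLE IF NOT EXISTS record_cache (
            identifier  TEXT PRIMARY KEY,
            data        JSONB NOT NULL,
            fetched_at  TIMESTAMPTZ DEFAULT now()   -- metadata fields are cheap to add
        )
    """)
    cur.execute("CREATE INDEX IF NOT EXISTS idx_record_cache_data "
                "ON record_cache USING GIN (data)")
    cur.execute(
        "INSERT INTO record_cache (identifier, data) VALUES (%s, %s) "
        "ON CONFLICT (identifier) DO UPDATE SET data = EXCLUDED.data",
        ("https://example.org/object/1", Json({"type": "HumanMadeObject"})),
    )
```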

Changes to make:

No Change: