project-lux / data-pipeline

Data pipeline to harvest, transform, reconcile, enrich and export Linked Art data for LUX (or other system)
Apache License 2.0

Make architecture more coherent for data storage #28

Open azaroth42 opened 6 months ago

azaroth42 commented 6 months ago

At the moment we have caches of records in Postgres (fine) and big tables of lookups in Redis (fine), but we also have SQLite for label- and URI-based reconciliation, and then some URI reconciliation tables in Redis as well. SQLite is just another SQL implementation ... so it could equally well migrate into Postgres?

Figure out where the different types of data lookups should live, and then fix the implementation to do that.
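To make the current split concrete, here is a minimal sketch of the same kind of lookup going through Redis in one case and SQLite in the other; the key names, file name and table layout are illustrative, not the pipeline's actual schema:

```python
# Illustrative sketch of the current lookup split; names are hypothetical.
import sqlite3
import redis

r = redis.Redis(host="localhost", port=6379)

def lookup_uri_redis(uri: str):
    # URI reconciliation table held in Redis as a plain key -> value mapping.
    return r.get(f"uri:{uri}")

def lookup_label_sqlite(label: str):
    # Label-based reconciliation held in a separate SQLite file; an
    # equivalent table and query would work just as well in Postgres.
    conn = sqlite3.connect("reconciliation.sqlite")
    cur = conn.execute("SELECT uri FROM labels WHERE label = ?", (label,))
    row = cur.fetchone()
    conn.close()
    return row[0] if row else None
```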

azaroth42 commented 5 months ago

For data currently managed in SQLite, we should see significant performance gains from using LMDB (via lmdbm): https://pypi.org/project/lmdbm/

For 1M random reads, LMDB took 10 seconds, whereas SQLite took 447 seconds.
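A minimal sketch of the dict-like lmdbm interface, assuming `pip install lmdbm`; the file name and keys are illustrative, and keys/values are raw bytes unless Lmdb is subclassed to add an encoding:

```python
# Hypothetical lmdbm usage; keys and values are bytes by default.
from lmdbm import Lmdb

with Lmdb.open("reconcile.lmdb", "c") as db:     # "c" = create if missing
    db[b"label:Example Person"] = b"https://example.org/person/123"
    print(db.get(b"label:Example Person"))       # b'https://example.org/person/123'
```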

azaroth42 commented 5 months ago

Storage proposal:

Rationale:

Multiple processes writing to LMDB can cause the file size (and thus memory footprint) to grow constantly, as the different writers don't have access to each other's pages. So we can't easily use it as a Redis replacement. It's designed for values up to 2KB, so it's also not good for documents, as larger values cause paging across single values. It has set_range() (à la BerkeleyDB) and hence can cursor through keys in order, which will let the missing-death-dates processing work again.
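As a sketch of the set_range() point, here is a hedged example using the lower-level py-lmdb binding that lmdbm wraps; the key layout (a `death_date:` prefix with empty values for unknown dates) is an assumption for illustration only:

```python
# Illustrative range scan with py-lmdb's cursor.set_range(); key scheme is hypothetical.
import lmdb

env = lmdb.open("people.lmdb", map_size=2**30)   # 1 GB map size; tune as needed
with env.begin() as txn:
    cursor = txn.cursor()
    prefix = b"death_date:"
    if cursor.set_range(prefix):                 # jump to first key >= prefix
        for key, value in cursor:
            if not key.startswith(prefix):
                break
            if value == b"":                     # record with no death date yet
                print("missing death date:", key.decode())
```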

Redis is very fast and supports multiple writers and readers. It could be distributed over the network if needed, rather than kept local.
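For comparison, a small sketch of the multi-writer pattern in Redis (key names are hypothetical); every worker process can open its own connection and read or write without the single-writer constraint LMDB has:

```python
# Hypothetical Redis-backed reconciliation lookup; safe for many concurrent workers.
import redis

r = redis.Redis(host="localhost", port=6379)

def record_equivalent(src_uri: str, canonical_uri: str) -> None:
    r.set(f"equiv:{src_uri}", canonical_uri)     # any worker may write

def resolve(src_uri: str):
    value = r.get(f"equiv:{src_uri}")
    return value.decode() if value else None
```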

Postgres is good at dealing with large volumes of data (1.5 TB currently) and has native JSON(B) support. It's easy to add metadata fields and indexes. We could swap it for CouchDB, Couchbase, CrateDB, MongoDB or others ... but there's no reason to believe there'd be significant performance gains from doing so. It can be distributed or local as needed.
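A rough sketch, assuming psycopg2 and a `lux` database, of what a JSONB record cache with an extra metadata column and a GIN index might look like; the table and column names are illustrative, not the pipeline's actual schema:

```python
# Hypothetical JSONB cache table in Postgres with a metadata column and index.
import psycopg2
from psycopg2.extras import Json

conn = psycopg2.connect("dbname=lux")
with conn, conn.cursor() as cur:
    cur.execute("""
        CREATE TABLE IF NOT EXISTS record_cache (
            identifier  TEXT PRIMARY KEY,
            data        JSONB NOT NULL,
            fetched_at  TIMESTAMPTZ DEFAULT now()   -- metadata fields are cheap to add
        )
    """)
    cur.execute("CREATE INDEX IF NOT EXISTS idx_record_cache_data "
                "ON record_cache USING GIN (data)")
    cur.execute(
        "INSERT INTO record_cache (identifier, data) VALUES (%s, %s) "
        "ON CONFLICT (identifier) DO UPDATE SET data = EXCLUDED.data",
        ("https://example.org/object/1", Json({"type": "HumanMadeObject"})),
    )
```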

Changes to make:

No Change: