RFC: Offline DB/API revamp

Harjot1Singh commented 4 years ago

Problem

The repository is built into a relational database, in the form of an SQLite output, and Docker containers for Postgres + MariaDB.

This is great for a huge number of use-cases, however, the data (imo) is intrinsically document-based. That is not to say that relations do not exist, but that they're only useful for gleaning insights if you were querying the database blind and did not understand what it contains, which is often not the case.

In practice, this means that we end up transforming the data into a relational format, but to consume the data in a useful manner, we have to join it back together during runtime. This seems a little pointless and is the source of slowness.

Proposition

I can see the flexibility of the relational database being used in applications to do with data exploration and visualisation, so there is definitely a valid use-case to keep supporting this. But the majority of the time, the database is leveraged in a search-oriented fashion.

So, I propose that we provide a NoSQL/document-based format, which will facilitate high-speed querying offline, whilst maintaining our relational formats.

We'd want to expose this functionality through @shabados/database and expose any of our search features as wrappers, as we currently do. I would be personally happy with dropping the SQLite dependency from the JS library entirely, if need be.

Other options

Denormalisation: I did consider producing a denormalised SQLite output. But I suspect it'd be far too large for offline storage on consumer devices.

Design Considerations

Doesn't require too much more space than our current DB (~150MB normalised)
Fixed, local dataset, so write speed/concurrency/scale issues do not matter
Full-text/fuzzy is a bonus (but a full API should really deal with this)
Prioritising ease of use and speed regarding querying shabads, banis, and lines, and anyone who'd like to use it for other things can use the SQLite build to do so
Differential updates/sync in the future, so there could be an API server or master instance sitting somewhere, and streams DB updates to everyone
Useable on desktop and mobile
Local JS API publishable in a similar manner to the existing @shabados/database
Partial sync

Please post any additional considerations.

Solutions

Comments for suitable technologies and products would be appreciated.

bhajneet commented 4 years ago

I'm a big fan of this. Have you also considered a supplmentary lmdb for enhanced performance? Something specific to desktop/mobile, and perhaps not for other database users.

Check this out https://en.wikipedia.org/wiki/Lightning_Memory-Mapped_Database it's a key value but with improvements to narrow down the key faster. If we were to use this for all potential search queries it would be gross, we'd be talking in the gigabytes of storage, however if we were to use it smartly (perhaps using serialized ids/numbers for values and incomplete lists), we may be able to squeeze out even further performance than a document based db would.

bhajneet commented 3 years ago

Differential updates/sync in the future, so there could be an API server or master instance sitting somewhere, and streams DB updates to everyone

Will we control which sources are included in this? Can the developer only update/sync a few sources/publications for a variable number of compositions? Can the user make offline only a few sources ? Etc.

Harjot1Singh commented 3 years ago

Notes so far

Search indexes use a LOT of storage, if you pregenerate the possible results for every query (about 86MB just for SGGS, with only line IDs attached).
- 2.5ms for fetching 50 results from the search table in SQLite. About 33ms with a LIKE query instead.
- Compression would make repeating/denormalisation much more palatable.
A PouchDB is a single index. You need multiple to store different "views".
- It's not amazingly fast (40ms) to fetch the list of results as ids from the index and then request the full docs from another PouchDB.
- It is very fast to fetch everything in one shot (5ms?)
Denormalising SQLite entirely to reduce joins speeds up fetches (6-15ms) at the expense of storage (600mb including transliterations + translations, 158mb with translations, 20Mb for just Gurbani). Joining ONLY translations and transliterations 158mb to denormalised lines.

Harjot1Singh commented 3 years ago

Perhaps we only need to solve one issue, aside from dropping the ORM. The smart sorting:

select * from `line_content` inner join `lines` on `lines`.`id` = `line_content`.`line_id` inner join `shabads` on `shabads`.`id` = `lines`.`shabad_id` inner join `writers` on `writers`.`id` = `shabads`.`writer_id` inner join `translations` on `translations`.`line_id` = `line_content`.`line_id` inner join `transliterations` on `transliterations`.`line_id` = `line_content`.`line_id` where `line_content`.`first_letters` like '%a%' limit 50
op: 12.902ms

vs

select * from `line_content` inner join `lines` on `lines`.`id` = `line_content`.`line_id` inner join `shabads` on `shabads`.`id` = `lines`.`shabad_id` inner join `writers` on `writers`.`id` = `shabads`.`writer_id` inner join `translations` on `translations`.`line_id` = `line_content`.`line_id` inner join `transliterations` on `transliterations`.`line_id` = `line_content`.`line_id` where `line_content`.`first_letters` like '%a%' order by CASE  WHEN first_letters LIKE 'a' THEN 0 ELSE 1 END asc limit 50
op: 297.052ms

the CASE WHEN sorting is pretty slow... Maybe giving 100-150mb to a search index isn't such a bad idea...

Harjot1Singh commented 3 years ago

with generated serach index:

select * from `search_only` inner join `line_content` on `line_content`.`line_id` = `search_only`.`line_id` 
inner join `lines` on `lines`.`id` = `line_content`.`line_id` inner join `shabads` on `shabads`.`id` = `lines`.`shabad_id` inner join `writers` on `writers`.`id` = `shabads`.`writer_id` inner join `translations` on `translations`.`line_id` = `line_content`.`line_id` inner join `transliterations` on `transliterations`.`line_id` = `line_content`.`line_id` where `search_only`.`query` = 'hggm' limit 50
op: 11.707ms

saihaj commented 3 years ago

Maybe giving 100-150mb to a search index isn't such a bad idea

For offline support if we plan to bundle this @shabados/mobile wouldn't that just make app size even bigger? Is there a way we can split into small chunks to keep it smaller?

Harjot1Singh commented 3 years ago

Indeed! Which is why Partial Sync support is planned for the next release!

shabados / database