shabados / database

A digital representation of Sikh Bani and other Panthic texts with a public logbook of sangat-sourced corrections.
https://docs.shabados.com/database
GNU General Public License v3.0
63 stars 23 forks source link

RFC: Offline DB/API revamp #1794

Open Harjot1Singh opened 4 years ago

Harjot1Singh commented 4 years ago

Problem

The repository is built into a relational database, in the form of an SQLite output, and Docker containers for Postgres + MariaDB.

This is great for a huge number of use-cases, however, the data (imo) is intrinsically document-based. That is not to say that relations do not exist, but that they're only useful for gleaning insights if you were querying the database blind and did not understand what it contains, which is often not the case.

In practice, this means that we end up transforming the data into a relational format, but to consume the data in a useful manner, we have to join it back together during runtime. This seems a little pointless and is the source of slowness.

Proposition

I can see the flexibility of the relational database being used in applications to do with data exploration and visualisation, so there is definitely a valid use-case to keep supporting this. But the majority of the time, the database is leveraged in a search-oriented fashion.

So, I propose that we provide a NoSQL/document-based format, which will facilitate high-speed querying offline, whilst maintaining our relational formats.

We'd want to expose this functionality through @shabados/database and expose any of our search features as wrappers, as we currently do. I would be personally happy with dropping the SQLite dependency from the JS library entirely, if need be.

Other options

Design Considerations

Please post any additional considerations.

Solutions

Comments for suitable technologies and products would be appreciated.

bhajneet commented 4 years ago

I'm a big fan of this. Have you also considered a supplmentary lmdb for enhanced performance? Something specific to desktop/mobile, and perhaps not for other database users.

Check this out https://en.wikipedia.org/wiki/Lightning_Memory-Mapped_Database it's a key value but with improvements to narrow down the key faster. If we were to use this for all potential search queries it would be gross, we'd be talking in the gigabytes of storage, however if we were to use it smartly (perhaps using serialized ids/numbers for values and incomplete lists), we may be able to squeeze out even further performance than a document based db would.

bhajneet commented 3 years ago

Differential updates/sync in the future, so there could be an API server or master instance sitting somewhere, and streams DB updates to everyone

Will we control which sources are included in this? Can the developer only update/sync a few sources/publications for a variable number of compositions? Can the user make offline only a few sources ? Etc.

Harjot1Singh commented 3 years ago

Notes so far

Harjot1Singh commented 3 years ago

Perhaps we only need to solve one issue, aside from dropping the ORM. The smart sorting:

select * from `line_content` inner join `lines` on `lines`.`id` = `line_content`.`line_id` inner join `shabads` on `shabads`.`id` = `lines`.`shabad_id` inner join `writers` on `writers`.`id` = `shabads`.`writer_id` inner join `translations` on `translations`.`line_id` = `line_content`.`line_id` inner join `transliterations` on `transliterations`.`line_id` = `line_content`.`line_id` where `line_content`.`first_letters` like '%a%' limit 50
op: 12.902ms

vs

select * from `line_content` inner join `lines` on `lines`.`id` = `line_content`.`line_id` inner join `shabads` on `shabads`.`id` = `lines`.`shabad_id` inner join `writers` on `writers`.`id` = `shabads`.`writer_id` inner join `translations` on `translations`.`line_id` = `line_content`.`line_id` inner join `transliterations` on `transliterations`.`line_id` = `line_content`.`line_id` where `line_content`.`first_letters` like '%a%' order by CASE  WHEN first_letters LIKE 'a' THEN 0 ELSE 1 END asc limit 50
op: 297.052ms

the CASE WHEN sorting is pretty slow... Maybe giving 100-150mb to a search index isn't such a bad idea...

Harjot1Singh commented 3 years ago

with generated serach index:

select * from `search_only` inner join `line_content` on `line_content`.`line_id` = `search_only`.`line_id` 
inner join `lines` on `lines`.`id` = `line_content`.`line_id` inner join `shabads` on `shabads`.`id` = `lines`.`shabad_id` inner join `writers` on `writers`.`id` = `shabads`.`writer_id` inner join `translations` on `translations`.`line_id` = `line_content`.`line_id` inner join `transliterations` on `transliterations`.`line_id` = `line_content`.`line_id` where `search_only`.`query` = 'hggm' limit 50
op: 11.707ms
saihaj commented 3 years ago

Maybe giving 100-150mb to a search index isn't such a bad idea

For offline support if we plan to bundle this @shabados/mobile wouldn't that just make app size even bigger? Is there a way we can split into small chunks to keep it smaller?

Harjot1Singh commented 3 years ago

Indeed! Which is why Partial Sync support is planned for the next release!