medic / cht-core

The CHT Core Framework makes it faster to build responsive, offline-first digital health apps that equip health workers to provide better care in their communities. It is a central resource of the Community Health Toolkit.
https://communityhealthtoolkit.org
GNU Affero General Public License v3.0

Research CouchDB Nouveau search as way to reduce disk space (TCO) #9542

Open mrjones-plip opened 1 month ago

mrjones-plip commented 1 month ago

Research how viable it is to use CouchDB's Nouveau search both to improve search and to reduce the disk use of the indexes. This should include, but not be limited to:

m5r commented 1 week ago

I'll try to answer here the questions asked in the original issue and any question that might come up along the way.

how fast is it to index?

Much, much faster than our regular couch views! I remade the contacts_by_freetext view using Nouveau, and with MoH Zanzibar's dataset it took my computer 5.5 minutes to index it, for ~400MB of disk space used. In comparison, reindexing the regular contacts_by_freetext view means reindexing every view in medic-client, which took about 2.5 hours. The dataset has roughly 2M contacts and 2.8M reports. This comparison is not apples to apples yet, but it gives us an idea of what to expect.

TODO: also index key:value pairs, then measure again with the contacts_by_type_freetext and reports_by_freetext views as nouveau indexes too. Update on this TODO: I indexed all 3 views and they take up 842MB on disk. I haven't gotten around to indexing key:value pairs because nouveau seems to complain about malformed keys. I will provide an update on this at a later point.
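For readers unfamiliar with Nouveau, an index is defined by a JS function stored in a design document, much like a regular view. The sketch below is not the index used for the measurements above: the contact types, the skipped fields, the ddoc name, and the choice to index every top-level string field as `text` are all assumptions for illustration. The `index` stub only stands in for the function Nouveau provides at indexing time, so the sketch can run outside CouchDB.

```javascript
// Stand-in for the index() function Nouveau provides to index functions.
// It only records calls so this sketch can run outside CouchDB.
const indexedFields = [];
function index(type, name, value, options) {
  indexedFields.push({ type, name, value, options });
}

// Sketch of a freetext index function for contacts (assumed logic, not CHT's).
const contactsByFreetext = function (doc) {
  // Assumed contact types; real deployments may configure others.
  var contactTypes = ['person', 'clinic', 'health_center', 'district_hospital'];
  var skipped = ['_id', '_rev', 'type'];
  if (contactTypes.indexOf(doc.type) === -1) {
    return;
  }
  Object.keys(doc).forEach(function (key) {
    var value = doc[key];
    if (skipped.indexOf(key) === -1 && typeof value === 'string' && value) {
      // 'text' fields are analyzed (tokenized) for full-text search.
      index('text', key, value);
    }
  });
};

// Design document shape: the function is stored as a string under "nouveau".
const ddoc = {
  _id: '_design/medic-nouveau', // hypothetical ddoc name
  nouveau: {
    contacts_by_freetext: {
      default_analyzer: 'standard',
      index: contactsByFreetext.toString(),
    },
  },
};
```

Keeping the function free of outer-scope references matters because only its source text is stored in the ddoc.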

how hard will it be to bifurcate offline search using the old view from online search using this new search?

TBD

how hard will it be to package up and maintain the java file which powers Nouveau?

Easy using the docker image the couchdb team publishes.

how hard will it be to measure database disk use?

It's all contained in a single docker volume, so it's easy to check manually. I wasn't sure whether any API exposes this data so it can be measured programmatically. Update: the /{db}/_design/{ddoc}/_nouveau_info/{index} API does expose this data! It's pretty neat, and it's even more granular than the _info API for regular views: we can see how much disk space each nouveau view takes instead of only having the data for the ddoc as a whole. So how hard will it be to measure disk use? Pretty easy.
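A minimal sketch of reading that number programmatically, assuming the response carries the size under `search_index.disk_size` (this shape should be checked against the CouchDB version in use). The helper names, the base URL, and the `authHeader` parameter are hypothetical.

```javascript
// Build the _nouveau_info URL for one index (path taken from the comment above).
function nouveauInfoUrl(baseUrl, db, ddoc, indexName) {
  return [
    baseUrl.replace(/\/+$/, ''),
    encodeURIComponent(db),
    '_design',
    encodeURIComponent(ddoc),
    '_nouveau_info',
    encodeURIComponent(indexName),
  ].join('/');
}

// Extract the on-disk size in bytes. The response shape is an assumption.
function diskSizeBytes(info) {
  const size = info && info.search_index && info.search_index.disk_size;
  if (typeof size !== 'number') {
    throw new Error('disk_size not found in _nouveau_info response');
  }
  return size;
}

// Fetch and return the disk size of one Nouveau index (requires Node 18+ for fetch).
async function fetchIndexDiskSize(baseUrl, db, ddoc, indexName, authHeader) {
  const res = await fetch(nouveauInfoUrl(baseUrl, db, ddoc, indexName), {
    headers: { Authorization: authHeader },
  });
  if (!res.ok) {
    throw new Error('Request failed with status ' + res.status);
  }
  return diskSizeBytes(await res.json());
}
```

Calling `fetchIndexDiskSize` once per index would give the per-index breakdown described above.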

how hard will it be to upgrade the index?

First, upgrading nouveau alongside couchdb can currently break existing indexes. It broke during the 3.4.1 => 3.4.2 upgrade, and the couchdb team quickly fixed the underlying problem. With that said, they are committed to upgrades not causing nouveau view reindexing and are working towards automatic view reindexing. From their slack:

there will be a control on concurrency of rebuild, couchdb/nouveau will keep track of which indexes need rebuilding, and will then rebuild them over time, switching to the new one once complete and deleting the old.

So hopefully upgrades will be smooth when nouveau stabilizes.

And second, changing the view (modifying its code so that it indexes documents differently) should be as straightforward as changing a regular couch view. It's essentially a JS function that lives in a ddoc, same as the regular views we already have. I still need to double-check whether the view gets reindexed as soon as the ddoc is changed or on the first query.

how much disk savings do we see?

TL;DR: I saw ~25% disk savings for MoH Zanzibar.

Starting with the existing snapshot of MoH Zanzibar after compaction and views cleanup, disk usage of CouchDB was 55GB. Removing the 3 freetext views from medic-client and re-running compaction, disk usage went down to 40GB. After indexing those 3 freetext views with Nouveau, Nouveau's disk usage was 800MB but let's round it up to 1GB. So we went down from 55GB to (40 + 1)GB, netting roughly 25% savings.
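The arithmetic behind that figure, as a small sketch using only the numbers quoted above (the function name is made up for illustration):

```javascript
// Percentage of disk saved after moving the freetext views to Nouveau.
function diskSavingsPercent(beforeGb, couchAfterGb, nouveauGb) {
  const afterGb = couchAfterGb + nouveauGb;
  return ((beforeGb - afterGb) / beforeGb) * 100;
}

// 55GB before; 40GB of CouchDB plus ~1GB of Nouveau after.
const savings = diskSavingsPercent(55, 40, 1);
console.log(savings.toFixed(1) + '%'); // prints "25.5%"
```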

I haven't noticed Nouveau's disk usage going above 1GB during the 5 minutes it took to index the views but I'm planning to make changes to https://github.com/jkuester/chtoolbox/ to monitor that as well. As mentioned earlier, couch exposes an API to help with this. This will come in handy on top of our current disk monitoring functionality.

what else are we forgetting?

how important is the order value emitted in the freetext views and how can we replicate this behavior with nouveau?

var order = dead + ' ' + muted + ' ' + idx + ' ' + (doc.name && doc.name.toLowerCase());

It seems to be used to put the dead and muted contacts at the end of the search results and to order the results alphabetically. By default, Nouveau sorts results by relevance. At query time, we can pass a sort parameter that tells nouveau to sort results based on the field(s) named in that parameter. Since sorting relies on fields that are in the document, we will have to find a workaround to keep this working. The most obvious workaround would be to migrate every contact document to add a sorting_order field, and have a transition that catches death reports (and their undos) and muting to update this field. This is not ideal; it would be better if this could be calculated in the view like it is today.
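To make the query-time sort concrete, here is a sketch of a request body for a Nouveau search. The field name `sorting_order` comes from the workaround described above. The exact format accepted by `sort` (an array of field names here) and the example query string are assumptions to verify against the Nouveau documentation.

```javascript
// Build the JSON body for a Nouveau search request with an explicit sort.
// Without "sort", results come back ordered by relevance.
function buildSearchBody(query, sortFields, limit) {
  const body = { q: query, limit: limit };
  if (sortFields && sortFields.length > 0) {
    body.sort = sortFields; // assumed format: array of field names
  }
  return body;
}

// Example: search contacts by name, with dead and muted contacts last,
// assuming sorting_order encodes that ordering like the existing view does.
const body = buildSearchBody('name:amina*', ['sorting_order'], 25);
console.log(JSON.stringify(body));
```

The body would be sent to the index's `_nouveau` query endpoint on the design document.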

more to come...

mrjones-plip commented 1 week ago

this is great - thanks @m5r !

One question that I didn't ask, which I've just added to the body, is:

How much disk savings do we see?

In more detail: If we start with the freetext view we have now for online users, delete it, recreate it in Nouveau, what's the savings? As well, when upgrading/recreating the index in Nouveau, how much spare ephemeral disk do we need? Do we know the percent of total disk the freetext view takes up so we can try and compare to what Nouveau will take up?

Feel free to break this into its own sub-ticket if the research seems a rabbit hole unto itself!

m5r commented 2 days ago

Great question @mrjones-plip 👀 I updated my previous comment with this week's updates and an answer to your question

mrjones-plip commented 11 hours ago

Thanks for the write up @m5r !

we went down from 55GB to (40 + 1)GB, netting roughly 25% savings

To be clear, this is a 25% savings on the medic-client database, not on all couchdb + Nouveau databases, correct?

I'm planning to make changes to chtoolbox to monitor [view creation disk use] as well

yes! i love this idea.