medic / cht-core

The CHT Core Framework makes it faster to build responsive, offline-first digital health apps that equip health workers to provide better care in their communities. It is a central resource of the Community Health Toolkit.
https://communityhealthtoolkit.org
GNU Affero General Public License v3.0

Split up views into different ddocs #2849

Open SCdF opened 7 years ago

SCdF commented 7 years ago

We should have a proper discussion about the pros and cons of this approach.

I'll go first.

Pros:

Cons:

As a reference, I took an example branch manager's database, and manually put each view in its own ddoc. Here are the sizes.

view                                 size (MB)
contacts_by_freetext                 15.72
contacts_by_parent_name_type         2.75
contacts_by_type                     1.75
contacts_by_type_freetext            16.52
contacts_by_type_index_name          2.21
data_records_read_by_type            0.17
doc_by_type                          1.56
doc_summaries_by_id                  4.24
feedback                             0.01
forms                                0.01
help_pages                           0.27
messages_by_contact_date             0.26
people_by_phone                      0.02
places_by_contact                    2.75
reports_by_date                      0.14
reports_by_form                      0.13
reports_by_form_year_month_places    0.25
reports_by_form_year_week_places     0.26
reports_by_freetext                  6.85
reports_by_place                     0.24
reports_by_subject                   0.20
reports_by_validity                  0.12
reports_by_verification              0.12
total_clinics_by_facility            4.60
Total                                61.5

From this we can make some rough statements:

This seems like a pretty worthwhile investment to me. I'm pretty sure contacts_by_type_freetext isn't even really accessible by anyone anymore (because search has been simplified), so it would save us ~25% out of the gate.
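
As a rough check of that ~25% estimate, the sketch below sums the per-view sizes listed above and works out the share taken by contacts_by_type_freetext. The figures are copied from the list (they are rounded there, so their sum need not match the quoted total exactly), and `share_of_total` is only an illustrative helper name.

```python
# Per-view index sizes in MB, copied from the list above.
VIEW_SIZES_MB = {
    "contacts_by_freetext": 15.72,
    "contacts_by_parent_name_type": 2.75,
    "contacts_by_type": 1.75,
    "contacts_by_type_freetext": 16.52,
    "contacts_by_type_index_name": 2.21,
    "data_records_read_by_type": 0.17,
    "doc_by_type": 1.56,
    "doc_summaries_by_id": 4.24,
    "feedback": 0.01,
    "forms": 0.01,
    "help_pages": 0.27,
    "messages_by_contact_date": 0.26,
    "people_by_phone": 0.02,
    "places_by_contact": 2.75,
    "reports_by_date": 0.14,
    "reports_by_form": 0.13,
    "reports_by_form_year_month_places": 0.25,
    "reports_by_form_year_week_places": 0.26,
    "reports_by_freetext": 6.85,
    "reports_by_place": 0.24,
    "reports_by_subject": 0.20,
    "reports_by_validity": 0.12,
    "reports_by_verification": 0.12,
    "total_clinics_by_facility": 4.60,
}


def share_of_total(view_name, sizes):
    """Return the fraction of the summed index size taken by one view."""
    return sizes[view_name] / sum(sizes.values())


total = sum(VIEW_SIZES_MB.values())
share = share_of_total("contacts_by_type_freetext", VIEW_SIZES_MB)
print(f"sum of listed sizes: {total:.2f} MB")
print(f"contacts_by_type_freetext share: {share:.1%}")
```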

estellecomment commented 7 years ago

:star: :unicorn_face: :stefan: :unicorn_face: :star:

SCdF commented 7 years ago

An interesting thought occurs: CouchDB may be incomprehensible Erlang with potentially challenging constraints that force views to be regenerated at the ddoc level. There may be no reason that PouchDB has to be this way though (or even is this way?).

SCdF commented 7 years ago

I think we're going to drop this. PouchDB already indexes each view separately, and the client is where our main performance concern is.

mandric commented 7 years ago

Isn't a major win with separate ddocs that we can do sane upgrades? So if you have a design doc with 30 views and you update one view, the entire index needs to rebuild, so that means gigabytes of data which can take potentially hours of rebuilding. As is, a minor update/bug fix in a view causes a very poor upgrade experience.
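
To make the upgrade argument concrete, here is a minimal sketch. It assumes a simplified model in which the index for a ddoc is tied to a hash of all its view definitions, so any change invalidates every view in that ddoc. `index_signature`, the placeholder map functions, and the one-ddoc-per-view names are hypothetical and are not CouchDB's or cht-core's actual code.

```python
import hashlib
import json


def index_signature(ddoc):
    """Simplified stand-in (assumption) for a per-ddoc index signature:
    a hash over every view definition in the design doc."""
    canonical = json.dumps(ddoc["views"], sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def views_to_rebuild(old_ddocs, new_ddocs):
    """Return the views whose index would be rebuilt after an upgrade."""
    rebuilt = []
    for ddoc_id, new in new_ddocs.items():
        old = old_ddocs.get(ddoc_id)
        if old is None or index_signature(old) != index_signature(new):
            rebuilt.extend(sorted(new["views"]))
    return rebuilt


# Placeholder map functions; the real view code is not reproduced here.
OLD_MAP = "function(doc) { emit(doc._id); }"
FIXED_MAP = "function(doc) { if (doc.type) { emit(doc._id); } }"
VIEW_NAMES = ["reports_by_date", "reports_by_form", "contacts_by_type"]


def single_ddoc(changed=None):
    """All views in one design doc; `changed` names the view with a bug fix."""
    views = {n: {"map": FIXED_MAP if n == changed else OLD_MAP} for n in VIEW_NAMES}
    return {"_design/medic-client": {"views": views}}


def split_ddocs(changed=None):
    """One (hypothetically named) design doc per view."""
    return {
        f"_design/{n}": {"views": {n: {"map": FIXED_MAP if n == changed else OLD_MAP}}}
        for n in VIEW_NAMES
    }


print(views_to_rebuild(single_ddoc(), single_ddoc(changed="reports_by_form")))
print(views_to_rebuild(split_ddocs(), split_ddocs(changed="reports_by_form")))
```

Under this model, the single-ddoc layout rebuilds all three views after a one-view fix, while the split layout rebuilds only the view that changed.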

SCdF commented 7 years ago

Yep that's true.

nice-snek commented 7 years ago

hi frand @${assignee}, U is coot

Please close or schedule before the end of this sprint. See triaging old issues.

Nice Snek

garethbowen commented 6 years ago

Comment from Jan Lehnardt:

Quick comment on one or multiple view(s)-per-ddoc: this is a performance trade-off, and neither option is always correct. But generally, I would recommend grouping all views an app would need into a single ddoc.

For each ddoc, all docs in a database have to be serialised and shipped to couchjs and the results are shipped back, that’s the bulk of the work in view indexing. Evaluating a single map/reduce function is comparatively minuscule, so grouping views in a single ddoc makes that more efficient.
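
The trade-off described here can be sketched as a toy cost model: every ddoc pays a per-document serialisation and shipping cost once, and every view adds a much smaller per-document evaluation cost. The constants below are made-up assumptions chosen only to show the shape of the argument, and the model covers a full index build, not incremental updates.

```python
def indexing_cost(num_docs, views_per_ddoc, ship_cost=1.0, eval_cost=0.01):
    """Toy cost of building all indexes from scratch.

    views_per_ddoc: list with the number of views in each ddoc.
    ship_cost:  assumed cost of serialising and shipping one doc for one ddoc.
    eval_cost:  assumed cost of running one map function on one doc.
    """
    shipping = len(views_per_ddoc) * num_docs * ship_cost
    evaluation = sum(views_per_ddoc) * num_docs * eval_cost
    return shipping + evaluation


NUM_DOCS = 100_000  # hypothetical database size
one_ddoc = indexing_cost(NUM_DOCS, [24])       # 24 views grouped in one ddoc
per_view = indexing_cost(NUM_DOCS, [1] * 24)   # one ddoc per view

print(f"single ddoc:       {one_ddoc:,.0f}")
print(f"one ddoc per view: {per_view:,.0f}")
```

With these assumed constants the evaluation term is identical in both layouts, so the whole difference comes from shipping every document once per ddoc.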

SCdF commented 6 years ago

So here is another approach we could consider: Splitting views into a small collection of ddocs based on how frequently they get used.

We could work this out by querying access logs, but there are some really obvious ways to break this down:

It's also worth noting that this is only relevant on the server side: PouchDB generates views per view on the client side.

SCdF commented 6 years ago

OK, I quickly checked a couple of projects to see how frequently they request each view:

grep "medic/_design" /srv/storage/medic-core/couchdb/logs/couch.log | grep "/_view/" | sed "s/.*medic\/_design\([^ ?]*\).*/\1/" | sort | uniq -c | sort -rn

For an older pre-lineage mobile project I saw this:

  13090 /medic/_view/contacts_by_depth
  13086 /medic/_view/docs_by_replication_key
   1526 /medic-client/_view/data_records_read_by_type
    833 /medic-client/_view/reports_by_subject
    640 /medic-client/_view/doc_summaries_by_id
    601 /medic-client/_view/contacts_by_type
    523 /medic-client/_view/contacts_by_type_freetext
    420 /medic-client/_view/contacts_by_parent_name_type
    164 /medic/_view/due_tasks
     43 /medic-client/_view/forms
     26 /medic-client/_view/doc_by_type
      1 /medic/_view/usage_stats_by_year_month
      1 /medic-client/_view/contacts_by_type_index_name

And for a newer SMS project I saw this:

  23462 /medic-client/_view/data_records_by_type
  13574 /medic-client/_view/docs_by_id_lineage
   4643 /medic-client/_view/doc_summaries_by_id
   1882 /medic-client/_view/reports_by_date
    954 /medic/_view/tasks_messages
    577 /medic-client/_view/contacts_by_phone
    439 /medic-client/_view/messages_by_contact_date
    437 /medic/_view/patient_by_patient_shortcode_id
    322 /medic/_view/registered_patients
    248 /medic-client/_view/reports_by_form
    164 /medic/_view/due_tasks
    140 /medic-client/_view/reports_by_freetext
    137 /medic-client/_view/contacts_by_type
     84 /medic-client/_view/doc_by_type
     77 /medic-client/_view/reports_by_subject
     62 /medic-client/_view/contacts_by_parent
     51 /medic-client/_view/forms
      7 /medic/_view/visits_by_district_and_patient
      7 /medic/_view/delivery_reports_by_district_and_code
      4 /medic/_view/data_records
      4 /medic-client/_view/contacts_by_freetext
      1 /medic/_view/usage_stats_by_year_month
      1 /medic-client/_view/reports_by_place
      1 /medic-client/_view/data_records

We should definitely benchmark this, especially once we have a benchmark suite for replication, and once we do some sentinel performance testing.

But it looks pretty obvious to me that we could at least put /medic/contacts_by_depth, /medic/docs_by_replication_key, /medic-client/data_records_by_type, and /medic-client/docs_by_id_lineage into one or two separate ddocs.
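
A minimal sketch of how that split could be derived from the counts: parse the `uniq -c` style lines and partition views around a request-count threshold. The sample is a subset of the lines above; the threshold of 10000 and the function names are assumptions for illustration, and real cut-offs should come from benchmarking.

```python
# A subset of the request counts listed above (count, then /ddoc/_view/name).
REQUEST_COUNTS = """
  13090 /medic/_view/contacts_by_depth
  13086 /medic/_view/docs_by_replication_key
   1526 /medic-client/_view/data_records_read_by_type
    833 /medic-client/_view/reports_by_subject
  23462 /medic-client/_view/data_records_by_type
  13574 /medic-client/_view/docs_by_id_lineage
   4643 /medic-client/_view/doc_summaries_by_id
   1882 /medic-client/_view/reports_by_date
"""


def parse_counts(text):
    """Map (ddoc, view) to its total request count."""
    counts = {}
    for line in text.strip().splitlines():
        count, path = line.split()
        ddoc, _, view = path.strip("/").split("/")
        counts[(ddoc, view)] = counts.get((ddoc, view), 0) + int(count)
    return counts


def partition_by_frequency(counts, threshold):
    """Split views into frequently and infrequently requested groups."""
    hot = sorted(key for key, n in counts.items() if n >= threshold)
    cold = sorted(key for key, n in counts.items() if n < threshold)
    return hot, cold


hot, cold = partition_by_frequency(parse_counts(REQUEST_COUNTS), threshold=10000)
for ddoc, view in hot:
    print(f"hot:  /{ddoc}/{view}")
for ddoc, view in cold:
    print(f"cold: /{ddoc}/{view}")
```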

Thoughts @garethbowen ?

garethbowen commented 6 years ago

Splitting views by use case makes a lot of sense to me. I think docs_by_id_lineage will be used by everyone so I don't think that works, but what about this...

Projects that do both use cases may be slightly worse off but we don't have many (any?) of those right now, except maybe some standard ones and they're small by definition so should be ok.

SCdF commented 6 years ago

@garethbowen that is definitely one approach we could take to splitting these up. I'm hoping that once we have a benchmark suite we can make better decisions. It might be that they can be split up by function like you suggest. It might be that we actually want to split them up by frequency instead, which may look random and not line up as cleanly as your example does.

alxndrsn commented 6 years ago

It might be that they can be split up by function like you suggest. It might be that we actually want to split them up by frequency instead, which may look random and not cleanly line up as your example might.

To some extent, frequency (f) seems to be a product of function - e.g. projects without SMS should have f=0 for the proposed medic-gateway ddoc.

Ref medic-client, it obviously makes sense to keep all views together for use on the client, but might there still be an advantage to pulling views that the server uses heavily into separate ddoc(s)?

garethbowen commented 2 years ago

Not blocking for 4.0.0