Split up views into different ddocs

SCdF commented 7 years ago

We should have a proper discussion about the pros and cons of this approach.

I'll go first.

Pros:

It makes it very clear how large each view is, which helps with keeping track of performance
It reduces the likelihood of unused views eating RAM / disk space / CPU cycles, since views are only (re)generated when you query them
Importantly for mobile, it makes sure that views users aren't actually using don't contribute to RAM usage. If you never search a report by freetext, it will never get indexed.
It would make the application snappier, since when (P|C)ouchDB boots and has to start loading views it only builds exactly what is necessary

Cons:

It makes our code more complicated
We'd have to work out how to deploy it in a non-annoying way (maybe we've already done this re: medic-client)
It makes it harder to know that you've warmed all the caches, since you have to hit each view individually. This could lead to the application feeling "bursty" and unpredictable.
It hides issues, and could lead us to be less disciplined in what we put in medic-client.

As a reference, I took an example branch manager's database, and manually put each view in its own ddoc. Here are the sizes.

view	size
contacts_by_freetext	15.72MB
contacts_by_parent_name_type	2.75MB
contacts_by_type	1.75MB
contacts_by_type_freetext	16.52MB
contacts_by_type_index_name	2.21MB
data_records_read_by_type	0.17MB
doc_by_type	1.56MB
doc_summaries_by_id	4.24MB
feedback	0.01MB
forms	0.01MB
help_pages	0.27MB
messages_by_contact_date	0.26MB
people_by_phone	0.02MB
places_by_contact	2.75MB
reports_by_date	0.14MB
reports_by_form	0.13MB
reports_by_form_year_month_places	0.25MB
reports_by_form_year_week_places	0.26MB
reports_by_freetext	6.85MB
reports_by_place	0.24MB
reports_by_subject	0.20MB
reports_by_validity	0.12MB
reports_by_verification	0.12MB
total_clinics_by_facility	4.60MB
Total	61.5MB

From this we can make some rough statements:

If users never search for contacts, they reduce their problem area by ~50%
If they never search for contacts or reports they reduce their problem area by ~60%

This seems like a pretty worthwhile investment to me. I'm pretty sure contacts_by_type_freetext isn't even really accessible by anyone anymore (because search has been simplified), so it would save us ~25% out of the gate.
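
The rough percentages above can be checked with a few lines of arithmetic. This is a minimal sketch that uses the three freetext view sizes and the stated total of 61.5 from the table; the variable and function names are illustrative only.

```javascript
// Sizes (in megabytes) copied from the table above.
const sizes = {
  contacts_by_freetext: 15.72,
  contacts_by_type_freetext: 16.52,
  reports_by_freetext: 6.85,
};
const total = 61.5; // the "Total" row of the table

// Share of the total taken up by the named views, as a whole percentage.
function share(viewNames) {
  const sum = viewNames.reduce((acc, name) => acc + sizes[name], 0);
  return Math.round((sum / total) * 100);
}

const contactSearch = share(['contacts_by_freetext', 'contacts_by_type_freetext']);
const allSearch = share(Object.keys(sizes));
const unusedView = share(['contacts_by_type_freetext']);

console.log({ contactSearch, allSearch, unusedView });
```

These come out at roughly 52%, 64% and 27%, in line with the ~50%, ~60% and ~25% estimates.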

estellecomment commented 7 years ago

:star: :unicorn_face: :stefan: :unicorn_face: :star:

SCdF commented 7 years ago

An interesting thought occurs: CouchDB may be incomprehensible Erlang with potentially challenging constraints that force views to be regenerated at the ddoc level. There may be no reason that PouchDB has to be this way though (or even is this way?).

SCdF commented 7 years ago

I think we're going to drop this. PouchDB already does this and this is where our main performance concern is.

mandric commented 7 years ago

Isn't a major win with separate ddocs that we can do sane upgrades? So if you have a design doc with 30 views and you update one view, the entire index needs to rebuild, so that means gigabytes of data which can take potentially hours of rebuilding. As is, a minor update/bug fix in a view causes a very poor upgrade experience.
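
A simplified way to picture the upgrade problem: CouchDB identifies a ddoc's view index by a signature derived from the ddoc's view definitions taken together, so a change to any one view gives the whole ddoc a new signature and a fresh index build. The sketch below is a toy model of that idea, not CouchDB's actual code; the `signature` helper and the placeholder map functions are assumptions made for illustration.

```javascript
const crypto = require('crypto');

// Simplified stand-in for an index signature: a hash over ALL view
// definitions in a ddoc. This is a model, not CouchDB's real algorithm.
function signature(ddoc) {
  return crypto.createHash('md5').update(JSON.stringify(ddoc.views)).digest('hex');
}

// Placeholder map functions; the real views are more involved.
const combined = {
  _id: '_design/medic',
  views: {
    forms: { map: 'function (doc) { /* v1 */ }' },
    due_tasks: { map: 'function (doc) { /* v1 */ }' },
  },
};

// A bug fix to one view...
const patched = JSON.parse(JSON.stringify(combined));
patched.views.due_tasks.map = 'function (doc) { /* v2 */ }';

// ...changes the signature of the whole ddoc, so `forms` is rebuilt too.
const combinedChanged = signature(combined) !== signature(patched);

// With one view per ddoc, the untouched view keeps its signature.
const formsOnly = { _id: '_design/forms', views: { forms: combined.views.forms } };
const formsOnlyAfter = { _id: '_design/forms', views: { forms: patched.views.forms } };
const formsChanged = signature(formsOnly) !== signature(formsOnlyAfter);

console.log({ combinedChanged, formsChanged });
```

In this model the combined ddoc's signature changes while the standalone `forms` ddoc's does not, which is the upgrade benefit being described.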

SCdF commented 7 years ago

Yep that's true.

nice-snek commented 7 years ago

hi frand @${assignee}, U is coot

Please close or schedule before the end of this sprint. See triaging old issues.

Nice Snek

garethbowen commented 6 years ago

Comment from Jan Lehnardt:

Quick comment on one or multiple view(s)-per-ddoc: this is a performance trade-off and neither one is always correct. But generally, I would recommend grouping all views an app would need into a single ddoc.

For each ddoc, all docs in a database have to be serialised and shipped to couchjs and the results are shipped back, that’s the bulk of the work in view indexing. Evaluating a single map/reduce function is comparatively minuscule, so grouping views in a single ddoc makes that more efficient.
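
The trade-off described here can be put into a toy cost model. This is only a sketch: the unit costs `shipCost` and `mapCost` are made-up numbers chosen so that shipping a doc to couchjs dominates evaluating a map function, as the comment suggests, and `indexingCost` is a hypothetical helper, not a CouchDB API.

```javascript
// Toy model: every ddoc pays to serialise and ship each doc once,
// then pays a much smaller cost per view evaluated on that doc.
// Both unit costs are invented for illustration, not measurements.
function indexingCost(docCount, viewsPerDdoc, shipCost = 10, mapCost = 1) {
  return viewsPerDdoc.reduce(
    (total, viewCount) => total + docCount * (shipCost + viewCount * mapCost),
    0
  );
}

const docCount = 1000;
// 24 views in a single ddoc.
const oneDdoc = indexingCost(docCount, [24]);
// 24 ddocs with one view each.
const manyDdocs = indexingCost(docCount, new Array(24).fill(1));

console.log({ oneDdoc, manyDdocs });
```

Under these assumed costs a full build of every view is much cheaper with one ddoc. Splitting only pays off in this model when most of the ddocs are never queried, or when a single view changes and the rest can keep their existing indexes.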

SCdF commented 6 years ago

So here is another approach we could consider: Splitting views into a small collection of ddocs based on how frequently they get used.

We could work this out by querying access logs, but there are some really obvious ways to break this down:

Anything sentinel uses in transitions (docs_by_id_lineage) will be run between every single document, and so should be in as small a ddoc as possible
There are probably large views that are never used programmatically (eg the two large search views) which could be in their own ddoc.

It's also worth noting that this is only relevant on the server side: on the client side PouchDB already builds its index per view rather than per ddoc.

SCdF commented 6 years ago

OK, I quickly checked a couple of projects to see how frequently they request each view:

grep "medic/_design" /srv/storage/medic-core/couchdb/logs/couch.log | grep "/_view/" | sed "s/.*medic\/_design\([^ ?]*\).*/\1/" | sort | uniq -c | sort -rn

For an older pre-lineage mobile project I saw this:

  13090 /medic/_view/contacts_by_depth
  13086 /medic/_view/docs_by_replication_key
   1526 /medic-client/_view/data_records_read_by_type
    833 /medic-client/_view/reports_by_subject
    640 /medic-client/_view/doc_summaries_by_id
    601 /medic-client/_view/contacts_by_type
    523 /medic-client/_view/contacts_by_type_freetext
    420 /medic-client/_view/contacts_by_parent_name_type
    164 /medic/_view/due_tasks
     43 /medic-client/_view/forms
     26 /medic-client/_view/doc_by_type
      1 /medic/_view/usage_stats_by_year_month
      1 /medic-client/_view/contacts_by_type_index_name

And for a newer SMS project I saw this:

  23462 /medic-client/_view/data_records_by_type
  13574 /medic-client/_view/docs_by_id_lineage
   4643 /medic-client/_view/doc_summaries_by_id
   1882 /medic-client/_view/reports_by_date
    954 /medic/_view/tasks_messages
    577 /medic-client/_view/contacts_by_phone
    439 /medic-client/_view/messages_by_contact_date
    437 /medic/_view/patient_by_patient_shortcode_id
    322 /medic/_view/registered_patients
    248 /medic-client/_view/reports_by_form
    164 /medic/_view/due_tasks
    140 /medic-client/_view/reports_by_freetext
    137 /medic-client/_view/contacts_by_type
     84 /medic-client/_view/doc_by_type
     77 /medic-client/_view/reports_by_subject
     62 /medic-client/_view/contacts_by_parent
     51 /medic-client/_view/forms
      7 /medic/_view/visits_by_district_and_patient
      7 /medic/_view/delivery_reports_by_district_and_code
      4 /medic/_view/data_records
      4 /medic-client/_view/contacts_by_freetext
      1 /medic/_view/usage_stats_by_year_month
      1 /medic-client/_view/reports_by_place
      1 /medic-client/_view/data_records
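
The counting pipeline above can be approximated in JavaScript. This is a sketch only: the sample log lines are invented (the real couch.log format may differ), and `countViewRequests` is a hypothetical helper that mirrors the grep, sed, sort and uniq steps.

```javascript
// Count requests per ddoc/view, mirroring the grep | sed | sort | uniq -c pipeline.
function countViewRequests(lines) {
  const counts = new Map();
  for (const line of lines) {
    // Same filters as the two greps.
    if (!line.includes('medic/_design') || !line.includes('/_view/')) continue;
    // Same capture as the sed expression: everything after "medic/_design"
    // up to the first space or "?".
    const match = line.match(/.*medic\/_design([^ ?]*)/);
    if (!match) continue;
    counts.set(match[1], (counts.get(match[1]) || 0) + 1);
  }
  // Most requested first.
  return [...counts.entries()].sort((a, b) => b[1] - a[1]);
}

// Invented sample lines; not taken from a real log.
const sampleLines = [
  'GET /medic/_design/medic-client/_view/forms?include_docs=true 200',
  'GET /medic/_design/medic/_view/due_tasks 200',
  'GET /medic/_design/medic-client/_view/forms 200',
  'GET /medic/_all_docs 200',
];

const viewCounts = countViewRequests(sampleLines);
console.log(viewCounts);
```

Query strings are dropped before counting, so requests for the same view with different parameters are grouped together, as in the listings above.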

We should definitely benchmark this, especially once we have a benchmark suite for replication, and once we do some sentinel performance testing.

But, it looks pretty obvious to me that we could at least put: /medic/contacts_by_depth, /medic/docs_by_replication_key, /medic-client/data_records_by_type, and /medic-client/docs_by_id_lineage into one or two separate ddocs.
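
One way to make "split by frequency" concrete is to pick a request-count threshold and move anything above it out. The sketch below uses a handful of the counts from the two listings above (which come from two different projects); the threshold of 10000 and the `hotViews` helper are arbitrary choices for illustration, not a proposal.

```javascript
// A few request counts copied from the two listings above.
const requestCounts = {
  '/medic/_view/contacts_by_depth': 13090,
  '/medic/_view/docs_by_replication_key': 13086,
  '/medic-client/_view/data_records_by_type': 23462,
  '/medic-client/_view/docs_by_id_lineage': 13574,
  '/medic-client/_view/doc_summaries_by_id': 4643,
  '/medic-client/_view/reports_by_date': 1882,
  '/medic/_view/due_tasks': 164,
};

// Views requested at least `threshold` times, most requested first.
function hotViews(counts, threshold) {
  return Object.entries(counts)
    .filter(([, count]) => count >= threshold)
    .sort((a, b) => b[1] - a[1])
    .map(([view]) => view);
}

const hot = hotViews(requestCounts, 10000); // threshold is an arbitrary example
console.log(hot);
```

With that threshold the result is the same four views named above; a real cut-off would need the benchmarks to justify it.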

Thoughts @garethbowen ?

garethbowen commented 6 years ago

Splitting views by use case makes a lot of sense to me. I think docs_by_id_lineage will be used by everyone so I don't think that works, but what about this...

medic or medic-client ddocs: Core views that all projects use. eg: reports_by_freetext, forms
medic-gateway ddoc: The views only needed for SMS projects, eg: tasks_messages, due_tasks, contacts_by_phone
medic-replication ddoc: The views only needed for smartphone projects, eg: contacts_by_depth, docs_by_replication_key

Projects that do both use cases may be slightly worse off but we don't have many (any?) of those right now, except maybe some standard ones and they're small by definition so should be ok.
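
A rough sketch of what that grouping could look like in code. The view names come from the list above; the `buildDdocs` helper, the shape of the `groups` mapping, and the placeholder map functions are assumptions for illustration and not how cht-core actually builds its ddocs.

```javascript
// Which ddoc each view should live in, following the grouping above.
// The core views are placed in medic-client here purely as an example.
const groups = {
  'medic-client': ['reports_by_freetext', 'forms'],
  'medic-gateway': ['tasks_messages', 'due_tasks', 'contacts_by_phone'],
  'medic-replication': ['contacts_by_depth', 'docs_by_replication_key'],
};

// Build one design doc per group from a flat map of view definitions.
function buildDdocs(views, grouping) {
  return Object.entries(grouping).map(([name, viewNames]) => {
    const ddoc = { _id: `_design/${name}`, views: {} };
    for (const viewName of viewNames) {
      if (!views[viewName]) throw new Error(`Unknown view: ${viewName}`);
      ddoc.views[viewName] = views[viewName];
    }
    return ddoc;
  });
}

// Placeholder definitions; the real map functions are not shown in this thread.
const allViews = {};
for (const viewName of Object.values(groups).flat()) {
  allViews[viewName] = { map: 'function (doc) { /* placeholder */ }' };
}

const ddocs = buildDdocs(allViews, groups);
console.log(ddocs.map((d) => `${d._id}: ${Object.keys(d.views).join(', ')}`));
```

Keeping the grouping in one mapping would make it cheap to try a different split (for example by frequency) without touching the view definitions themselves.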

SCdF commented 6 years ago

@garethbowen that is definitely one approach we could take to splitting these up. I'm hoping that once we have a benchmark suite we can make better decisions. It might be that they can be split up by function like you suggest. It might be that we actually want to split them up by frequency instead, which may look random and not cleanly line up as your example might.

alxndrsn commented 6 years ago

> It might be that they can be split up by function like you suggest. It might be that we actually want to split them up by frequency instead, which may look random and not cleanly line up as your example might

To some extent, frequency (f) seems to be a product of function - e.g. projects without SMS should have f=0 for the proposed medic-gateway ddoc.

Re medic-client, it obviously makes sense to keep all views together for use on the client, but might there still be an advantage to pulling views that the server uses heavily into separate ddoc(s)?

garethbowen commented 2 years ago

Not blocking for 4.0.0

medic / cht-core

Split up views into different ddocs #2849