Open SCdF opened 7 years ago
:star: :unicorn_face: :stefan: :unicorn_face: :star:
An interesting thought occurs: CouchDB may be incomprehensible Erlang with potentially challenging constraints that force views to be regenerated at the ddoc level. There may be no reason that PouchDB has to be this way though (or even is this way?).
I think we're going to drop this. PouchDB already does this and this is where our main performance concern is.
Isn't a major win with separate ddocs that we can do sane upgrades? If you have a design doc with 30 views and you update one view, the entire index needs to rebuild. That can mean gigabytes of data and potentially hours of rebuilding. As is, a minor update or bug fix in a view causes a very poor upgrade experience.
Yep that's true.
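To make the upgrade point concrete, here is a minimal sketch of why patching one view can invalidate its siblings. It assumes, as a simplification, that a ddoc's index is keyed by a single hash over all of its view definitions. The `index_signature` helper and the view bodies are hypothetical, and the hash choice is arbitrary; this is an illustration, not CouchDB's actual code.

```python
import hashlib
import json


def index_signature(views: dict) -> str:
    """Hash every view definition in a ddoc into one signature (illustrative only)."""
    canonical = json.dumps(views, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical view bodies; only the view names come from this thread.
one_ddoc = {
    "contacts_by_type": {"map": "function (doc) { /* v1 */ }"},
    "forms": {"map": "function (doc) { /* v1 */ }"},
}
patched = dict(one_ddoc, forms={"map": "function (doc) { /* v2, bug fix */ }"})

# Shared ddoc: patching `forms` changes the single signature, so the index
# holding `contacts_by_type` is rebuilt as well.
shared_changed = index_signature(one_ddoc) != index_signature(patched)

# One ddoc per view: only the signature for `forms` changes.
split_changed = {
    name: index_signature({name: one_ddoc[name]}) != index_signature({name: patched[name]})
    for name in one_ddoc
}
print(shared_changed, split_changed)
```

Under that assumption, the shared ddoc pays for a full rebuild on any change, while the split layout only rebuilds the view that was actually edited.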
Please close or schedule before the end of this sprint. See triaging old issues.
Comment from Jan Lehnardt:
Quick comment on one or multiple view(s)-per-ddoc: this is a performance trade-off, and neither option is always correct. But generally, I would recommend grouping all views an app would need into a single ddoc.
For each ddoc, all docs in a database have to be serialised and shipped to couchjs and the results are shipped back, that’s the bulk of the work in view indexing. Evaluating a single map/reduce function is comparatively minuscule, so grouping views in a single ddoc makes that more efficient.
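The trade-off described here can be sketched as a rough cost model. This is only an illustration: the `indexing_cost` function and the relative costs `ship_cost` and `eval_cost` are made-up assumptions, chosen so that shipping a doc to couchjs is much more expensive than evaluating one map function on it.

```python
def indexing_cost(num_docs, views_per_ddoc, ship_cost=1.0, eval_cost=0.05):
    """Rough relative cost of a full index build.

    views_per_ddoc: list holding the number of views in each ddoc.
    ship_cost: assumed cost of serialising and shipping one doc, paid once per ddoc.
    eval_cost: assumed cost of running one map function on one doc.
    """
    shipping = len(views_per_ddoc) * num_docs * ship_cost
    evaluating = sum(views_per_ddoc) * num_docs * eval_cost
    return shipping + evaluating


# 30 views in a single ddoc versus 30 ddocs with one view each.
grouped = indexing_cost(100_000, [30])
split = indexing_cost(100_000, [1] * 30)
print(f"grouped={grouped:.0f} split={split:.0f}")
```

With these assumed numbers the evaluation cost is identical in both layouts and only the shipping term grows with the number of ddocs, which is the argument for grouping. Real ratios would need to come from a benchmark.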
So here is another approach we could consider: Splitting views into a small collection of ddocs based on how frequently they get used.
We could work this out by querying access logs, but there are some really obvious ways to break this down:
Views that are used constantly (e.g. docs_by_id_lineage) will be run between every single document, and so should be in as small a ddoc as possible.
It's also worth noting that this is only relevant on the server side: PouchDB generates indexes per view on the client side.
OK, I quickly looked at a couple of projects to see how frequently they request views:
cat /srv/storage/medic-core/couchdb/logs/couch.log | grep "medic/_design" | grep "/_view/" | sed "s/.*medic\/_design\([^ ?]*\).*/\1/" | sort | uniq -c | sort -r
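For anyone who would rather post-process the log in a script, the same count can be sketched in Python. This mirrors the grep and sed steps above; the `count_view_requests` helper and the sample log lines are hypothetical, and the real couch.log line format may differ.

```python
import re
from collections import Counter

# Same capture as the sed expression: everything after the last
# "medic/_design" up to the first space or "?".
VIEW_PATH = re.compile(r".*medic/_design([^ ?]*)")


def count_view_requests(lines):
    """Count requests per view path, keeping only lines both greps would keep."""
    counts = Counter()
    for line in lines:
        if "medic/_design" not in line or "/_view/" not in line:
            continue
        match = VIEW_PATH.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Hypothetical log lines, for illustration only.
sample = [
    "GET /medic/_design/medic/_view/contacts_by_depth?keys=x 200",
    "GET /medic/_design/medic-client/_view/forms 200",
    "GET /medic/_design/medic/_view/contacts_by_depth 200",
    "GET /medic/_all_docs 200",
]
counts = count_view_requests(sample)
for path, n in counts.most_common():
    print(n, path)
```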
For an older pre-lineage mobile project I saw this:
13090 /medic/_view/contacts_by_depth
13086 /medic/_view/docs_by_replication_key
1526 /medic-client/_view/data_records_read_by_type
833 /medic-client/_view/reports_by_subject
640 /medic-client/_view/doc_summaries_by_id
601 /medic-client/_view/contacts_by_type
523 /medic-client/_view/contacts_by_type_freetext
420 /medic-client/_view/contacts_by_parent_name_type
164 /medic/_view/due_tasks
43 /medic-client/_view/forms
26 /medic-client/_view/doc_by_type
1 /medic/_view/usage_stats_by_year_month
1 /medic-client/_view/contacts_by_type_index_name
And for a newer SMS project I saw this:
23462 /medic-client/_view/data_records_by_type
13574 /medic-client/_view/docs_by_id_lineage
4643 /medic-client/_view/doc_summaries_by_id
1882 /medic-client/_view/reports_by_date
954 /medic/_view/tasks_messages
577 /medic-client/_view/contacts_by_phone
439 /medic-client/_view/messages_by_contact_date
437 /medic/_view/patient_by_patient_shortcode_id
322 /medic/_view/registered_patients
248 /medic-client/_view/reports_by_form
164 /medic/_view/due_tasks
140 /medic-client/_view/reports_by_freetext
137 /medic-client/_view/contacts_by_type
84 /medic-client/_view/doc_by_type
77 /medic-client/_view/reports_by_subject
62 /medic-client/_view/contacts_by_parent
51 /medic-client/_view/forms
7 /medic/_view/visits_by_district_and_patient
7 /medic/_view/delivery_reports_by_district_and_code
4 /medic/_view/data_records
4 /medic-client/_view/contacts_by_freetext
1 /medic/_view/usage_stats_by_year_month
1 /medic-client/_view/reports_by_place
1 /medic-client/_view/data_records
We should definitely benchmark this, especially once we have a benchmark suite for replication, and once we do some sentinel performance testing.
But, it looks pretty obvious to me that we could at least put: /medic/contacts_by_depth, /medic/docs_by_replication_key, /medic-client/data_records_by_type, and /medic-client/docs_by_id_lineage into one or two separate ddocs.
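As a rough way to pick candidates from counts like these, one could flag every view that takes more than some share of all view requests. The sketch below uses the counts from the older project above; the `hot_views` helper and the 20% `min_share` threshold are arbitrary assumptions for illustration.

```python
# Request counts copied from the older, pre-lineage project above.
counts = {
    "/medic/_view/contacts_by_depth": 13090,
    "/medic/_view/docs_by_replication_key": 13086,
    "/medic-client/_view/data_records_read_by_type": 1526,
    "/medic-client/_view/reports_by_subject": 833,
    "/medic-client/_view/doc_summaries_by_id": 640,
    "/medic-client/_view/contacts_by_type": 601,
    "/medic-client/_view/contacts_by_type_freetext": 523,
    "/medic-client/_view/contacts_by_parent_name_type": 420,
    "/medic/_view/due_tasks": 164,
    "/medic-client/_view/forms": 43,
    "/medic-client/_view/doc_by_type": 26,
    "/medic/_view/usage_stats_by_year_month": 1,
    "/medic-client/_view/contacts_by_type_index_name": 1,
}


def hot_views(counts, min_share=0.2):
    """Return the views whose share of all view requests is at least min_share."""
    total = sum(counts.values())
    return [view for view, n in counts.items() if n / total >= min_share]


print(hot_views(counts))
```

For this project only contacts_by_depth and docs_by_replication_key clear the threshold, which matches the eyeballed shortlist. Request frequency is only a proxy for indexing cost, so this would still need the benchmark to confirm.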
Thoughts @garethbowen ?
Splitting views by use case makes a lot of sense to me. I think docs_by_id_lineage will be used by everyone so I don't think that works, but what about this...
medic or medic-client ddocs: Core views that all projects use, e.g. reports_by_freetext, forms
medic-gateway ddoc: The views only needed for SMS projects, e.g. tasks_messages, due_tasks, contacts_by_phone
medic-replication ddoc: The views only needed for smartphone projects, e.g. contacts_by_depth, docs_by_replication_key
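A split like this could be expressed as a simple mapping from ddoc name to view names, from which the design docs are generated. This is a minimal sketch: the `build_ddocs` helper and the placeholder map functions are hypothetical, and putting the core views under medic-client rather than medic is an arbitrary choice for the example.

```python
# Proposed grouping, using the ddoc and view names from this comment.
PROPOSED_SPLIT = {
    "medic-client": ["reports_by_freetext", "forms"],
    "medic-gateway": ["tasks_messages", "due_tasks", "contacts_by_phone"],
    "medic-replication": ["contacts_by_depth", "docs_by_replication_key"],
}


def build_ddocs(split, map_functions):
    """Build one design doc per group from a {view_name: map_source} lookup."""
    ddocs = []
    for ddoc_name, view_names in split.items():
        ddocs.append({
            "_id": f"_design/{ddoc_name}",
            "views": {name: {"map": map_functions[name]} for name in view_names},
        })
    return ddocs


# Placeholder map sources; the real functions live in the existing ddocs.
placeholder_maps = {
    name: "function (doc) { /* existing map function */ }"
    for names in PROPOSED_SPLIT.values()
    for name in names
}
ddocs = build_ddocs(PROPOSED_SPLIT, placeholder_maps)
for ddoc in ddocs:
    print(ddoc["_id"], sorted(ddoc["views"]))
```

Keeping the grouping in one table like this would also make it cheap to try a frequency-based split later and compare the two in a benchmark.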
Projects that do both use cases may be slightly worse off but we don't have many (any?) of those right now, except maybe some standard ones and they're small by definition so should be ok.
@garethbowen that is definitely one approach we could take to splitting these up. I'm hoping that once we have a benchmark suite we can make better decisions. It might be that they can be split up by function like you suggest. It might be that we actually want to split them up by frequency instead, which may look random and not line up as cleanly as your example does.
To some extent, frequency (f) seems to be a product of function, e.g. projects without SMS should have f=0 for the proposed medic-gateway ddoc.
Re medic-client: it obviously makes sense to keep all views together for use on the client, but might there still be an advantage to pulling views that the server uses heavily into separate ddoc(s)?
Not blocking for 4.0.0
We should have a proper discussion about the pros and cons of this approach.
I'll go first.
Pros:
Cons:
As a reference, I took an example branch manager's database, and manually put each view in its own ddoc. Here are the sizes.
From this we can make some rough statements:
This seems like a pretty worthwhile investment to me. I'm pretty sure contacts_by_type_freetext isn't even really accessible by anyone anymore (because search has been simplified), so it would save us ~25% out of the gate.