medic / cht-pipeline

CHT Pipeline is a tool used to define data models that transform the raw data we get from CouchDB into models that can then be queried to build dashboards.
GNU General Public License v3.0

Create base models for CHT #90

Closed by andrablaj 2 months ago

garethbowen commented 3 months ago

My ideal process for development of these models is to start with a small part of the overall model and build it out end to end so it can be tested, and then expand to the rest of the model. If we get bogged down trying to perfectly design every single field in every CouchDB doc then it'll slow down the develop > test > iterate cycle too much.

I would recommend picking the "report" type as the first one as it's a fundamental component of everything we do, and picking a basic query that could be useful based on that, perhaps "pregnancies reported per month" or something. Then the way I would develop this model is starting with the outcome we want, making sure it performs, and then working out how to get DBT to create that outcome. Off the top of my head these steps are...

  1. Design an Entity Relationship Diagram for reports, which might include entities for: report, field (the configurable part, consider "attribute" table), author, client, etc. This should take inspiration from FHIR for naming because it'll make life easier later, but the structure should be normalized entities. Run this past the team.
  2. Figure out how to set up DBT to load data from JSON into this structure.
  3. Then test: first, how long it takes DBT to transform a large number of docs, and then run an analysis of the query against the DB. Consider which indexes should be added. Document the results. If either of these fails at Kenya scale, go back to step one. Otherwise move on to the next entity (maybe contact?).
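
As a rough illustration of step 2, a dbt model along these lines could project raw CouchDB docs into a normalized report entity. This is a sketch only; the model name, the assumed `couchdb` source with a jsonb `doc` column, and the column choices are all hypothetical, not the actual cht-pipeline schema:

```sql
-- models/report.sql (hypothetical): one row per report doc
SELECT
  doc ->> '_id'                     AS uuid,
  doc ->> '_rev'                    AS rev,
  doc ->> 'form'                    AS form,
  (doc ->> 'reported_date')::bigint AS reported_date,
  doc #>> '{contact,_id}'           AS author_id
FROM {{ ref("couchdb") }}
WHERE doc ->> 'type' = 'data_record'
```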

I would estimate each of these steps to take around a day of development, speeding up as we get better at doing it.

The ultimate goal is to have a complete ERD for all CouchDB data that's well normalised and easy to write queries for things we haven't thought of.

witash commented 3 months ago

see https://github.com/medic/cht-docs/pull/1394/files#diff-bbc7fb898b20d46d7c52dc093e06dc63093e3ed4469bdb2e2d9826a02cfe1a37 for proposed ER diagram

garethbowen commented 3 months ago

@witash Thank you, that's great!

My main suggestion would be to find a different structure for the fields attribute so we don't use jsonb at all. It's likely to be a heavily used attribute and I've never been able to get jsonb to perform as well as "proper" entities. My preference would be to use a tall table to store the report id (and rev?), the full path to the field name, and the field value serialized as a string. Although I haven't performance tested this I would expect it to scale better than jsonb queries.
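A sketch of what that tall table could look like, assuming a `data_record` model with a jsonb `fields` column (model and column names hypothetical). Postgres's `jsonb_each_text` flattens one level of keys; deeply nested fields would need recursion or explicit paths:

```sql
-- models/report_field.sql (hypothetical): one row per (report, field)
SELECT
  dr.uuid  AS report_id,
  f.key    AS field_path,
  f.value  AS field_value  -- serialized as text
FROM {{ ref("data_record") }} dr,
     LATERAL jsonb_each_text(dr.fields) AS f(key, value)
```

With an index on `(field_path, field_value)` this shape tends to support lookups that jsonb containment queries struggle with, at the cost of extra rows and storage.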

witash commented 3 months ago

the idea was to have downstream models, one per form type, and to pull fields out of the jsonb column into proper columns there. So instead of querying data_record and using the fields jsonb column, queries would use the actual columns from the downstream form models. The example from the docs PR was:

```sql
SELECT
  uuid,

  NULLIF(fields ->> 'lmp_date', '')::date AS lmp, -- cast lmp string to date, or NULL if empty
  NULLIF(fields ->> 'edd', '')::date AS edd,      -- cast edd string to date, or NULL if empty

  fields ->> 'lmp_method' AS lmp_method,
  fields ->> 'danger_signs' AS danger_signs,
  fields ->> 'risk_factors' AS risk_factors,

  -- extract the ANC visit number
  CASE
    WHEN fields ->> 'anc_visit_identifier' <> ''
      THEN (fields ->> 'anc_visit_identifier')::int
    WHEN fields #>> '{group_repeat,anc_visit_repeat,anc_visit_identifier}' <> ''
      THEN RIGHT(fields #>> '{group_repeat,anc_visit_repeat,anc_visit_identifier}'::text[], 1)::int
    ELSE 0 -- if not included, default to 0
  END AS anc_visit

FROM
  {{ ref("data_record") }} couchdb
WHERE
  form = 'pregnancy'
```

in the docs example this was materialized as a view; it might be better to have them be proper tables.
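For context, switching a dbt model between a view and a table is a one-line config change at the top of the model file (this is standard dbt, not something specific to this repo):

```sql
{{ config(materialized='table') }}
```

A table trades extra storage and build time for faster reads, which matters if these per-form models are queried heavily by dashboards.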

having all possible fields as columns in data_record is another option. it would avoid double storage, but it would prevent data_record from being a "root model" that's expected to be unchanged between apps (which will have different form fields), and the table could get large with all possible fields in it.

njuguna-n commented 3 months ago

The data_record model continues to be the bottleneck in terms of performance (see logs below from the same setup described here), taking 90% of the total running time. The main worry is that as we add models we will continue to encounter models that have an outsized effect on DBT performance. I'll explore a few more optimizations for the model before exploring alternatives to DBT and how best to handle these data transformations.

```
15:49:03  1 of 5 START sql incremental model v1.stable_couchdb ........................... [RUN]
16:04:05  1 of 5 OK created sql incremental model v1.stable_couchdb ...................... [INSERT 0 3491906 in 901.89s]
16:04:05  2 of 5 START sql incremental model v1.contact .................................. [RUN]
16:07:48  2 of 5 OK created sql incremental model v1.contact ............................. [INSERT 0 208541 in 223.33s]
16:07:48  3 of 5 START sql incremental model v1.data_record .............................. [RUN]
20:26:16  3 of 5 OK created sql incremental model v1.data_record ......................... [INSERT 0 1516258 in 15508.21s]
20:26:16  4 of 5 START sql incremental model v1.person ................................... [RUN]
20:31:55  4 of 5 OK created sql incremental model v1.person .............................. [INSERT 0 649662 in 338.32s]
20:31:55  5 of 5 START sql incremental model v1.chp_with_branches ........................ [RUN]
20:34:50  5 of 5 OK created sql incremental model v1.chp_with_branches ................... [INSERT 0 0 in 175.83s]
```

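The logs show these are already incremental models. For readers unfamiliar with the pattern, the general shape of a dbt incremental model is below; the `uuid` and `saved_timestamp` column names are hypothetical placeholders, not necessarily what cht-pipeline uses:

```sql
{{ config(materialized='incremental', unique_key='uuid') }}

SELECT *
FROM {{ ref("stable_couchdb") }}
{% if is_incremental() %}
  -- only process docs changed since the last successful run
  WHERE saved_timestamp > (SELECT MAX(saved_timestamp) FROM {{ this }})
{% endif %}
```

Even with incrementality, a model can stay slow if the per-row transform (e.g. heavy jsonb extraction) dominates, which is consistent with data_record dwarfing the other models here.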
andrablaj commented 2 months ago

@njuguna-n @witash is there anything else to do related to this issue as the linked PR was merged?

njuguna-n commented 2 months ago

@andrablaj nothing else to do. Closing the issue.