Closed by andrablaj 2 months ago
@witash Thank you, that's great!
My main suggestion would be to find a different structure for the `fields` attribute so we don't use jsonb at all. It's likely to be a heavily used attribute and I've never been able to get jsonb to perform as well as "proper" entities. My preference would be to use a tall table to store the report id (and rev?), the full path to the field name, and the field value serialized as a string. Although I haven't performance tested this I would expect it to scale better than jsonb queries.
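To make the tall-table suggestion concrete, here is a minimal sketch of the shape being proposed. This hasn't been performance tested either; the table and column names (`report_field`, `report_uuid`, etc.) are illustrative, not an agreed schema:

```sql
-- Hypothetical tall table: one row per (report, field path) pair.
CREATE TABLE report_field (
    report_uuid TEXT NOT NULL,   -- report id
    report_rev  TEXT,            -- optional revision
    field_path  TEXT NOT NULL,   -- full path to the field, e.g. 'group_repeat.anc_visit_repeat.anc_visit_identifier'
    field_value TEXT,            -- field value serialized as a string
    PRIMARY KEY (report_uuid, field_path)
);

-- Queries that filter by field name would likely want an index on the path.
CREATE INDEX report_field_path_idx ON report_field (field_path);
```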
The idea was to have downstream models, one per form type, and to pull fields out of the `fields` jsonb column into proper columns there. So instead of querying `data_record` and using the `fields` jsonb column, consumers would use the actual columns from the downstream form models.
The example from the docs PR was:

```sql
SELECT
  uuid,
  NULLIF(fields ->> 'lmp_date', '')::date AS lmp,  -- cast lmp string to date, or NULL if empty
  NULLIF(fields ->> 'edd', '')::date AS edd,       -- cast edd string to date, or NULL if empty
  fields ->> 'lmp_method' AS lmp_method,
  fields ->> 'danger_signs' AS danger_signs,
  fields ->> 'risk_factors' AS risk_factors,
  -- extract the ANC visit number
  CASE
    WHEN fields ->> 'anc_visit_identifier' <> ''
      THEN (fields ->> 'anc_visit_identifier')::int
    WHEN fields #>> '{group_repeat,anc_visit_repeat,anc_visit_identifier}' <> ''
      -- take the last character of the identifier as the visit number
      THEN RIGHT(fields #>> '{group_repeat,anc_visit_repeat,anc_visit_identifier}', 1)::int
    ELSE 0  -- if not included, default to 0
  END AS anc_visit
FROM
  {{ ref("data_record") }} couchdb
WHERE
  form = 'pregnancy'
```
In the docs example this was materialized as a view; it might be better to materialize these form models as proper tables.
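If tables are preferred, this is controlled by a config block at the top of each form model. A minimal sketch (the default for the docs example would have been `view`):

```sql
{{ config(
    materialized='table'
) }}
```

Whether `table` or `incremental` is the better choice here probably depends on how large the per-form models get and how the full-refresh time compares.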
Having all possible fields as columns in `data_record` is another option. It would avoid double storage, but it would prevent `data_record` from being a "root model" that's expected to be unchanged between apps (which will have different form fields), and the table could get large with all possible fields in it.
The `data_record` model continues to be the bottleneck in terms of performance (see logs below, from the same setup described here), taking 90% of the total running time. The main worry is that as we add models we will continue to encounter models that have an outsized effect on DBT performance. I'll explore a few more optimizations for this model before exploring alternatives to DBT and how best to handle these data transformations.
```
15:49:03 1 of 5 START sql incremental model v1.stable_couchdb ........................... [RUN]
16:04:05 1 of 5 OK created sql incremental model v1.stable_couchdb ...................... [INSERT 0 3491906 in 901.89s]
16:04:05 2 of 5 START sql incremental model v1.contact .................................. [RUN]
16:07:48 2 of 5 OK created sql incremental model v1.contact ............................. [INSERT 0 208541 in 223.33s]
16:07:48 3 of 5 START sql incremental model v1.data_record .............................. [RUN]
20:26:16 3 of 5 OK created sql incremental model v1.data_record ......................... [INSERT 0 1516258 in 15508.21s]
20:26:16 4 of 5 START sql incremental model v1.person ................................... [RUN]
20:31:55 4 of 5 OK created sql incremental model v1.person .............................. [INSERT 0 649662 in 338.32s]
20:31:55 5 of 5 START sql incremental model v1.chp_with_branches ........................ [RUN]
20:34:50 5 of 5 OK created sql incremental model v1.chp_with_branches ................... [INSERT 0 0 in 175.83s]
```
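One possible direction for the optimizations mentioned above is tuning the incremental config on `data_record`. This is a sketch of the kind of change worth testing, not a verified fix; it assumes the dbt-postgres adapter, and `uuid` / `form` are illustrative column choices:

```sql
{{ config(
    materialized='incremental',
    unique_key='uuid',
    -- delete+insert can be cheaper than the default merge-style update on postgres
    incremental_strategy='delete+insert',
    -- an index on form would speed up the downstream per-form models
    indexes=[
      {'columns': ['form'], 'type': 'btree'}
    ]
) }}
```

Whether any of these actually help would need to be measured against the same setup as the logs above.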
@njuguna-n @witash is there anything else to do related to this issue, now that the linked PR was merged?
@andrablaj nothing else to do. Closing the issue.
My ideal process for development of these models is to start with a small part of the overall model and build it out end to end so it can be tested, and then expand to the rest of the model. If we get bogged down trying to perfectly design every single field in every CouchDB doc then it'll slow down the develop > test > iterate cycle too much.
I would recommend picking the "report" type as the first one as it's a fundamental component of everything we do, and picking a basic query that could be useful based on that, perhaps "pregnancies reported per month" or something. Then the way I would develop this model is starting with the outcome we want, making sure it performs, and then working out how to get DBT to create that outcome. Off the top of my head these steps are...
I would estimate each of these steps to take around a day of development, speeding up as we get better at doing it.
The ultimate goal is to have a complete ERD for all CouchDB data that's well normalised and easy to write queries for things we haven't thought of.