medic / cht-pipeline

CHT Pipeline is a tool used to define data models for transforming the raw data we get from Couch DB into models that can then be queried to build dashboards
GNU General Public License v3.0
1 stars 4 forks source link

BRAC Postgres config vs dbt config results comparison #10

Closed billwambua closed 3 weeks ago

billwambua commented 2 years ago

Compare results from the brac config in rdbms vs the dbt config in http://cht-pipeline-test.caseqzeqqeog.eu-west-2.rds.amazonaws.com/.

Compare results for:

get_dhis2_data()
get_dashboard_data_hh_brac()
get_dashboard_data_anc_impact()
get_fp_data()
get_hmis_data()
get_dashboard_data()
get_muted_contacts()
get_assessment_data()
get_dashboard_family_registration()
get_dashboard_data_iccm_impact()
get_dashboard_data_iccm_impact_u1()

They should be the same.

billwambua commented 2 years ago

Selection_087

Found 1568/6396 matches for the time period 2021-03-01 to 2021-03-03 for the get_hmis_data function using hash matches.

billwambua commented 2 years ago

Selection_088

For 2021-01-01 to 2021-01-30, 1220 out of 6396 records match.

billwambua commented 2 years ago

For Q1 2021(2021-01-01 to 2021-01-31), the matching records are 4076 out of 19188.

Selection_089

derickl commented 2 years ago

@wambua, it's likely this was cloned from brac.dev. Theres a dump of the couchdb table in brac-rdbms.dev under /srv/backup/dump/ that you can use to restore into the cht-pipeline database.

billwambua commented 2 years ago

After importing the new brac_ug DB 9194 out of 19194 records match, these are the results:

    SELECT t.chp_uuid, t.date, md5((ROW(t.*)::TEXT)) as hash
    FROM get_hmis_data('2021-01-01', '2021-03-31') t ;

9194  out of  19194  records from Brac-ug DB match cht_pipeline_test dbt DB
Time taken: 5272.167739152908
PostgreSQL connection is closed
derickl commented 2 years ago

It would be helpful if you also checked whether they returned the same number of rows

billwambua commented 2 years ago

19188 records for cht_pipeline_test vs 19194 records for Brac. Difference of 6 rows.

billwambua commented 2 years ago

The problem is these columns

num_of_hh_bed_nets  
num_of_hh_latrines  
num_of_hh_safe_water

Which are empty in the Brac DB pull but have data when pulled from CHT-DataPipeline

All the other columns in the get_hmis_data pull match except those three.

LEFT JOIN 
(

    SELECT  
        chw, 
        area_uuid,  
        date_trunc('month',reported) :: date  AS reported_month, 
        SUM((mosquito_nets ='true')::int) AS num_of_hh_bed_nets, 
        SUM((hygeinic_toilet ='true')::int) AS num_of_hh_latrines, 
        SUM((COALESCE(source_of_drinking_water,'') != 'spring')::int) AS num_of_hh_safe_water
    FROM dbt.useview_household_survey

    GROUP BY 
        chw, 
        area_uuid, 
        reported_month 

) AS family_survey ON(period_chp.date = family_survey.reported_month  AND  period_chp.area_uuid =family_survey.area_uuid)

More here - https://github.com/derickl/cht-pipeline/blob/08ca9d8d51b3c17361e2fd6ee5fa00b7aba3e3c9/models/brac/brac_hmis.sql#L74

Trying to see why it is behaving like that

derickl commented 2 years ago

Nice find @billwambua! (i hope the column comparisons are scripted to make it easier to do the same across the other datasets) That would be weird since the same data is being used across.

Could you check the definition of what is actually running on postgres and compare that with what's on github?

billwambua commented 2 years ago

@derickl checking. Since we did some modifications on the function, I copied the same exact function which is on Brac to CHT-Pipeline test and did a pull - the data from those indicated columns again did not match.

I checked the underlying useview for 1 chv and the data matches, but the function pull does not match.

I'm still trying to understand why it does not match when pulled via the function.

billwambua commented 2 years ago

Got it: The definition of useview_household_survey in brac-ug is different from what is on Github.

On Github and consequently on the cht_pipeline_test config, the definition of the fields chw and area_uuid is:

doc #>> '{parent,contact,_id}'::text[] AS chw,
doc #>> '{parent,_id}'::text[] AS area_uuid,

While on the brac-ug DB, it is defined as

doc #>> '{parent,contact,_id}'::text[] AS chw,
doc #>> '{parent,contact,parent,_id}'::text[] AS area_uuid,

Notice the difference in area_uuid.

This is causing the Brac-DB to show empty fields for the area_uuid field which is used in joins in get_hmis_data and so the num_of_hh_bed_nets and num_of_hh_latrines and num_of_hh_safe_water fields are all empty when pulled from brac-ug database.

@jonathanbataire you need to update the useview_household_survey materialized view to use the latest config on Github.

billwambua commented 2 years ago

As for get_assessment_data, for Q1 2021, 7213 out of 9737 records match.

SELECT t.chw_uuid, t.month, md5((ROW(t.*)::TEXT)) as hash
    FROM get_assessment_data('2021-01-01', '2021-03-03') t ;

7213  out of  9737  records from Brac-ug DB match cht_pipeline_test dbt DB
Time taken: 2695.9290194511414

Issue seems to be the column percent_mrdt which is rendered as a float. cht_pipeline_test is giving 16 decimal places while brac-ug is giving 15 decimal places. e.g for chw_uuid - ff7a5739-53f4-41c5-99d3-f8eeb2f6da92 and month - 202101, the return is:

0.307692307692308 (brac-ug)

0.3076923076923077 (cht_pipeline_test)

If I do the below query on both brac-ug and cht_pipeline_test servers, I get 0.3076923076923077 for both. So I'm a bit stumped as to why this is happening.

select ((4+0)::float/(13+0+0)::float)::float

The code excerpt that does the division is this:

 CASE
    WHEN (COALESCE(ASSESS.mrdt_positive,0) + COALESCE(ASSESS.mrdt_negative,0) + COALESCE(ASSESS.mrdt_none,0)) = 0
    THEN 0::float
    ELSE
      (COALESCE(ASSESS.mrdt_positive,0) + COALESCE(ASSESS.mrdt_negative,0))::float / (COALESCE(ASSESS.mrdt_positive,0) + COALESCE(ASSESS.mrdt_negative,0) + COALESCE(ASSESS.mrdt_none,0))::float
  END AS percent_mrdt,
billwambua commented 2 years ago

For get_dashboard_data, 0 rows match exactly.

Picking one row and doing investigations, i.e Kibuli Q1 2021, I used the Pregnancies Registered data point. It uses the MAX(EDD) to determine the latest pregnancy. The issue with this is if say a mother has a pregnancy with LMP early Year 1 with delivery in late Year 1 then has another pregnancy in mid Year 2, then only the second pregnancy will be counted.

That is the issue with using MAX only without a JOIN to the corresponding Delivery and making that a group.

If you do the below query for patient Atuhaire Diana with patient_id 6bf02d77-c47e-4c30-ae6f-811fb94cdf47 in brac-ug.

SELECT
              chw AS CHW_UUID,
              patient_id AS patient_id,
              date(edd) AS EDD

            FROM
               useview_pregnancy
            WHERE patient_id = '6bf02d77-c47e-4c30-ae6f-811fb94cdf47'

You get:

chw_uuid                                                         patient_id                                                        edd
0b18c388-4eb8-432f-92f6-c7bf35d13ef8    6bf02d77-c47e-4c30-ae6f-811fb94cdf47    2023-02-15
0b18c388-4eb8-432f-92f6-c7bf35d13ef8    6bf02d77-c47e-4c30-ae6f-811fb94cdf47    2021-02-17

But the below query to check for Q1 2021 gives a blank.

SELECT
              chw AS CHW_UUID,
              patient_id AS patient_id,
              max(date(edd)) AS EDD

            FROM
               useview_pregnancy
            WHERE patient_id = '6bf02d77-c47e-4c30-ae6f-811fb94cdf47'

            GROUP BY
              CHW_UUID,
              PATIENT_ID

            HAVING
              max(date(edd)) >= (date_trunc('day','2021-01-01'::TIMESTAMP)) AND

              CASE
                WHEN date_trunc('day','2021-03-31'::TIMESTAMP) = date_trunc('day',now())
                THEN max(date(edd)) <= ('2021-03-31'::TIMESTAMP - '3 days'::interval)
                ELSE max(date(edd)) <= ('2021-03-31'::TIMESTAMP + '1 day'::interval)
              END
Phil-Mwago commented 6 months ago

@njuguna-n is this still relevant? Can I close it?

andrablaj commented 6 months ago

@Phil-Mwago, when you re-open issues, please change the status to TODO in the board, as that is not done automatically.