Generate random test data that mimic a large OpenMRS HIV dataset

jecihjoy commented 4 years ago

@kimaina to calculate these summary statistics to help in the generation of data

For virally suppressed and unsuppressed patients calculate the average number of encounters per year
Average number of LTFUs per year, days since rtc > 30
Assume all patients are on ARVs

kimaina commented 4 years ago

Thanks, @jecihjoy for documenting this!

kimaina commented 4 years ago

Please find attached monthly summary statistics for key metrics in the past 5 years (2015-2020). To improve granularity, I have provided monthly instead of yearly summaries so that when we generate the synthetic dataset it can reflect this distribution and pattern.

encounters_monthly_aggregates.xlsx

Here is the SQL query used to generate (FYI this is just a draft and not final):

SELECT 
    date_format(encounter_datetime,'%m-%Y') as month_year,
    MONTH(encounter_datetime) as month,
    YEAR(encounter_datetime) as year,    
    COUNT(DISTINCT `encounter_id`) AS distinct_encounters,
    COUNT(DISTINCT `visit_id`) AS distinct_visits,
    COUNT(DISTINCT if(prev_clinical_datetime_hiv is null,`person_id`,null)) AS newly_enrolled,
    COUNT(DISTINCT if(encounter_datetime >= arv_start_date,`person_id`,null)) as patients_on_arvs,
    COUNT(DISTINCT if((encounter_datetime < arv_start_date  or arv_start_date is null),`person_id`,null)) as patients_not_on_arvs,
    COUNT(DISTINCT if(vl_1 < 1000, `encounter_id` ,null)) as vl_suppressed_encounters,
    COUNT( DISTINCT if(vl_1 >= 1000, `encounter_id`,null)) as vl_failure_encouunters,
    COUNT( DISTINCT if( timestampdiff(day, if(rtc_date, rtc_date, DATE_ADD(encounter_datetime, INTERVAL 30 DAY)), endDate) > 90,  `person_id`, null)) as LTFU

FROM
    etl.dates 
JOIN etl.flat_hiv_summary_v15b  on date(encounter_datetime) <= date(endDate)
WHERE is_clinical_encounter = 1  and (next_clinical_datetime_hiv is null or date(next_clinical_datetime_hiv) > endDate) and
encounter_datetime >= timestamp(date("2015-01-01"))  #and encounter_datetime<= date('2020-03-01')
group by MONTH(encounter_datetime),  YEAR(encounter_datetime)
order by encounter_datetime desc

ibacher commented 4 years ago

@kimaina Thanks!

For generating VL values, from what I've understood, AMPATH pretty much consistently has actual VL values reported in the resulting Obs, but this isn't the case for all implementations. See this comment by Mike Seaton and just my general overview of the landscape I reviewed. So the question is: do we want to simulate data where we always have a numeric VL result or should we include some number of these non-numeric results? Opinions?

kimaina commented 4 years ago

So the question is: do we want to simulate data where we always have a numeric VL result or should we include some number of these non-numeric results? Opinions?

I think we should simulate both quantitative and qualitative VL results since the majority of implementers have both. From my experience working on this, we normally have 2 different concepts representing either the quantitative value or the VL qualifier (LDL or <300).

jonathandick commented 4 years ago

I'm surprised but I supposed it's possible that people are recording this categorically. Does anyone have an example of a ciel concept that people are using to record this?

On Wed, Nov 11, 2020 at 9:35 AM Allan Kimaina notifications@github.com wrote:

So the question is: do we want to simulate data where we always have a numeric VL result or should we include some number of these non-numeric results? Opinions?

I think we should simulate both quantitative and qualitative VL results since the majority of implementers have both. From my experience working on this, we normally have 2 different concepts representing either the quantitative value or the VL qualifier (LDL or <300).

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/GoogleCloudPlatform/openmrs-fhir-analytics/issues/59#issuecomment-725559089, or unsubscribe https://github.com/notifications/unsubscribe-auth/AABCFHVNIOXXSSV2EWSDNSLSPLDM7ANCNFSM4TQ6GYPQ .

-- Jonathan Dick, MD Chief Medical Information Officer, AMPATH Assistant Professor of Clinical Medicine, IU School of Medicine Research Scientist, Regenstrief Institute Inc.

ibacher commented 4 years ago

@jonathandick CIEL:1305 (qualitative VL) is the base concept for qualitative VLs and at least CIEL:1302 (Not detected) and CIEL:1306 (Beyond detectable limit) are used to record suppressed VLs in KenyaEMR and UgandaEMR respectively. No idea how PIH are representing the "< 300" or "< 500" part. Theoretically, OpenMRS has a ComplexNumeric datatype to handle those, but AFAIK it's never actually been used.

ibacher commented 4 years ago

@kimaina Could you get us an estimate of the # encounters and visits / year (or month) for suppressed versus non-suppressed patients? JJ had indicated that VL non-suppressed patients will be seen more regularly and I'd like a way to estimate that, if possible. Thanks!

kimaina commented 4 years ago

@kimaina Could you get us an estimate of the # encounters and visits / year (or month) for suppressed versus non-suppressed patients? JJ had indicated that VL non-suppressed patients will be seen more regularly and I'd like a way to estimate that, if possible. Thanks!

Hey @ibacher,I have provided these 4 estimates in the above excel --> distinct_encounters, distinct_visits, vl_suppressed_encounters, and vl_failure_encounters. I'll rename the above columns to be more understandable, in the next version

kimaina commented 4 years ago

@ibacher do you prefer monthly aggregates or yearly aggregates?

ibacher commented 4 years ago

@kimaina Ah! I see... then do I have a way of estimating the percent of patients suppressed vs patients unsuppressed?

do you prefer monthly aggregates or yearly aggregates?

Monthly is generally more useful, I just wasn't sure if a single patient was likely to have multiple visits per month or not.

kimaina commented 4 years ago

I'll also add-in

the number of unique patients per months
of these patients how many are suppressed
and how many are not

@kimaina Ah! I see... then do I have a way of estimating the percent of patients suppressed vs patients unsuppressed?

kimaina commented 4 years ago

Monthly is generally more useful, I just wasn't sure if a single patient was likely to have multiple visits per month or not.

We can also do yearly + monthly aggregates

kimaina commented 4 years ago

I have updated the query to include a few more metrics as well as some of @jonathandick and @ibacher suggestions e.g :

How likely is it that a patient will remain active in care this month?
How likely is it that a patient will get a viral load test this month if the patient is eligible for a test?
How likely is it that the test will be < 1000?

This is still a draft and I still need to add @ibacher suggestions on

% of patients LTFU who return to care and, ideally, the meantime to return to care.

SELECT 
    date_format(endDate,'%m-%Y') as month_year,
    # Visits 
    SUM(hmsd.visit_this_month) AS `visit_this_month`,
    SUM(hmsd.scheduled_this_month) AS `scheduled_this_month`,
    SUM(hmsd.unscheduled_this_month) AS `unscheduled_this_month`,
    SUM(hmsd.enrolled_this_month) AS `enrolled_this_month`,

    # LTFU and active in care
    COUNT( DISTINCT `person_id` ) as distinct_patients,
    SUM(hmsd.active_in_care_this_month) AS `active_in_care_this_month`,
    AVG(hmsd.days_since_rtc_date) AS `avg_days_since_rtc_date`,
    COUNT( DISTINCT if( timestampdiff(day, if(rtc_date, rtc_date, DATE_ADD(encounter_date, INTERVAL 30 DAY)), endDate) > 90,  `person_id`, null)) as LTFU_since_2015,
    COUNT( DISTINCT if( timestampdiff(day, if(rtc_date, rtc_date, DATE_ADD(encounter_date, INTERVAL 30 DAY)), endDate) > 90  AND TIMESTAMPDIFF(DAY, encounter_date,`endDate`) <365,  `person_id`, null)) as LTFU_past_year,
    COUNT( DISTINCT if( timestampdiff(day, if(rtc_date, rtc_date, DATE_ADD(encounter_date, INTERVAL 30 DAY)), endDate) > 90  AND TIMESTAMPDIFF(DAY, encounter_date,`endDate`) <365/2,  `person_id`, null)) as LTFU_past_6months,
    COUNT( DISTINCT if( days_since_rtc_date > 90,  `person_id`, null)) as LTFU_this_month,
    COUNT( DISTINCT if(TIMESTAMPDIFF(DAY, death_date,`endDate`) < 30,`person_id`, null)) as deaths_this_month,
    COUNT( DISTINCT if(TIMESTAMPDIFF(DAY, death_date,`endDate`) < 365,`person_id`, null)) as deaths_past_year,
    COUNT( DISTINCT if(TIMESTAMPDIFF(DAY, transfer_out_date,`endDate`) < 30,`person_id`, null)) as transfer_out_this_month,
    COUNT( DISTINCT if(TIMESTAMPDIFF(DAY, transfer_out_date,`endDate`) < 365,`person_id`, null)) as transfer_out_past_year,

    # ART Status
    SUM(hmsd.art_revisit_this_month) AS `art_revisit_this_month`,
        SUM(hmsd.is_pre_art_this_month) AS `pre_art_this_month`,
    SUM(hmsd.on_art_this_month) AS `on_art_this_month`,
         SUM(hmsd.started_art_this_month) AS `started_art_this_month`,
    SUM(had_med_change_this_month) AS `had_med_change_this_month`,

     # ART Line
     SUM(hmsd.on_original_first_line_this_month) AS `on_original_first_line_this_month`,
     SUM(hmsd.on_alt_first_line_this_month) AS `on_alt_first_line_this_month`,
     SUM(hmsd.on_second_line_or_higher_this_month) AS `on_second_line_or_higher_this_month`,

    # Viral Load
    COUNT( DISTINCT if(vl_1 < 1000 AND TIMESTAMPDIFF(DAY, vl_1_date,`endDate`) < 30, `encounter_id`, null)) as vl_suppressed_encounters_this_month,
    COUNT( DISTINCT if(vl_1 >= 1000 AND TIMESTAMPDIFF(DAY, vl_1_date,`endDate`) < 30, `encounter_id`, null)) as vl_unsuppressed_encounters_this_month,
    COUNT( DISTINCT if(vl_1 < 1000 AND TIMESTAMPDIFF(DAY, vl_1_date,`endDate`) < 30, `person_id`, null)) as vl_suppressed_patients_this_month,
    COUNT( DISTINCT if(vl_1 >= 1000 AND TIMESTAMPDIFF(DAY, vl_1_date,`endDate`) < 30, `person_id`, null)) as vl_unsuppressed_patients_this_month,
    COUNT( DISTINCT if(vl_1 < 1000 AND TIMESTAMPDIFF(DAY, vl_1_date,`endDate`) < 365, `person_id`, null)) as vl_suppressed_patients_past_year,
    COUNT( DISTINCT if(vl_1 >= 1000 AND TIMESTAMPDIFF(DAY, vl_1_date,`endDate`) <365, `person_id`, null)) as vl_unsuppressed_patients_past_year,
    COUNT( DISTINCT if(vl_1 < 1000 AND TIMESTAMPDIFF(DAY, vl_1_date,`endDate`) < 365, `encounter_id`, null)) as vl_suppressed_encounters_past_year,
    COUNT( DISTINCT if(vl_1 >= 1000 AND TIMESTAMPDIFF(DAY, vl_1_date,`endDate`) < 365, `encounter_id`, null)) as vl_unsuppressed_encounters_past_year,

    SUM(hmsd.due_for_vl_this_month) AS `due_for_vl_this_month`,
    SUM(hmsd.qualifies_for_follow_up_vl) AS `qualifies_for_follow_up_vl`,
        SUM(hmsd.got_follow_up_vl_this_month) AS `got_follow_up_vl_this_month`,
        SUM(hmsd.follow_up_vl_suppressed_this_month) AS `follow_up_vl_suppressed_this_month`,
        SUM(hmsd.follow_up_vl_unsuppressed_this_month) AS `follow_up_vl_unsuppressed_this_month`,
        SUM(hmsd.follow_up_vl_suppressed_this_month) AS `follow_up_vl_unsuppressed`,
        SUM(hmsd.follow_up_vl_suppressed_this_month) AS `follow_up_vl_suppressed`,
        AVG(hmsd.num_days_to_follow_vl) AS `avg_num_days_to_follow_vl`

FROM
    etl.hiv_monthly_report_dataset_frozen `hmsd`
WHERE
    (endDate >= '2015-01-01')
GROUP BY endDate
order by endDate desc

here is a sample xlxs 2020_monthly_aggregates.xlsx

ibacher commented 4 years ago

@kimaina Thanks for the update on this! Just to ensure this is properly documented, I'm creating a little simulator for this data over in this repo: https://github.com/ibacher/hiv-test-data.

Is it possible to run the numbers over the same 5-ish year period? From the previous data set, it looked like June 2020 - November 2020 was a bit of an outlier period (I'm guessing it's related to the current pandemic).

kimaina commented 3 years ago

hey, @ibacher I have completed generating the necessary data elements for generating the required metrics for the simulation. Quick question, do you prefer proportions or percentages?

https://github.com/kimaina/vl-indicators/blob/main/2020_monthly_aggregates_v2.csv

ibacher commented 3 years ago

I’ve been converting everything into percentages, apart from raw counts, but whatever is easiest!

kimaina commented 3 years ago

you mean proportions, right! I'm happy to provide proportions and counts

kimaina commented 3 years ago

I have completed creating the query with the recommendations, I'll be running it during off-peak to generate the dataset then get back to you

kimaina commented 3 years ago

Hi @ibacher

Please see the PR: https://github.com/ibacher/hiv-test-data/pulls

You will find:

proportions.csv containing key proportions
script.Rmd is the R script used to generate portions
2020_monthly_aggregates_v2.csv raw dataset containing over 55 indicators

ibacher commented 3 years ago

@kimaina Thanks for the PRs and all the data!

bashir2 commented 9 months ago

This is handled in #179.

google / fhir-data-pipes

Generate random test data that mimic a large OpenMRS HIV dataset #59