Closed jecihjoy closed 8 months ago
Thanks, @jecihjoy for documenting this!
Please find attached monthly summary statistics for key metrics in the past 5 years (2015-2020). To improve granularity, I have provided monthly instead of yearly summaries so that when we generate the synthetic dataset it can reflect this distribution and pattern.
encounters_monthly_aggregates.xlsx
Here is the SQL query used to generate (FYI this is just a draft and not final):
SELECT
date_format(encounter_datetime,'%m-%Y') as month_year,
MONTH(encounter_datetime) as month,
YEAR(encounter_datetime) as year,
COUNT(DISTINCT `encounter_id`) AS distinct_encounters,
COUNT(DISTINCT `visit_id`) AS distinct_visits,
COUNT(DISTINCT if(prev_clinical_datetime_hiv is null,`person_id`,null)) AS newly_enrolled,
COUNT(DISTINCT if(encounter_datetime >= arv_start_date,`person_id`,null)) as patients_on_arvs,
COUNT(DISTINCT if((encounter_datetime < arv_start_date or arv_start_date is null),`person_id`,null)) as patients_not_on_arvs,
COUNT(DISTINCT if(vl_1 < 1000, `encounter_id` ,null)) as vl_suppressed_encounters,
COUNT( DISTINCT if(vl_1 >= 1000, `encounter_id`,null)) as vl_failure_encouunters,
COUNT( DISTINCT if( timestampdiff(day, if(rtc_date, rtc_date, DATE_ADD(encounter_datetime, INTERVAL 30 DAY)), endDate) > 90, `person_id`, null)) as LTFU
FROM
etl.dates
JOIN etl.flat_hiv_summary_v15b on date(encounter_datetime) <= date(endDate)
WHERE is_clinical_encounter = 1 and (next_clinical_datetime_hiv is null or date(next_clinical_datetime_hiv) > endDate) and
encounter_datetime >= timestamp(date("2015-01-01")) #and encounter_datetime<= date('2020-03-01')
group by MONTH(encounter_datetime), YEAR(encounter_datetime)
order by encounter_datetime desc
@kimaina Thanks!
For generating VL values, from what I've understood, AMPATH pretty much consistently has actual VL values reported in the resulting Obs, but this isn't the case for all implementations. See this comment by Mike Seaton and just my general overview of the landscape I reviewed. So the question is: do we want to simulate data where we always have a numeric VL result or should we include some number of these non-numeric results? Opinions?
So the question is: do we want to simulate data where we always have a numeric VL result or should we include some number of these non-numeric results? Opinions?
I think we should simulate both quantitative and qualitative VL results since the majority of implementers have both. From my experience working on this, we normally have 2 different concepts representing either the quantitative value or the VL qualifier (LDL or <300).
I'm surprised but I supposed it's possible that people are recording this categorically. Does anyone have an example of a ciel concept that people are using to record this?
On Wed, Nov 11, 2020 at 9:35 AM Allan Kimaina notifications@github.com wrote:
So the question is: do we want to simulate data where we always have a numeric VL result or should we include some number of these non-numeric results? Opinions?
I think we should simulate both quantitative and qualitative VL results since the majority of implementers have both. From my experience working on this, we normally have 2 different concepts representing either the quantitative value or the VL qualifier (LDL or <300).
— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/GoogleCloudPlatform/openmrs-fhir-analytics/issues/59#issuecomment-725559089, or unsubscribe https://github.com/notifications/unsubscribe-auth/AABCFHVNIOXXSSV2EWSDNSLSPLDM7ANCNFSM4TQ6GYPQ .
-- Jonathan Dick, MD Chief Medical Information Officer, AMPATH Assistant Professor of Clinical Medicine, IU School of Medicine Research Scientist, Regenstrief Institute Inc.
@jonathandick CIEL:1305 (qualitative VL) is the base concept for qualitative VLs and at least CIEL:1302 (Not detected) and CIEL:1306 (Beyond detectable limit) are used to record suppressed VLs in KenyaEMR and UgandaEMR respectively. No idea how PIH are representing the "< 300" or "< 500" part. Theoretically, OpenMRS has a ComplexNumeric datatype to handle those, but AFAIK it's never actually been used.
@kimaina Could you get us an estimate of the # encounters and visits / year (or month) for suppressed versus non-suppressed patients? JJ had indicated that VL non-suppressed patients will be seen more regularly and I'd like a way to estimate that, if possible. Thanks!
@kimaina Could you get us an estimate of the # encounters and visits / year (or month) for suppressed versus non-suppressed patients? JJ had indicated that VL non-suppressed patients will be seen more regularly and I'd like a way to estimate that, if possible. Thanks!
Hey @ibacher,I have provided these 4 estimates in the above excel --> distinct_encounters, distinct_visits, vl_suppressed_encounters, and vl_failure_encounters. I'll rename the above columns to be more understandable, in the next version
@ibacher do you prefer monthly aggregates or yearly aggregates?
@kimaina Ah! I see... then do I have a way of estimating the percent of patients suppressed vs patients unsuppressed?
do you prefer monthly aggregates or yearly aggregates?
Monthly is generally more useful, I just wasn't sure if a single patient was likely to have multiple visits per month or not.
I'll also add-in
@kimaina Ah! I see... then do I have a way of estimating the percent of patients suppressed vs patients unsuppressed?
Monthly is generally more useful, I just wasn't sure if a single patient was likely to have multiple visits per month or not.
We can also do yearly + monthly aggregates
I have updated the query to include a few more metrics as well as some of @jonathandick and @ibacher suggestions e.g :
This is still a draft and I still need to add @ibacher suggestions on
SELECT
date_format(endDate,'%m-%Y') as month_year,
# Visits
SUM(hmsd.visit_this_month) AS `visit_this_month`,
SUM(hmsd.scheduled_this_month) AS `scheduled_this_month`,
SUM(hmsd.unscheduled_this_month) AS `unscheduled_this_month`,
SUM(hmsd.enrolled_this_month) AS `enrolled_this_month`,
# LTFU and active in care
COUNT( DISTINCT `person_id` ) as distinct_patients,
SUM(hmsd.active_in_care_this_month) AS `active_in_care_this_month`,
AVG(hmsd.days_since_rtc_date) AS `avg_days_since_rtc_date`,
COUNT( DISTINCT if( timestampdiff(day, if(rtc_date, rtc_date, DATE_ADD(encounter_date, INTERVAL 30 DAY)), endDate) > 90, `person_id`, null)) as LTFU_since_2015,
COUNT( DISTINCT if( timestampdiff(day, if(rtc_date, rtc_date, DATE_ADD(encounter_date, INTERVAL 30 DAY)), endDate) > 90 AND TIMESTAMPDIFF(DAY, encounter_date,`endDate`) <365, `person_id`, null)) as LTFU_past_year,
COUNT( DISTINCT if( timestampdiff(day, if(rtc_date, rtc_date, DATE_ADD(encounter_date, INTERVAL 30 DAY)), endDate) > 90 AND TIMESTAMPDIFF(DAY, encounter_date,`endDate`) <365/2, `person_id`, null)) as LTFU_past_6months,
COUNT( DISTINCT if( days_since_rtc_date > 90, `person_id`, null)) as LTFU_this_month,
COUNT( DISTINCT if(TIMESTAMPDIFF(DAY, death_date,`endDate`) < 30,`person_id`, null)) as deaths_this_month,
COUNT( DISTINCT if(TIMESTAMPDIFF(DAY, death_date,`endDate`) < 365,`person_id`, null)) as deaths_past_year,
COUNT( DISTINCT if(TIMESTAMPDIFF(DAY, transfer_out_date,`endDate`) < 30,`person_id`, null)) as transfer_out_this_month,
COUNT( DISTINCT if(TIMESTAMPDIFF(DAY, transfer_out_date,`endDate`) < 365,`person_id`, null)) as transfer_out_past_year,
# ART Status
SUM(hmsd.art_revisit_this_month) AS `art_revisit_this_month`,
SUM(hmsd.is_pre_art_this_month) AS `pre_art_this_month`,
SUM(hmsd.on_art_this_month) AS `on_art_this_month`,
SUM(hmsd.started_art_this_month) AS `started_art_this_month`,
SUM(had_med_change_this_month) AS `had_med_change_this_month`,
# ART Line
SUM(hmsd.on_original_first_line_this_month) AS `on_original_first_line_this_month`,
SUM(hmsd.on_alt_first_line_this_month) AS `on_alt_first_line_this_month`,
SUM(hmsd.on_second_line_or_higher_this_month) AS `on_second_line_or_higher_this_month`,
# Viral Load
COUNT( DISTINCT if(vl_1 < 1000 AND TIMESTAMPDIFF(DAY, vl_1_date,`endDate`) < 30, `encounter_id`, null)) as vl_suppressed_encounters_this_month,
COUNT( DISTINCT if(vl_1 >= 1000 AND TIMESTAMPDIFF(DAY, vl_1_date,`endDate`) < 30, `encounter_id`, null)) as vl_unsuppressed_encounters_this_month,
COUNT( DISTINCT if(vl_1 < 1000 AND TIMESTAMPDIFF(DAY, vl_1_date,`endDate`) < 30, `person_id`, null)) as vl_suppressed_patients_this_month,
COUNT( DISTINCT if(vl_1 >= 1000 AND TIMESTAMPDIFF(DAY, vl_1_date,`endDate`) < 30, `person_id`, null)) as vl_unsuppressed_patients_this_month,
COUNT( DISTINCT if(vl_1 < 1000 AND TIMESTAMPDIFF(DAY, vl_1_date,`endDate`) < 365, `person_id`, null)) as vl_suppressed_patients_past_year,
COUNT( DISTINCT if(vl_1 >= 1000 AND TIMESTAMPDIFF(DAY, vl_1_date,`endDate`) <365, `person_id`, null)) as vl_unsuppressed_patients_past_year,
COUNT( DISTINCT if(vl_1 < 1000 AND TIMESTAMPDIFF(DAY, vl_1_date,`endDate`) < 365, `encounter_id`, null)) as vl_suppressed_encounters_past_year,
COUNT( DISTINCT if(vl_1 >= 1000 AND TIMESTAMPDIFF(DAY, vl_1_date,`endDate`) < 365, `encounter_id`, null)) as vl_unsuppressed_encounters_past_year,
SUM(hmsd.due_for_vl_this_month) AS `due_for_vl_this_month`,
SUM(hmsd.qualifies_for_follow_up_vl) AS `qualifies_for_follow_up_vl`,
SUM(hmsd.got_follow_up_vl_this_month) AS `got_follow_up_vl_this_month`,
SUM(hmsd.follow_up_vl_suppressed_this_month) AS `follow_up_vl_suppressed_this_month`,
SUM(hmsd.follow_up_vl_unsuppressed_this_month) AS `follow_up_vl_unsuppressed_this_month`,
SUM(hmsd.follow_up_vl_suppressed_this_month) AS `follow_up_vl_unsuppressed`,
SUM(hmsd.follow_up_vl_suppressed_this_month) AS `follow_up_vl_suppressed`,
AVG(hmsd.num_days_to_follow_vl) AS `avg_num_days_to_follow_vl`
FROM
etl.hiv_monthly_report_dataset_frozen `hmsd`
WHERE
(endDate >= '2015-01-01')
GROUP BY endDate
order by endDate desc
here is a sample xlxs 2020_monthly_aggregates.xlsx
@kimaina Thanks for the update on this! Just to ensure this is properly documented, I'm creating a little simulator for this data over in this repo: https://github.com/ibacher/hiv-test-data.
Is it possible to run the numbers over the same 5-ish year period? From the previous data set, it looked like June 2020 - November 2020 was a bit of an outlier period (I'm guessing it's related to the current pandemic).
hey, @ibacher I have completed generating the necessary data elements for generating the required metrics for the simulation. Quick question, do you prefer proportions or percentages?
https://github.com/kimaina/vl-indicators/blob/main/2020_monthly_aggregates_v2.csv
I’ve been converting everything into percentages, apart from raw counts, but whatever is easiest!
you mean proportions, right! I'm happy to provide proportions and counts
I have completed creating the query with the recommendations, I'll be running it during off-peak to generate the dataset then get back to you
Hi @ibacher
Please see the PR: https://github.com/ibacher/hiv-test-data/pulls
You will find:
proportions.csv
containing key proportions script.Rmd
is the R script used to generate portions2020_monthly_aggregates_v2.csv
raw dataset containing over 55 indicators@kimaina Thanks for the PRs and all the data!
This is handled in #179.
@kimaina to calculate these summary statistics to help in the generation of data