pepfar-datim / datapackr


Flawed logic in dedupe resolution. #157

Closed jason-p-pickering closed 4 years ago

jason-p-pickering commented 4 years ago

There was an issue in the following lines of code:

# Flawed: grouping by support_type sums DSD and TA allocations separately,
# so a DSD/TA crosswalk can never be detected by this check.
dplyr::group_by(PSNU, psnuid, indicator_code, Age, Sex, KeyPop, support_type) %>%
    dplyr::summarize(distribution = sum(distribution)) %>%
    dplyr::mutate(distribution_diff = abs(distribution - 1.0)) %>%
    dplyr::filter(distribution_diff >= 1e-3 & distribution != 1.0)

Because the data was grouped by support_type before summing, DSD and TA allocations were totaled separately and crosswalk dedupes could never be flagged. This was a sloppy copy and paste from the pure dedupe section.

The correct way to identify dedupes is to count components: for pure duplication, count rows of the same support type (DSD/DSD or TA/TA) for a given data element disagg; for crosswalks, check whether both DSD and TA target the same data element disagg. At the identification phase there is no need to worry about what the allocation actually is. It's better to count how many potential data element/disaggs overlap, and only then filter for the 100% allocations.
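A minimal sketch of that identification logic, assuming a long-format data frame d with one row per mechanism allocation (the toy data and the names n_components / n_support_types are illustrative, not from the package):

library(dplyr)

# Toy data for illustration only.
d <- tribble(
  ~PSNU, ~psnuid, ~indicator_code, ~Age,    ~Sex, ~KeyPop, ~support_type, ~distribution,
  "A",   "a1",    "HTS_TST",       "15-24", "F",  NA,      "DSD",         0.6,
  "A",   "a1",    "HTS_TST",       "15-24", "F",  NA,      "TA",          0.4
)

# Pure dedupes: more than one mechanism of the SAME support type
# (DSD/DSD or TA/TA) for a given data element disagg.
pure_dupes <- d %>%
  group_by(PSNU, psnuid, indicator_code, Age, Sex, KeyPop, support_type) %>%
  summarize(n_components = n(), .groups = "drop") %>%
  filter(n_components > 1)

# Crosswalk dedupes: both DSD and TA present for the same disagg.
# Note that support_type is deliberately NOT a grouping variable here.
crosswalk_dupes <- d %>%
  group_by(PSNU, psnuid, indicator_code, Age, Sex, KeyPop) %>%
  summarize(n_support_types = n_distinct(support_type), .groups = "drop") %>%
  filter(n_support_types > 1)

# Only after identification would one filter for the 100% allocations.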

jason-p-pickering commented 4 years ago

Resolved in bbc5e65dad049aa83909dd2c888d822d0cc27aa4