pepfar-datim / datapackr


Flawed logic in dedupe resolution. #157

Closed jason-p-pickering closed 4 years ago

jason-p-pickering commented 4 years ago

There was an issue in the following lines of code:

# Flawed: grouping by support_type sums DSD and TA allocations separately,
# so a DSD/TA crosswalk can never be detected by this check.
dplyr::group_by(PSNU, psnuid, indicator_code, Age, Sex, KeyPop, support_type) %>%
    dplyr::summarize(distribution = sum(distribution)) %>%
    dplyr::mutate(distribution_diff = abs(distribution - 1.0)) %>%
    dplyr::filter(distribution_diff >= 1e-3 & distribution != 1.0)

Because the data was grouped by support_type before summing, DSD and TA allocations were totaled separately and crosswalk dedupes could never be flagged. This was a sloppy copy and paste from the pure dedupe section.

The correct way to identify dedupes is to count components: for pure duplication, count rows of the same support type (DSD/DSD or TA/TA) for a given data element disagg; for crosswalks, check whether both DSD and TA target the same data element disagg. At the identification phase there is no need to worry about what the allocation actually is. It's better to count how many potential data element/disaggs overlap, and only then filter for the 100% allocations.
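A minimal sketch of that identification logic, assuming a long-format data frame d with one row per mechanism allocation (the toy data and the names n_components / n_support_types are illustrative, not from the package):

library(dplyr)

# Toy data for illustration only.
d <- tribble(
  ~PSNU, ~psnuid, ~indicator_code, ~Age,    ~Sex, ~KeyPop, ~support_type, ~distribution,
  "A",   "a1",    "HTS_TST",       "15-24", "F",  NA,      "DSD",         0.6,
  "A",   "a1",    "HTS_TST",       "15-24", "F",  NA,      "TA",          0.4
)

# Pure dedupes: more than one mechanism of the SAME support type
# (DSD/DSD or TA/TA) for a given data element disagg.
pure_dupes <- d %>%
  group_by(PSNU, psnuid, indicator_code, Age, Sex, KeyPop, support_type) %>%
  summarize(n_components = n(), .groups = "drop") %>%
  filter(n_components > 1)

# Crosswalk dedupes: both DSD and TA present for the same disagg.
# Note that support_type is deliberately NOT a grouping variable here.
crosswalk_dupes <- d %>%
  group_by(PSNU, psnuid, indicator_code, Age, Sex, KeyPop) %>%
  summarize(n_support_types = n_distinct(support_type), .groups = "drop") %>%
  filter(n_support_types > 1)

# Only after identification would one filter for the 100% allocations.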

jason-p-pickering commented 4 years ago

Resolved in bbc5e65dad049aa83909dd2c888d822d0cc27aa4