library(dplyr)  # for %>%, filter(), select()

# Row indices of matched records on each side of the match object
indices_approved = approved_matches$matches$inds.a
indices_approved_b = approved_matches$matches$inds.b
message(sprintf("In the match object, there are %d unique indices for A; %d for B",
                length(unique(indices_approved)),
                length(unique(indices_approved_b))))
# The two index sets should be the same, just in a different order
both_ind = intersect(indices_approved, indices_approved_b)
# Rows of approved_only that never appear in the match object
approved_only$row_id = rownames(approved_only)
indices_dropped = setdiff(approved_only$row_id, indices_approved)
# all_of() assumes dedupe_fields is a character vector of column names
View(approved_only %>% filter(row_id %in% indices_dropped) %>% select(all_of(dedupe_fields), EMPLOYER_NAME))
[ ] String distance threshold seems too permissive: deduplication produces false positives driven by exact matches on city/state. Options: (1) match on name only, or (2) set a stricter threshold for name. I think the function lets you both customize the default threshold and specify different thresholds for different variables.
False positive examples: records matched because they share Baggs, WY, but only some of them are truly the same employer
True positive examples:
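Rough sketch of option (2), to make the idea concrete. This is NOT the matching function's own API: it is a hand-rolled check using base R's `adist()`, and the toy records, the `name_similarity()` / `is_match()` helpers, and the 0.70 cutoff are all made up for illustration. With no effective name cutoff, sharing a city/state is enough to match. A stricter name cutoff keeps only the pair whose names are actually close.

```r
# Toy records (made up for illustration): all share the same city/state
toy = data.frame(
  name  = c("Smith Ranch LLC", "Smith Ranch", "Jones Livestock Co"),
  city  = c("BAGGS", "BAGGS", "BAGGS"),
  state = c("WY", "WY", "WY"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: similarity in [0, 1] from base R's Levenshtein distance
name_similarity = function(a, b) {
  1 - adist(a, b)[1, 1] / max(nchar(a), nchar(b))
}

# Hypothetical rule: exact city/state agreement AND name similarity >= cutoff
is_match = function(i, j, data, name_cutoff) {
  data$city[i] == data$city[j] &&
    data$state[i] == data$state[j] &&
    name_similarity(data$name[i], data$name[j]) >= name_cutoff
}

# All candidate pairs: (1,2), (1,3), (2,3)
pairs = t(combn(nrow(toy), 2))

# Cutoff of 0: the name never blocks a match, so city/state alone decides
loose = apply(pairs, 1, function(p) is_match(p[1], p[2], toy, name_cutoff = 0))

# Stricter (arbitrary) cutoff on the name field only
strict = apply(pairs, 1, function(p) is_match(p[1], p[2], toy, name_cutoff = 0.70))

print(cbind(pairs, loose, strict))
```

If the real matching function does accept per-variable thresholds, the same comparison (loose vs. strict on name only) would be the thing to rerun on the Baggs, WY records.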
[ ] Also noticed that CASE_ID isn't uniquely identifying within the H2A data (more rows than case_ids). Look more closely and discuss how we want to approach this. I think it could be a given representative filing on behalf of multiple employers, or something like that?
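A possible first pass at the CASE_ID check, sketched on a made-up stand-in (`h2a_toy` and its values are hypothetical; only the `CASE_ID` and `EMPLOYER_NAME` column names come from the real data). It compares row count to distinct IDs, then separates IDs that merely repeat from IDs that cover more than one employer name, which is what the "representative filing for multiple employers" guess would predict.

```r
# Toy stand-in for the H2A data (made-up values)
h2a_toy = data.frame(
  CASE_ID = c("A-001", "A-001", "A-002", "A-003", "A-003"),
  EMPLOYER_NAME = c("Employer 1", "Employer 2", "Employer 3", "Employer 4", "Employer 4"),
  stringsAsFactors = FALSE
)

# More rows than distinct IDs means CASE_ID is not a unique key
n_rows = nrow(h2a_toy)
n_case_ids = length(unique(h2a_toy$CASE_ID))

# Rows per CASE_ID, and distinct employer names per CASE_ID
rows_per_case = table(h2a_toy$CASE_ID)
employers_per_case = tapply(h2a_toy$EMPLOYER_NAME, h2a_toy$CASE_ID,
                            function(x) length(unique(x)))

# IDs that appear on more than one row
repeated_ids = names(rows_per_case)[rows_per_case > 1]

# IDs whose rows list more than one employer name
multi_employer_ids = names(employers_per_case)[employers_per_case > 1]

print(repeated_ids)
print(multi_employer_ids)
```

IDs in `repeated_ids` but not in `multi_employer_ids` would be same-employer repeats, which probably need a different explanation than the multi-employer ones.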
Deleted from the final script but kept here