match row number discrepancy diagnostic code for reference

rebeccajohnson88 / qss20_s21_proj

Repo for DOL Summer Data Challenge on equity in H-2A oversight

Creative Commons Zero v1.0 Universal


    indices_approved = approved_matches$matches$inds.a
    indices_approved_b = approved_matches$matches$inds.b
    sprintf("In the match object, there are %s unique indices for A; %s for B",
            length(unique(indices_approved)),
            length(unique(indices_approved_b)))
    both_ind = intersect(indices_approved, indices_approved_b) # see that they're the same just in diff order
    indices_dropped = setdiff(rownames(approved_only), indices_approved)
    approved_only$row_id = rownames(approved_only)
    View(approved_only %>%
           filter(row_id %in% indices_dropped) %>%
           select(dedupe_fields, EMPLOYER_NAME))

A couple things to discuss:

[ ] String distance threshold seems too high, and deduplication results in false positives based on exact matches within city/state. Options: (1) dedupe using name only, or (2) specify a higher distance threshold for name, since I think the function lets you both customize the threshold and specify different thresholds for different variables.
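To make the per-variable threshold idea concrete, here is a minimal base-R sketch. It is not the matching function's actual API: the toy data, `name_similarity`, `is_match`, and the `name_threshold` values are all hypothetical, and the similarity is a simple normalized Levenshtein score from `utils::adist` rather than whatever metric the matcher uses. It only illustrates how requiring exact city/state plus a stricter name cutoff changes which pairs count as duplicates.

```r
# Hypothetical toy data; names and values are made up for illustration.
employers <- data.frame(
  name  = c("Smith Ranch LLC", "Smith Ranch", "Jones Livestock Co"),
  city  = c("BAGGS", "BAGGS", "BAGGS"),
  state = c("WY", "WY", "WY"),
  stringsAsFactors = FALSE
)

# Similarity in [0, 1]: 1 minus Levenshtein distance over the longer string's length.
name_similarity <- function(a, b) {
  1 - utils::adist(a, b)[1, 1] / max(nchar(a), nchar(b))
}

# Per-field rule (an assumption, not the matcher's real logic):
# city and state must match exactly, and name must clear its own threshold.
is_match <- function(i, j, df, name_threshold) {
  same_place <- df$city[i] == df$city[j] && df$state[i] == df$state[j]
  same_place && name_similarity(df$name[i], df$name[j]) >= name_threshold
}

# All unordered pairs of rows: (1,2), (1,3), (2,3).
pairs <- t(combn(nrow(employers), 2))

# Compare a looser and a stricter cutoff on the name field.
matched_loose  <- apply(pairs, 1, function(p) is_match(p[1], p[2], employers, 0.7))
matched_strict <- apply(pairs, 1, function(p) is_match(p[1], p[2], employers, 0.9))

data.frame(a = pairs[, 1], b = pairs[, 2], matched_loose, matched_strict)
```

Comparing the two columns for the real Baggs, WY rows would show whether a stricter name cutoff alone removes the false positives or whether city/state should be dropped from the dedupe fields.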

False positive examples: matches due to Baggs, WY, but only some are truly the same

True positive examples:

[ ] Noticed also that CASE_ID isn't uniquely identifying within the H2A data (more rows than case_ids). Look more closely and discuss how we want to approach this. I think it could be a given representative filing on behalf of multiple employers or something like that?
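A quick way to look more closely, sketched in base R on made-up data: the `h2a` data frame below is a hypothetical stand-in for the H-2A data, and only the `CASE_ID` and `EMPLOYER_NAME` column names come from this issue. It counts rows against distinct IDs, lists the repeated IDs, and checks how many distinct employers sit under each one.

```r
# Hypothetical toy data standing in for the H-2A data; values are made up.
h2a <- data.frame(
  CASE_ID       = c("A-001", "A-001", "A-002", "A-003", "A-003"),
  EMPLOYER_NAME = c("Employer 1", "Employer 2", "Employer 3",
                    "Employer 4", "Employer 4"),
  stringsAsFactors = FALSE
)

# Row count vs. number of distinct CASE_IDs.
n_rows <- nrow(h2a)
n_ids  <- length(unique(h2a$CASE_ID))

# CASE_IDs that appear on more than one row.
id_counts    <- table(h2a$CASE_ID)
repeated_ids <- names(id_counts[id_counts > 1])

# For each repeated CASE_ID, how many distinct employers does it cover?
repeated_rows    <- h2a[h2a$CASE_ID %in% repeated_ids, ]
employers_per_id <- tapply(repeated_rows$EMPLOYER_NAME,
                           repeated_rows$CASE_ID,
                           function(x) length(unique(x)))

sprintf("%s rows, %s distinct CASE_IDs, %s repeated", n_rows, n_ids, length(repeated_ids))
employers_per_id
```

If most repeated IDs cover more than one employer, that would be consistent with the multiple-employer filing guess; repeated IDs with a single employer would point to plain duplicate rows instead.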


match row number discrepancy diagnostic code for reference #16