re: casedispo, I changed it so that `case_disposition_orig` supersedes `case_disposition_more` when it exists:
```r
library(dplyr)

## case_disposition_orig supersedes case_disposition_more when present
df <- df %>%   # `df` is a placeholder for the TRLA data frame
  mutate(derived_casedispo = case_when(
    !is.na(case_disposition_orig) ~ case_disposition_orig,
    case_disposition_more == "CLO" ~ "Closed",
    case_disposition_more == "PEN" ~ "Pending",
    case_disposition_more == "ACC" ~ "Open",
    TRUE ~ "Unknown dispo"
  ))
```
issues coming up:
I'd filter out true duplicates --- the purpose of the fuzzy-matching deduplication is not removing exact duplicates but grouping different spellings of an employer's name under the same employer (`job_group_id`), e.g., "Red Farm LLC" and "Red Farms LLC" are the same employer if they're in the same state
I think the answer above covers this --- the TRLA deduping as implemented in the fuzzy matching is for similar records rather than exact duplicates
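(Purely to illustrate the grouping idea --- not the scripts' actual code --- here's the kind of string-distance comparison fuzzy matching relies on, using the `stringdist` package; the method choice is an assumption:)

```r
library(stringdist)

## near-identical employer names get a small string distance,
## so they'd be grouped under one job_group_id
stringdist("Red Farm LLC", "Red Farms LLC", method = "jw")
## small Jaro-Winkler distance -> treated as the same employer
```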
I would keep them in, since we're framing the outcome as possible issues rather than confirmed issues
Happy to walk through the logic, but the basic idea (sketched below) is: (1) we group the same employer under a `job_group_id` --- we still want to preserve all of those rows, but for runtime/clarity we choose one arbitrary "focal" job per group to match to an investigation; (2) after a focal job within that employer matches to an investigation, we add the other jobs back in, because they're still in our analytic dataset and should match to the same investigation
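A minimal sketch of that focal-job/add-back logic, assuming hypothetical data frames `jobs` (all postings, with `job_group_id`) and `matches` (one matched investigation per focal group) --- names are illustrative, not the scripts' actual objects:

```r
library(dplyr)

## (1) choose one arbitrary "focal" job per employer group
focal_jobs <- jobs %>%
  group_by(job_group_id) %>%
  slice(1) %>%
  ungroup()

## (2) fuzzy match focal_jobs to investigations, yielding `matches`
##     with one investigation_id per job_group_id (step omitted here)

## (3) add the non-focal jobs back in: every job in a group inherits
##     the investigation its focal job matched to
jobs_with_matches <- jobs %>%
  left_join(matches %>% select(job_group_id, investigation_id),
            by = "job_group_id")
```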
I think if you write out the same-ish data as script 03, I can then load both in the outcomes script and merge them
Ok sounds good! Trying to run it now, and I'm hitting this error at the point where we run the full match:
```
ERROR: cannot coerce class ‘c("fastLink", "matchesLink")’ to a data.frame
Warning: In gammaCKpar(dfA[, varnames[i]], dfB[, varnames[i]], ... :
  There are no partial matches. We suggest either changing the value of cut.p or using gammaCK2par() instead
```
@lizard12995 sorry just read this more closely
for this one:
```
ERROR: cannot coerce class ‘c("fastLink", "matchesLink")’ to a data.frame
```
that suggests an empty object is probably being returned. i'd add print statements to see which step is triggering it; if it arises from something like a data frame with no matches, you can use something like tryCatch to let the loop proceed even when no matches are found for that state: https://stackoverflow.com/questions/8093914/use-trycatch-skip-to-next-value-of-loop-upon-error (a guard like `if (is.null(objname))` might also work)
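A minimal sketch of that pattern, assuming a hypothetical per-state matching function `run_fuzzy_match()` and a `states` vector (both placeholders, not the script's real names):

```r
for (st in states) {
  result <- tryCatch(
    run_fuzzy_match(st),  # hypothetical per-state match step
    error = function(e) {
      message("Skipping state ", st, ": ", conditionMessage(e))
      NULL  # return NULL so the loop can continue
    }
  )
  if (is.null(result)) next  # no matches (or an error) for this state
  ## ... otherwise process `result` for this state ...
}
```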
I narrowed it to just the TRLA states and it went through! Still adjusting parts of the script that are throwing errors, but I should be done by noon tomorrow.
glad that worked! yeah, i think the output could then either be (a) all jobs from just those six states, or (b) all jobs, with non-matches for states outside those six
sounds good on timing! i'll be offline for some family stuff starting at noon tomorrow through Sunday midday, so no huge rush before Sunday
update and to-dos:
[ ] state_formatch --- we are currently matching on employer state, but I am going to ask whether the opponent is listed by the OPPONENT'S address or by where the violation occurred (in which case we'd want to match on worksite state)
[ ] Things are going well, except that when I get to the section where I add job duplicates back in, each job matches the appropriate opponent, but if there are multiple cases against that opponent (different dates or different case numbers), the jobs match different TRLA cases: I'm seeing a job posting's case number duplicated across two rows to match two different TRLA investigations, and TRLA case numbers duplicated across two rows to match two different job postings (see the join sketch below this list). Is that what we want?
[x] copy a version of this script --- https://github.com/rebeccajohnson88/qss20_s21_proj/blob/main/code/03_fuzzy_matching.R --- and name it something like `13_fuzzymatching_TRLA.R`
[x] work on edits to that script where, in the places where it uses the investigations data, you substitute in the TRLA data --- we can touch base to go over the general script logic either before you dig into it or once you've started making modifications; the basic logic (see the fastLink sketch below this list) is: (1) dedupe both datasets (for jobs, you can just read in the ones with the `job_group_id`s added; for TRLA, maybe skip this step), (2) fuzzy match between the two, (3) add back the rows removed during deduplication
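On the basic logic in the last item, a minimal sketch of the fuzzy-match step (2) using fastLink (the package the error above comes from); the data frame and column names are illustrative assumptions, not the scripts' actual names:

```r
library(fastLink)

## match deduped job postings (dfA) to deduped TRLA cases (dfB)
fl_out <- fastLink(
  dfA = jobs_deduped,
  dfB = trla_deduped,
  varnames         = c("employer_name", "city"),
  stringdist.match = c("employer_name", "city"),
  partial.match    = c("employer_name")
)

## recover the matched row pairs from the fastLink output
matched <- getMatches(dfA = jobs_deduped, dfB = trla_deduped,
                      fl.out = fl_out)
```

And on the duplication question above it, a toy illustration (made-up values) of why adding duplicates back in multiplies rows when one opponent has several TRLA cases --- a join on the employer/group id is many-to-many, so each job row is repeated once per case:

```r
library(dplyr)

jobs  <- tibble(job_group_id = "A", job_case_number = "H-300-001")
cases <- tibble(job_group_id = "A",
                trla_case    = c("TRLA-001", "TRLA-002"))

## the single job row comes back twice, once per TRLA case
left_join(jobs, cases, by = "job_group_id")
```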
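(The two sketches above use placeholder objects --- `jobs_deduped`, `trla_deduped`, and the toy `jobs`/`cases` tibbles --- so they show the shape of the steps rather than the scripts' exact code.)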