re: casedispo, I changed it so that `case_disposition_orig` supersedes `case_disposition_more` when it exists:
```r
library(dplyr)

## case_disposition_orig supersedes case_disposition_more when present
df <- df %>%   # `df` is a placeholder for the TRLA data frame
  mutate(derived_casedispo = case_when(
    !is.na(case_disposition_orig) ~ case_disposition_orig,
    case_disposition_more == "CLO" ~ "Closed",
    case_disposition_more == "PEN" ~ "Pending",
    case_disposition_more == "ACC" ~ "Open",
    TRUE ~ "Unknown dispo"
  ))
```
issues coming up:
I'd filter out true duplicates --- the purpose of the fuzzy-matching deduplication is not removing exact duplicates but grouping different spellings of an employer's name under the same employer (`job_group_id`), e.g., "Red Farm LLC" and "Red Farms LLC" are the same employer if they're in the same state
I think the answer above covers this --- the TRLA deduping as implemented in the fuzzy matching is for similar records rather than exact duplicates
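(Purely to illustrate the grouping idea --- not the scripts' actual code --- here's the kind of string-distance comparison fuzzy matching relies on, using the `stringdist` package; the method choice is an assumption:)

```r
library(stringdist)

## near-identical employer names get a small string distance,
## so they'd be grouped under one job_group_id
stringdist("Red Farm LLC", "Red Farms LLC", method = "jw")
## small Jaro-Winkler distance -> treated as the same employer
```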
I would keep them in, since we're framing the outcome as possible issues rather than confirmed issues
Happy to walk through the logic, but the basic idea (sketched below) is: (1) we group the same employer under a `job_group_id` --- we still want to preserve all of those rows, but for runtime/clarity we choose one arbitrary "focal" job per group to match to an investigation; (2) after a focal job within that employer matches to an investigation, we add the other jobs back in, because they're still in our analytic dataset and should match to the same investigation
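A minimal sketch of that focal-job/add-back logic, assuming hypothetical data frames `jobs` (all postings, with `job_group_id`) and `matches` (one matched investigation per focal group) --- names are illustrative, not the scripts' actual objects:

```r
library(dplyr)

## (1) choose one arbitrary "focal" job per employer group
focal_jobs <- jobs %>%
  group_by(job_group_id) %>%
  slice(1) %>%
  ungroup()

## (2) fuzzy match focal_jobs to investigations, yielding `matches`
##     with one investigation_id per job_group_id (step omitted here)

## (3) add the non-focal jobs back in: every job in a group inherits
##     the investigation its focal job matched to
jobs_with_matches <- jobs %>%
  left_join(matches %>% select(job_group_id, investigation_id),
            by = "job_group_id")
```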
I think if you write out the same-ish data as script 03, I can then load both in the outcomes script and merge them
Ok sounds good! Trying to run it now, and I'm hitting this error at the point where we run the full match:
```
ERROR: cannot coerce class ‘c("fastLink", "matchesLink")’ to a data.frame
Warning: In gammaCKpar(dfA[, varnames[i]], dfB[, varnames[i]], ... :
  There are no partial matches. We suggest either changing the value of cut.p or using gammaCK2par() instead
```
@lizard12995 sorry just read this more closely
for this one:
```
ERROR: cannot coerce class ‘c("fastLink", "matchesLink")’ to a data.frame
```
that suggests an empty object is probably being returned. i'd add print statements to see which step is triggering it; if it arises from something like a data frame with no matches, you can use something like tryCatch to let the loop proceed even when no matches are found for that state: https://stackoverflow.com/questions/8093914/use-trycatch-skip-to-next-value-of-loop-upon-error (a guard like `if (is.null(objname))` might also work)
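A minimal sketch of that pattern, assuming a hypothetical per-state matching function `run_fuzzy_match()` and a `states` vector (both placeholders, not the script's real names):

```r
for (st in states) {
  result <- tryCatch(
    run_fuzzy_match(st),  # hypothetical per-state match step
    error = function(e) {
      message("Skipping state ", st, ": ", conditionMessage(e))
      NULL  # return NULL so the loop can continue
    }
  )
  if (is.null(result)) next  # no matches (or an error) for this state
  ## ... otherwise process `result` for this state ...
}
```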
I narrowed it to just the TRLA states and it went through! Still adjusting parts of the script that are throwing errors, but I should be done by noon tomorrow.
glad that worked! yeah, i think the output could then either be (a) all jobs from just those six states, or (b) all jobs, with non-matches for states outside those six
sounds good on timing! i'll be offline for some family stuff starting at noon tomorrow through Sunday midday, so no huge rush before Sunday
update and to-dos:
[ ] state_formatch --- we are currently matching on employer state, but I am going to ask whether the opponent is listed by the OPPONENT'S address or by where the violation occurred (in which case we'd want to match on worksite state)
[ ] Things are going well, except that when I get to the section where I add job duplicates back in, each job matches the appropriate opponent, but if there are multiple cases against that opponent (different dates or different case numbers), the jobs match different TRLA cases: I'm seeing a job posting's case number duplicated across two rows to match two different TRLA investigations, and TRLA case numbers duplicated across two rows to match two different job postings (see the join sketch below this list). Is that what we want?
[x] copy a version of this script --- https://github.com/rebeccajohnson88/qss20_s21_proj/blob/main/code/03_fuzzy_matching.R --- and name it something like `13_fuzzymatching_TRLA.R`
[x] work on edits to that script where, in the places where it uses the investigations data, you substitute in the TRLA data --- we can touch base to go over the general script logic either before you dig into it or once you've started making modifications; the basic logic (see the fastLink sketch below this list) is: (1) dedupe both datasets (for jobs, you can just read in the ones with the `job_group_id`s added; for TRLA, maybe skip this step), (2) fuzzy match between the two, (3) add back the rows removed during deduplication
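On the basic logic in the last item, a minimal sketch of the fuzzy-match step (2) using fastLink (the package the error above comes from); the data frame and column names are illustrative assumptions, not the scripts' actual names:

```r
library(fastLink)

## match deduped job postings (dfA) to deduped TRLA cases (dfB)
fl_out <- fastLink(
  dfA = jobs_deduped,
  dfB = trla_deduped,
  varnames         = c("employer_name", "city"),
  stringdist.match = c("employer_name", "city"),
  partial.match    = c("employer_name")
)

## recover the matched row pairs from the fastLink output
matched <- getMatches(dfA = jobs_deduped, dfB = trla_deduped,
                      fl.out = fl_out)
```

And on the duplication question above it, a toy illustration (made-up values) of why adding duplicates back in multiplies rows when one opponent has several TRLA cases --- a join on the employer/group id is many-to-many, so each job row is repeated once per case:

```r
library(dplyr)

jobs  <- tibble(job_group_id = "A", job_case_number = "H-300-001")
cases <- tibble(job_group_id = "A",
                trla_case    = c("TRLA-001", "TRLA-002"))

## the single job row comes back twice, once per TRLA case
left_join(jobs, cases, by = "job_group_id")
```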
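(The two sketches above use placeholder objects --- `jobs_deduped`, `trla_deduped`, and the toy `jobs`/`cases` tibbles --- so they show the shape of the steps rather than the scripts' exact code.)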