rebeccajohnson88 / qss20_s21_proj

Repo for DOL Summer Data Challenge on equity in H-2A oversight
Creative Commons Zero v1.0 Universal

next steps on TRLA #23

Closed rebeccajohnson88 closed 3 years ago

rebeccajohnson88 commented 3 years ago
derived_casedispo = case_when(case_disposition_orig == "Closed" |
                                case_disposition_more == "CLO" ~ "Closed",
                              case_disposition_orig == "Pending" |
                                case_disposition_more == "PEN" ~ "Pending",
                              case_disposition_orig == "Open" |
                                case_disposition_more == "ACC" ~ "Open/accepted",
                              TRUE ~ "Unknown dispo")
lizard12995 commented 3 years ago

Re: casedispo, I changed it so that case_disposition_orig supersedes case_disposition_more whenever case_disposition_orig is non-missing:

derived_casedispo = case_when(!is.na(case_disposition_orig) ~ case_disposition_orig,
                              case_disposition_more == "CLO" ~ "Closed",
                              case_disposition_more == "PEN" ~ "Pending",
                              case_disposition_more == "ACC" ~ "Open",
                              TRUE ~ "Unknown dispo"
                              )
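
To see how the two versions differ, here is a minimal sketch on made-up rows (the data values are hypothetical; only the column names and codes come from the snippets in this thread). It assumes dplyr is installed. The row of interest is the last one, where the two source columns disagree.

```r
library(dplyr)

# Toy rows, invented for illustration only
cases <- tibble(
  case_disposition_orig = c("Closed", NA, NA, NA, "Pending"),
  case_disposition_more = c("PEN", "CLO", "ACC", NA, "CLO")
)

cases <- cases %>%
  mutate(
    # Version 1: either column can trigger a branch; the first TRUE branch wins
    dispo_v1 = case_when(
      case_disposition_orig == "Closed" | case_disposition_more == "CLO" ~ "Closed",
      case_disposition_orig == "Pending" | case_disposition_more == "PEN" ~ "Pending",
      case_disposition_orig == "Open" | case_disposition_more == "ACC" ~ "Open/accepted",
      TRUE ~ "Unknown dispo"
    ),
    # Version 2: case_disposition_orig wins whenever it is non-missing
    dispo_v2 = case_when(
      !is.na(case_disposition_orig) ~ case_disposition_orig,
      case_disposition_more == "CLO" ~ "Closed",
      case_disposition_more == "PEN" ~ "Pending",
      case_disposition_more == "ACC" ~ "Open",
      TRUE ~ "Unknown dispo"
    )
  )

print(cases)
```

In the last row (orig "Pending", more "CLO"), version 1 yields "Closed" because the first branch is satisfied by the second column, while version 2 yields "Pending". Note also that version 2 labels "ACC" as "Open" rather than "Open/accepted".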
lizard12995 commented 3 years ago

issues coming up:

I'd filter out true duplicates first. The purpose of the fuzzy-matching deduplication is not to catch exact duplicates but to group different spellings of an employer name into the same employer (job_group_id); e.g., "Red Farm LLC" and "Red Farms LLC" are treated as the same employer if they are in the same state.

I think the answer above covers this one too: the TRLA deduping as implemented in the fuzzy matching targets similar records rather than exact duplicates.
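
A rough sketch of that distinction, using base R only. This is not the repo's fastLink-based approach: it swaps in plain edit distance (`adist`) plus single-linkage clustering as a stand-in, and the data, the `assign_groups` helper, and the `max_dist = 2` threshold are all assumptions made for illustration.

```r
# Toy employer records; names and states are made up
jobs <- data.frame(
  employer = c("Red Farm LLC", "Red Farms LLC", "Red Farm LLC",
               "Blue Orchard Inc", "Red Farm LLC"),
  state = c("TX", "TX", "TX", "TX", "KY"),
  stringsAsFactors = FALSE
)

# Step 1: drop true (exact) duplicates before any fuzzy step
jobs <- unique(jobs)

# Step 2: within a state, group names whose edit distance is small.
# Hypothetical helper; max_dist is an arbitrary illustrative threshold.
assign_groups <- function(employer_names, max_dist = 2) {
  if (length(employer_names) == 1) return(1L)
  d <- as.dist(adist(toupper(employer_names)))
  cutree(hclust(d, method = "single"), h = max_dist)
}

jobs$job_group_id <- NA_character_
for (st in unique(jobs$state)) {
  idx <- jobs$state == st
  jobs$job_group_id[idx] <- paste(st, assign_groups(jobs$employer[idx]), sep = "_")
}

print(jobs)
```

Here the exact duplicate "Red Farm LLC" / TX row is dropped in step 1, the two TX spellings share one job_group_id in step 2, and the KY "Red Farm LLC" stays in its own group because grouping is done within state.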

I would keep them in, since we're framing the outcome as possible issues, not confirmed issues.

Happy to walk through the logic, but the basic idea is: (1) we group the same employer under one job_group_id; we still want to preserve all of these rows, but for runtime and clarity we choose an arbitrary one to match to an investigation; (2) after a focal job within that employer matches to an investigation, we add the other jobs back in, because they're still in our analytic dataset and should match to the same investigation.
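
The two steps can be sketched in base R roughly as follows. The data, the `job_id` and `investigation_id` columns, and the hard-coded `focal_matches` table (a stand-in for whatever the real matching step returns) are all hypothetical.

```r
# Toy jobs already labeled with a job_group_id (values are made up)
jobs <- data.frame(
  job_id = c("j1", "j2", "j3", "j4"),
  job_group_id = c("TX_1", "TX_1", "TX_2", "KY_1"),
  stringsAsFactors = FALSE
)

# (1) Keep one arbitrary focal job per employer group (here: the first row)
focal <- jobs[!duplicated(jobs$job_group_id), ]

# Stand-in for the matching step: pretend only the TX_1 focal job matched
focal_matches <- data.frame(
  job_group_id = "TX_1",
  investigation_id = "inv_9",
  stringsAsFactors = FALSE
)

# (2) Add the non-focal jobs back in by joining on job_group_id,
# so every job in a matched group inherits the same investigation
jobs_matched <- merge(jobs, focal_matches, by = "job_group_id", all.x = TRUE)

print(jobs_matched)
```

All four jobs survive the join; both TX_1 jobs carry the same investigation, and unmatched groups get NA.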

I think if you write out roughly the same data as script 03, I can then load both in the outcomes script and merge them at that point.

lizard12995 commented 3 years ago

Ok sounds good! Trying to run it now, and hitting this error at the point where we run full match:

ERROR: cannot coerce class ‘c("fastLink", "matchesLink")’ to a data.frame
Warning: In gammaCKpar(dfA[, varnames[i]], dfB[, varnames[i]],  ... :
There are no partial matches. We suggest either changing the value of cut.p or using gammaCK2par() instead
rebeccajohnson88 commented 3 years ago

@lizard12995 sorry just read this more closely

for this one:

ERROR: cannot coerce class ‘c("fastLink", "matchesLink")’ to a data.frame

That suggests an empty object is probably being returned. I'd add print statements to see which step triggers it; if it arises from something like a dataframe with no matches, you can use something like tryCatch to let the loop proceed even when no matches are found for that state: https://stackoverflow.com/questions/8093914/use-trycatch-skip-to-next-value-of-loop-upon-error (a check like if (is.null(objname)) might also work).
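
A minimal sketch of that pattern in base R. The `match_one_state` function is a hypothetical stand-in for the per-state matching call (it is not the fastLink API and simply errors when a state has no candidate investigations); the data are made up.

```r
# Hypothetical per-state matcher: errors when there is nothing to match against
match_one_state <- function(state, jobs, investigations) {
  a <- jobs[jobs$state == state, ]
  b <- investigations[investigations$state == state, ]
  if (nrow(b) == 0) stop("no candidate investigations for state ", state)
  merge(a, b, by = c("state", "employer"))
}

jobs <- data.frame(
  state = c("TX", "KY", "NY"),
  employer = c("Red Farm LLC", "Green Acres", "Hill Farm"),
  stringsAsFactors = FALSE
)
investigations <- data.frame(
  state = c("TX", "KY"),
  employer = c("Red Farm LLC", "Other Farm"),
  case_id = c("c1", "c2"),
  stringsAsFactors = FALSE
)

results <- list()
for (st in unique(jobs$state)) {
  message("Matching state: ", st)  # debug output: shows which state is running
  res <- tryCatch(
    match_one_state(st, jobs, investigations),
    error = function(e) {
      message("  skipping ", st, ": ", conditionMessage(e))
      NULL  # signal "no result" instead of stopping the loop
    }
  )
  # Skip states that errored or returned an empty match table
  if (is.null(res) || nrow(res) == 0) next
  results[[st]] <- res
}

all_matches <- do.call(rbind, results)
print(all_matches)
```

In this toy run, NY raises an error and KY returns zero rows; both are skipped, and only the TX match is kept.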

lizard12995 commented 3 years ago

I narrowed to just the TRLA states and it went through! Still adjusting parts of the script that are throwing errors but should be done by noon tomorrow.

rebeccajohnson88 commented 3 years ago

Glad that worked! Yeah, I think the output could then either be all jobs from just those six states, or all jobs, with non-matches for the states outside that set.

sounds good on timing! i'll be offline for some family stuff starting at noon tomorrow through sunday midday so no huge rush before sunday

lizard12995 commented 3 years ago

update and to-dos: