Closed daviddiviny closed 3 years ago
Love it. Would be useful for lots of structures, and seems relatively easy to implement.
Do you envision a common 'correct' database of names that things are matched to? (ie whatever is in the official ABS documentation, drawn from abscorr).
And should the user need to provide the structure, eg:
weekly_payroll %>%
  mutate(anzsic = clean_anzsic(industry))
and the level? eg
weekly_payroll %>%
  mutate(anzsic = clean_anzsic(industry, level = "division"))
or
weekly_payroll %>%
  mutate(anzsic = clean_anzsic_division(industry))
The above would help if we wanted to enter the fuzzy matching world. But if not, we could potentially have a master function clean_abs_names (or whatever) that cleans the input and matches with an output across ANZSCO/ANZSIC/etc lines:
weekly_payroll %>%
  mutate(anzsic = clean_abs_names(industry))
(I think the latter is the best)
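To make the level = "division" versus clean_anzsic_division() choice concrete, here is a minimal base R sketch. The reference table, normalise_title() and the *_sketch function names are hypothetical stand-ins (not abscorr or any existing helper), and the cleaning rules are only an assumption about what would be needed:

```r
# Hypothetical reference table standing in for abscorr::anzsic
anzsic_ref <- data.frame(
  division = c("Agriculture, Forestry and Fishing", "Mining"),
  subdivision = c("Agriculture", "Coal Mining"),
  stringsAsFactors = FALSE
)

# Normalise titles: drop leading codes, punctuation, "and", extra spaces and case
normalise_title <- function(x) {
  x <- sub("^[0-9]*[. ]*([A-Z]-)?", "", x)
  x <- gsub("[[:punct:]]", " ", x)
  x <- gsub("\\band\\b", " ", x, ignore.case = TRUE, perl = TRUE)
  tolower(trimws(gsub("[[:space:]]+", " ", x)))
}

# Option 1: one function with a level argument
clean_anzsic_sketch <- function(x, level = c("division", "subdivision")) {
  level <- match.arg(level)
  titles <- unique(anzsic_ref[[level]])
  # match() is vectorised, so this can be used inside mutate()
  titles[match(normalise_title(x), normalise_title(titles))]
}

# Option 2: a thin per-level wrapper around the same logic
clean_anzsic_division_sketch <- function(x) {
  clean_anzsic_sketch(x, level = "division")
}

clean_anzsic_sketch(c("01. A-Agriculture, forestry & fishing", "Not an ANZSIC"))
clean_anzsic_sketch("coal mining", level = "subdivision")
```

Either interface can sit on top of the same matching code, so the choice is mostly about how explicit the user should have to be.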
Something like this @daviddiviny? Could be pretty easily expanded to a clean_abs_names() function:
library(tidyverse)
#> Warning: package 'ggplot2' was built under R version 4.0.2
#> Warning: package 'tibble' was built under R version 4.0.2
clean_anzsic <- function(x) {
  # cleaning function: strips leading numbers and letter codes, punctuation, "and", and case
  .clean_anzsic <- function(x) {
    x %>%
      str_remove_all("^[0-9]*") %>%
      str_remove_all("^[\\.\\s]*") %>%
      str_remove_all("^[A-Z]-") %>%
      str_remove_all("[:punct:]") %>%
      str_remove_all("\\s?and") %>%
      str_replace_all("[\\s]{2,100}", " ") %>%
      tolower()
    # etc etc etc
  }
  # get reference database to match
  anzsic_basic <- abscorr::anzsic %>%
    distinct(anzsic_division) %>%
    mutate(anzsic_division_basic = .clean_anzsic(anzsic_division))
  # clean original input
  clean_x <- .clean_anzsic(x)
  # matching: match() is vectorised, so this also works on a whole column inside mutate();
  # inputs with no match come back as NA
  anzsic_basic$anzsic_division[match(clean_x, anzsic_basic$anzsic_division_basic)]
}
clean_anzsic("01. A-Agriculture, forestry & fishing")
#> [1] "Agriculture, Forestry and Fishing"
clean_anzsic("Agriculture, forestry & fishing")
#> [1] "Agriculture, Forestry and Fishing"
clean_anzsic("01. A-Agriculture, forestry & fishing") == clean_anzsic("Agriculture, forestry & fishing")
#> [1] TRUE
clean_anzsic("Not an ANZSIC")
#> [1] NA
Created on 2021-04-17 by the reprex package (v1.0.0.9001)
@daviddiviny do you want to do this?
Yep.
Hi @MattCowgill and @wfmackey and anyone else.
I'd appreciate your feedback on how this function is shaping up. It is on the clean_anzsic branch.
I've written a bunch of helper functions in the clean_helpers.R script. The idea is that it should make it easy to create new clean_ functions using the make_dictionary and clean_titles functions.
I haven't written functions for a package that are this abstract before, so I'd appreciate your input on how they are designed etc.
I think the major question I have is whether return_na should be an argument (set to TRUE) or whether it should always return NA.
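For discussion, a minimal sketch of what a return_na argument could look like. This is not the code in clean_helpers.R: lookup_titles() and the dictionary are hypothetical, and it assumes that return_na = FALSE means "keep the original input when nothing matches":

```r
# Hypothetical dictionary: names are cleaned keys, values are official titles
dictionary <- c(
  "agriculture forestry fishing" = "Agriculture, Forestry and Fishing",
  "mining" = "Mining"
)

lookup_titles <- function(x, dictionary, return_na = TRUE) {
  # Vectorised lookup; unmatched inputs are NA at this point
  matched <- unname(dictionary[match(tolower(x), names(dictionary))])
  if (!return_na) {
    # Assumed meaning of return_na = FALSE: fall back to the original input
    unmatched <- is.na(matched)
    matched[unmatched] <- x[unmatched]
  }
  matched
}

lookup_titles(c("Mining", "Widgets"), dictionary)
lookup_titles(c("Mining", "Widgets"), dictionary, return_na = FALSE)
```

Defaulting to NA makes failed matches easy to spot with is.na(), while the fallback is handy when only some rows need cleaning.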
Also, should the functionality of strayr be imported and refactored using this approach @MattCowgill?
Hi @daviddiviny, your approach in clean_helpers looks good to me.
Yes, I think strayr's functions should probably be brought into this package (and maybe the name donated as well?). And yep, refactoring using your functions makes sense to me.
To dos:
I would like to write a function that is equivalent to the matching functionality of the strayr package. This is currently an open issue in readabs. This would help with the issue that the part of the ABS that publishes the weekly payroll and wages data likes to prepend numbers and letters and take an inconsistent approach to capitalisation and use of ampersands.
A possible approach would be:
We could also add fuzzy-matching but this should not be the default.
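A rough sketch of opt-in fuzzy matching, using base R's utils::adist() for edit distance. The function name, the fuzzy and max_dist arguments, and the reference titles are all hypothetical; exact matching is tried first and fuzzy matching only runs when explicitly requested:

```r
# Hypothetical reference titles
ref_titles <- c("Agriculture, Forestry and Fishing", "Mining", "Manufacturing")

match_titles <- function(x, titles, fuzzy = FALSE, max_dist = 2) {
  # Exact (case-insensitive) match first
  idx <- match(tolower(x), tolower(titles))
  todo <- which(is.na(idx) & !is.na(x))
  if (fuzzy && length(todo) > 0) {
    # Edit distance from each unmatched input to every reference title
    d <- utils::adist(tolower(x[todo]), tolower(titles))
    best <- apply(d, 1, which.min)
    # Accept the nearest title only if it is within max_dist edits
    close_enough <- d[cbind(seq_along(todo), best)] <= max_dist
    idx[todo[close_enough]] <- best[close_enough]
  }
  titles[idx]
}

match_titles(c("mining", "Minning"), ref_titles)
#> [1] "Mining" NA
match_titles(c("mining", "Minning"), ref_titles, fuzzy = TRUE)
#> [1] "Mining" "Mining"
```

Keeping fuzzy = FALSE as the default means a typo never silently maps to the wrong category unless the user asks for it.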
I think the function should return a vector for the purposes of including in mutate. Thoughts?