Write a function to match messily-named ABS categories (similar to `strayr` for states)

runapp-aus / strayr

A catalogue of ready-to-use ABS coding structures. Package documentation can be found here: https://runapp-aus.github.io/strayr/


Write a function to match messily-named ABS categories (similar to `strayr` for states) #10

Closed daviddiviny closed 3 years ago

daviddiviny commented 3 years ago

I would like to write a function that is equivalent to the matching functionality of the strayr package.

This is currently an open issue in readabs.

This would help with a recurring problem: the part of the ABS that publishes the weekly payroll and wages data prepends numbers and letters to category names, and is inconsistent about capitalisation and the use of ampersands. For example:

Weekly Payroll Jobs and Wages: 01. A-Agriculture, forestry & fishing

6291.0.55.001 Labour Force, Australia, Detailed: Agriculture, Forestry and Fishing

A possible approach would be:

[x] Remove leading digits and capital letters
[x] Remove punctuation
[x] Convert to lower case
[x] Remove stop words like 'and'
[x] Look for identical matches

We could also add fuzzy-matching but this should not be the default.
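A minimal sketch of what opt-in fuzzy matching could look like, using edit distance from base R's `utils::adist()`. The function name `match_category`, its `fuzzy` and `max_dist` arguments, and the three-item reference vector are all hypothetical, chosen only for illustration; exact matching stays the default.

```r
# Hypothetical helper: exact matching by default, edit-distance matching only on request.
match_category <- function(x, reference, fuzzy = FALSE, max_dist = 2) {
  key <- tolower(x)
  ref_key <- tolower(reference)

  # Exact (case-insensitive) matches first; match() is vectorised.
  idx <- match(key, ref_key)

  if (fuzzy && anyNA(idx)) {
    todo <- which(is.na(idx) & !is.na(key))
    if (length(todo) > 0) {
      # adist() returns a matrix of edit distances (rows: inputs, columns: reference).
      d <- utils::adist(key[todo], ref_key)
      best <- apply(d, 1, which.min)
      best_dist <- d[cbind(seq_along(todo), best)]
      # Accept the closest reference name only if it is within max_dist edits.
      idx[todo] <- ifelse(best_dist <= max_dist, best, NA_integer_)
    }
  }

  reference[idx]
}

# Illustrative reference vector (assumption, not the full ANZSIC list)
divisions <- c("Agriculture, Forestry and Fishing", "Mining", "Manufacturing")

match_category(c("mining", "Manufacturng"), divisions)
match_category(c("mining", "Manufacturng"), divisions, fuzzy = TRUE)
```

With `fuzzy = FALSE` the misspelled input stays unmatched; with `fuzzy = TRUE` it resolves to the nearest name within `max_dist` edits. Keeping this behind an argument means a near miss never silently becomes a match.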

I think the function should return a vector, so that it can be used inside mutate.

Thoughts?

wfmackey commented 3 years ago

Love it. Would be useful for lots of structures, and seems relatively easy to implement.

Do you envision a common 'correct' database of names that inputs are matched against (i.e. whatever is in the official ABS documentation, drawn from abscorr)?

And should the user need to provide the structure, eg:

weekly_payroll %>% 
  mutate(anzsic = clean_anzsic(industry))

and the level? eg

weekly_payroll %>% 
  mutate(anzsic = clean_anzsic(industry, level = "division"))

weekly_payroll %>% 
  mutate(anzsic = clean_anzsic_division(industry))

The above would help if we wanted to enter the fuzzy-matching world. If not, we could have a master function, clean_abs_names (or whatever), that cleans the input and matches it against names across all structures (ANZSCO, ANZSIC, etc.):

weekly_payroll %>% 
  mutate(anzsic = clean_abs_names(industry))
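
One way such a master function might work, sketched in base R. Everything here is an assumption for illustration: the name `clean_abs_names_sketch`, the simplified `normalise()` helper, and the tiny `abs_reference` list, which stands in for the official structures.

```r
# Hypothetical reference names; a real version would draw these from the official ABS structures.
abs_reference <- list(
  anzsic = c("Agriculture, Forestry and Fishing", "Mining"),
  anzsco = c("Managers", "Professionals")
)

# Simplified normaliser: lower case, drop punctuation and the word "and", squeeze whitespace.
normalise <- function(x) {
  x <- tolower(x)
  x <- gsub("[[:punct:]]", " ", x, perl = TRUE)
  x <- gsub("\\band\\b", " ", x, perl = TRUE)
  trimws(gsub("\\s+", " ", x, perl = TRUE))
}

# Pool the names from every structure and return the official spelling for each input.
clean_abs_names_sketch <- function(x, structures = abs_reference) {
  all_names <- unlist(structures, use.names = FALSE)
  all_names[match(normalise(x), normalise(all_names))]
}

clean_abs_names_sketch(c("Agriculture, forestry & fishing", "PROFESSIONALS", "Not a category"))
```

A pooled lookup like this resolves a name that appears in more than one structure to whichever comes first, which is one reason a structure or level argument becomes useful once matching is no longer exact.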

wfmackey commented 3 years ago

(I think the latter is the best)

wfmackey commented 3 years ago

Something like this @daviddiviny? Could be pretty easily expanded to a clean_abs_names() function:

library(tidyverse)
#> Warning: package 'ggplot2' was built under R version 4.0.2
#> Warning: package 'tibble' was built under R version 4.0.2

clean_anzsic <- function(x) {

  # cleaning function
  .clean_anzsic <- function(x) {
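    x %>% 
      str_remove_all("^[0-9]*") %>%          # leading digits, e.g. "01"
      str_remove_all("^[\\.\\s]*") %>%       # then leading full stops and whitespace
      str_remove_all("^[A-Z]-") %>%          # then a division-letter prefix, e.g. "A-"
      str_remove_all("[:punct:]") %>%        # punctuation, including "&"
      str_remove_all("\\s?\\band\\b") %>%    # the stop word "and" (whole word only)
      str_replace_all("\\s{2,}", " ") %>%    # collapse runs of whitespace
      tolower()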
    # etc etc etc
  }

  # get reference database to match
  anzsic_basic <- abscorr::anzsic %>% 
    distinct(anzsic_division) %>% 
    mutate(anzsic_division_basic = .clean_anzsic(anzsic_division))

  # clean original input
  clean_x <- .clean_anzsic(x)

  # matching: match() is vectorised, so this also works on a whole column
  # inside mutate(); unmatched inputs come back as NA
  ret <- anzsic_basic$anzsic_division[
    match(clean_x, anzsic_basic$anzsic_division_basic)
  ]

  return(ret)

}

clean_anzsic("01. A-Agriculture, forestry & fishing")
#> [1] "Agriculture, Forestry and Fishing"
clean_anzsic("Agriculture, forestry & fishing")
#> [1] "Agriculture, Forestry and Fishing"

clean_anzsic("01. A-Agriculture, forestry & fishing") == clean_anzsic("Agriculture, forestry & fishing")
#> [1] TRUE

clean_anzsic("Not an ANZSIC")
#> [1] NA

Created on 2021-04-17 by the reprex package (v1.0.0.9001)

wfmackey commented 3 years ago

@daviddiviny do you want to do this?

daviddiviny commented 3 years ago

Yep.

daviddiviny commented 3 years ago

Hi @MattCowgill and @wfmackey and anyone else.

I'd appreciate your feedback on how this function is shaping up. It is on the clean_anzsic branch.

I've written a bunch of helper functions in the clean_helpers.R script. The idea is that it should make it easy to create new clean_ functions through using the make_dictionary and the clean_titles functions.

I haven't written functions this abstract for a package before, so I'd appreciate your input on how they are designed etc.

daviddiviny commented 3 years ago

I think the major question I have is whether return_na should be an argument (defaulting to TRUE) or whether the function should always return NA when there is no match.
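
To make the two options concrete, here is a minimal sketch of how a `return_na` argument could behave. The function `match_titles` and the two-entry `dictionary` are hypothetical and are not the helpers on the clean_anzsic branch.

```r
# Hypothetical lookup: names of `dictionary` are cleaned keys, values are official titles.
match_titles <- function(x, dictionary, return_na = TRUE) {
  matched <- unname(dictionary[tolower(x)])
  if (!return_na) {
    # Keep the original input wherever no match was found.
    unmatched <- is.na(matched)
    matched[unmatched] <- x[unmatched]
  }
  matched
}

dictionary <- c("mining" = "Mining", "manufacturing" = "Manufacturing")

match_titles(c("MINING", "Not an industry"), dictionary)
match_titles(c("MINING", "Not an industry"), dictionary, return_na = FALSE)
```

Returning NA makes failed matches easy to spot and filter; passing the input through keeps the column usable but can hide the fact that a value was never standardised.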

Also, should the functionality of strayr be imported and refactored using this approach @MattCowgill ?

MattCowgill commented 3 years ago

Hi @daviddiviny, your approach in clean_helpers looks good to me. Yes, I think strayr's functions should probably be brought into this package (and maybe the name donated as well?). And yep, refactoring using your functions makes sense to me.

daviddiviny commented 3 years ago

To dos:

[x] anzsic
[x] anzsco
[x] strayr
[x] asced