consistent length of anzsic codes

peteowen-bzi commented 1 year ago

noticed there are no leading zeroes on any of the ANZSIC codes, which goes against the official structure published by the ABS, shown here:

https://www.abs.gov.au/statistics/classifications/australian-and-new-zealand-standard-industrial-classification-anzsic/2006-revision-2-0/numbering-system-and-titles

https://www.abr.gov.au/government-agencies/accessing-abr-data/abr-data-dictionary/appendix-f-division-mapping-data

I'm currently having to run the following before doing any mapping

library(dplyr)

# pad each code to its official width (sprintf() already returns character)
strayr::anzsic2006 %>%
  mutate(
    anzsic_subdivision_code = sprintf("%02d", as.integer(anzsic_subdivision_code)),
    anzsic_group_code = sprintf("%03d", as.integer(anzsic_group_code)),
    anzsic_class_code = sprintf("%04d", as.integer(anzsic_class_code))
  )

surely the table should be saved in this format, unless i'm missing something?
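A minimal sketch of the same padding written as a reusable helper, in case it is useful while the table stays as it is. `pad_code` is a hypothetical name (not part of strayr), and the toy tibble below only stands in for `strayr::anzsic2006`; it assumes the codes arrive as unpadded strings or integers.

```r
library(dplyr)

# Hypothetical helper (not part of strayr): left-pad codes with zeros to a fixed width.
# str_pad() keeps NA as NA, whereas sprintf("%02d", NA) would return the string "NA".
pad_code <- function(x, width) {
  stringr::str_pad(as.character(x), width = width, side = "left", pad = "0")
}

# Toy stand-in for strayr::anzsic2006 with unpadded codes
toy <- tibble::tibble(
  anzsic_subdivision_code = c("1", "11"),
  anzsic_group_code       = c("11", "111"),
  anzsic_class_code       = c("111", "1111")
)

padded <- toy %>%
  mutate(
    anzsic_subdivision_code = pad_code(anzsic_subdivision_code, 2),
    anzsic_group_code       = pad_code(anzsic_group_code, 3),
    anzsic_class_code       = pad_code(anzsic_class_code, 4)
  )
```

Codes that already have the right width pass through unchanged, so the helper is safe to run on a table that has been padded before.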

williamlai2 commented 1 year ago

This is sort of related to https://github.com/runapp-aus/strayr/issues/91 and https://github.com/asiripanich/anzsic/pull/5 (as this is where the ANZSIC data is currently being sourced from).

I think if the second gets updated, then the [create_anzsic2006.R](https://github.com/runapp-aus/strayr/blob/master/data-raw/create_anzsic2006.R) function (and others) would need a minor update to get it to work.

Otherwise, the sourcing could be changed to another location.

wfmackey commented 1 year ago

hi, sorry for the slow reply.

agree that making this consistent with ABS documentation is better. strayr functions can be updated if we were to go down this route.

the only issue is that it would be a backward-incompatible change to the anzsic2006 table, which will break pipelines that use it. two options are:

1. make the change and issue a warning when anzsic2006 is loaded (neater; not backwards compatible).
2. add the correct leading zero to additional columns with name suffixes _lz, eg anzsic_division_code_lz (messy; non-breaking change).

i think (1) is probably better, but welcome any other thoughts
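for comparison, a rough sketch of what option (2) could look like. this is only an illustration on a toy tibble, not the actual strayr build script; `code_widths` and the toy data are assumptions made up for the example.

```r
library(dplyr)

# Toy stand-in for the anzsic2006 table with unpadded codes
toy <- tibble::tibble(
  anzsic_subdivision_code = c("1", "11"),
  anzsic_class_code       = c("111", "1111")
)

# Assumed target width for each code column (hypothetical lookup)
code_widths <- c(anzsic_subdivision_code = 2, anzsic_class_code = 4)

# Option 2: keep the original columns and add zero-padded copies with an _lz suffix
with_lz <- toy %>%
  mutate(across(
    all_of(names(code_widths)),
    ~ stringr::str_pad(.x, width = code_widths[[cur_column()]], pad = "0"),
    .names = "{.col}_lz"
  ))
```

the original columns are left untouched, which is what makes this non-breaking, at the cost of doubling the number of code columns.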

peteowen-bzi commented 1 year ago

> This is sort of related to #91 and asiripanich/anzsic#5 (as this is where the ANZSIC data is currently being sourced from).
>
> I think if the second gets updated, then the [create_anzsic2006.R](https://github.com/runapp-aus/strayr/blob/master/data-raw/create_anzsic2006.R) function (and others) would need a minor update to get it to work.
>
> Otherwise, the sourcing could be changed to another location.

Wonder if it makes more sense to extract anzsics straight from the ABS rather than through another source.

following script could give us what we want

library(tidyverse)
library(rvest)

url <- "https://www.abs.gov.au/statistics/classifications/australian-and-new-zealand-standard-industrial-classification-anzsic/2006-revision-2-0/numbering-system-and-titles/division-subdivision-group-and-class-codes-and-titles"

df <- url %>%
  rvest::read_html() %>%
  rvest::html_table()

anzsic_2006_temp <-
  purrr::list_rbind(df)

colnames(anzsic_2006_temp) <- c("anzsic_division_code", "anzsic_subdivision_code", "anzsic_group_code", "anzsic_class_code", "title")

first_row <-
  as.data.frame(t(colnames(df[[1]])))

colnames(first_row) <- c("anzsic_division_code", "anzsic_subdivision_code", "anzsic_group_code", "anzsic_class_code", "title")

anzsic_2006_total <-
  dplyr::bind_rows(first_row, anzsic_2006_temp)

anzsic_2006_total[anzsic_2006_total == ""] <- NA

anzsic_2006_fill <-
  anzsic_2006_total %>%
  tidyr::fill(colnames(anzsic_2006_total), .direction = c("down"))

# look up the title for each level of the hierarchy
anzsic_2006_class <-
  anzsic_2006_total %>%
  dplyr::filter(stringr::str_detect(anzsic_class_code, "^[:digit:]+$")) %>%
  dplyr::select(anzsic_class_code, anzsic_class_title = title)

anzsic_2006_group <-
  anzsic_2006_total %>%
  dplyr::filter(stringr::str_detect(anzsic_group_code, "^[:digit:]+$")) %>%
  dplyr::select(anzsic_group_code, anzsic_group_title = title)

anzsic_2006_subdivision <-
  anzsic_2006_total %>%
  dplyr::filter(stringr::str_detect(anzsic_subdivision_code, "^[:digit:]+$")) %>%
  dplyr::select(anzsic_subdivision_code, anzsic_subdivision_title = title)

anzsic_2006_division <-
  anzsic_2006_total %>%
  dplyr::filter(stringr::str_detect(anzsic_division_code, "^[:alpha:]+$")) %>%
  dplyr::select(anzsic_division_code, anzsic_division_title = title)

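# join each level's title back onto the filled codes (explicit keys avoid join messages)
anzsic_2006_final <-
  anzsic_2006_fill %>%
  dplyr::left_join(anzsic_2006_division, by = "anzsic_division_code") %>%
  dplyr::left_join(anzsic_2006_subdivision, by = "anzsic_subdivision_code") %>%
  dplyr::left_join(anzsic_2006_group, by = "anzsic_group_code") %>%
  dplyr::left_join(anzsic_2006_class, by = "anzsic_class_code") %>%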
  dplyr::filter(!is.na(anzsic_class_title)) %>%
  dplyr::select(
    anzsic_division_code, anzsic_division_title, anzsic_subdivision_code, anzsic_subdivision_title,
    anzsic_group_code, anzsic_group_title, anzsic_class_code, anzsic_class_title
  ) %>%
  dplyr::as_tibble()

Maybe there's a better way of cleaning the data from the messy CSV (I hear @MattCowgill is an expert ;) )
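One possible simplification, sketched on a toy table rather than the real ABS page: fill only the parent columns downwards and then keep the rows that carry a class code. This assumes the scraped table has exactly one code per row with blanks elsewhere, which I have not verified against the live page. It also avoids filling the class column itself, which could otherwise let division, subdivision and group rows inherit the previous class code. The titles below are placeholders, not ABS titles.

```r
library(dplyr)

# Toy stand-in for the scraped table: one code per row, NA elsewhere (assumed layout)
raw <- tibble::tibble(
  anzsic_division_code    = c("A", NA, NA, NA, NA, NA),
  anzsic_subdivision_code = c(NA, "01", NA, NA, NA, NA),
  anzsic_group_code       = c(NA, NA, "011", NA, "012", NA),
  anzsic_class_code       = c(NA, NA, NA, "0111", NA, "0121"),
  title = c("division A", "subdivision 01", "group 011",
            "class 0111", "group 012", "class 0121")
)

classes <- raw %>%
  # carry each parent code down to the rows beneath it
  tidyr::fill(anzsic_division_code, anzsic_subdivision_code, anzsic_group_code,
              .direction = "down") %>%
  # keep only rows that are themselves classes
  filter(!is.na(anzsic_class_code)) %>%
  rename(anzsic_class_title = title)
```

The division, subdivision and group titles could then be joined on as in the script above.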

peteowen-bzi commented 1 year ago

> make the change and issue a warning when anzsic2006 is loaded (neater; not backwards compatible).

probably biased cause i raised the issue - but yeah I think option 1 is better.

wfmackey commented 1 year ago

thanks @peteowen-bzi

runapp-aus / strayr

consistent length of anzsic codes #93