idem-lab / conmat

Create Contact Matrices from Population Data
https://idem-lab.github.io/conmat/dev/

finish cleaning ABS education data #14

Open njtierney opened 2 years ago

njtierney commented 2 years ago

We were running into issues getting the education data cleaned up, see here: https://github.com/njtierney/conmat/blob/master/data-raw/clean-education.R#L120

njtierney commented 2 years ago

This is demonstrated by the figure that the following code produces:

library(conmat)
library(ggplot2)
ggplot(
  abs_education_state_2020,
  aes(
    x = population,
    y = population_interpolated
  )
) +
  geom_point() +
  geom_abline() +
  theme(aspect.ratio = 1) + 
  facet_wrap(~state,
             ncol = 4)

Created on 2021-09-08 by the reprex package (v2.0.1)
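
The figure compares the reported population with the interpolated population, so points off the 1:1 line are rows where the two disagree. A numeric companion to the plot could look like the sketch below. It is only an illustration: `toy` is a made-up data frame standing in for `abs_education_state_2020` (only the column names follow the plotting code), and the 5% tolerance is an arbitrary choice.

```r
# Made-up stand-in for abs_education_state_2020 (hypothetical values).
toy <- data.frame(
  state = c("WA", "WA", "VIC", "VIC"),
  age = c(5, 6, 5, 6),
  population = c(100, 100, 200, 200),
  population_interpolated = c(100, 150, 198, 200)
)

# Relative difference between interpolated and reported population.
toy$rel_diff <- (toy$population_interpolated - toy$population) / toy$population

# Rows that sit more than 5% away from the 1:1 line (arbitrary tolerance).
off_line <- toy[abs(toy$rel_diff) > 0.05, ]
off_line
```

Listing the rows this way shows which state and age combinations drive the scatter away from the line, which is harder to read off a faceted plot.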

Session info

``` r
sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.1.0 (2021-05-18)
#>  os       macOS Big Sur 10.16
#>  system   x86_64, darwin17.0
#>  ui       X11
#>  language (EN)
#>  collate  en_AU.UTF-8
#>  ctype    en_AU.UTF-8
#>  tz       Australia/Perth
#>  date     2021-09-08
#>
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package     * version    date       lib source
#>  assertthat    0.2.1      2019-03-21 [1] CRAN (R 4.1.0)
#>  backports     1.2.1      2020-12-09 [1] CRAN (R 4.1.0)
#>  cli           3.0.1      2021-07-17 [1] CRAN (R 4.1.0)
#>  colorspace    2.0-2      2021-06-24 [1] CRAN (R 4.1.0)
#>  conmat      * 0.0.0.9000 2021-09-08 [1] local
#>  crayon        1.4.1      2021-02-08 [1] CRAN (R 4.1.0)
#>  curl          4.3.2      2021-06-23 [1] CRAN (R 4.1.0)
#>  DBI           1.1.1      2021-01-15 [1] CRAN (R 4.1.0)
#>  digest        0.6.27     2020-10-24 [1] CRAN (R 4.1.0)
#>  dplyr         1.0.7      2021-06-18 [1] CRAN (R 4.1.0)
#>  ellipsis      0.3.2      2021-04-29 [1] CRAN (R 4.1.0)
#>  evaluate      0.14       2019-05-28 [1] CRAN (R 4.1.0)
#>  fansi         0.5.0      2021-05-25 [1] CRAN (R 4.1.0)
#>  farver        2.1.0      2021-02-28 [1] CRAN (R 4.1.0)
#>  fs            1.5.0      2020-07-31 [1] CRAN (R 4.1.0)
#>  generics      0.1.0      2020-10-31 [1] CRAN (R 4.1.0)
#>  ggplot2     * 3.3.5      2021-06-25 [1] CRAN (R 4.1.0)
#>  glue          1.4.2      2020-08-27 [1] CRAN (R 4.1.0)
#>  gtable        0.3.0      2019-03-25 [1] CRAN (R 4.1.0)
#>  highr         0.9        2021-04-16 [1] CRAN (R 4.1.0)
#>  htmltools     0.5.1.1    2021-01-22 [1] CRAN (R 4.1.0)
#>  httr          1.4.2      2020-07-20 [1] CRAN (R 4.1.0)
#>  knitr         1.33       2021-04-24 [1] CRAN (R 4.1.0)
#>  labeling      0.4.2      2020-10-20 [1] CRAN (R 4.1.0)
#>  lifecycle     1.0.0      2021-02-15 [1] CRAN (R 4.1.0)
#>  magrittr      2.0.1      2020-11-17 [1] CRAN (R 4.1.0)
#>  mime          0.11       2021-06-23 [1] CRAN (R 4.1.0)
#>  munsell       0.5.0      2018-06-12 [1] CRAN (R 4.1.0)
#>  pillar        1.6.2      2021-07-29 [1] CRAN (R 4.1.0)
#>  pkgconfig     2.0.3      2019-09-22 [1] CRAN (R 4.1.0)
#>  purrr         0.3.4      2020-04-17 [1] CRAN (R 4.1.0)
#>  R6            2.5.1      2021-08-19 [1] CRAN (R 4.1.0)
#>  reprex        2.0.1      2021-08-05 [1] CRAN (R 4.1.0)
#>  rlang         0.4.11     2021-04-30 [1] CRAN (R 4.1.0)
#>  rmarkdown     2.9        2021-06-15 [1] CRAN (R 4.1.0)
#>  rstudioapi    0.13       2020-11-12 [1] CRAN (R 4.1.0)
#>  scales        1.1.1      2020-05-11 [1] CRAN (R 4.1.0)
#>  sessioninfo   1.1.1      2018-11-05 [1] CRAN (R 4.1.0)
#>  stringi       1.7.3      2021-07-16 [1] CRAN (R 4.1.0)
#>  stringr       1.4.0      2019-02-10 [1] CRAN (R 4.1.0)
#>  styler        1.4.1      2021-03-30 [1] CRAN (R 4.1.0)
#>  tibble        3.1.3      2021-07-23 [1] CRAN (R 4.1.0)
#>  tidyselect    1.1.1      2021-04-30 [1] CRAN (R 4.1.0)
#>  utf8          1.2.2      2021-07-24 [1] CRAN (R 4.1.0)
#>  vctrs         0.3.8      2021-04-29 [1] CRAN (R 4.1.0)
#>  withr         2.4.2      2021-04-18 [1] CRAN (R 4.1.0)
#>  xfun          0.24       2021-06-15 [1] CRAN (R 4.1.0)
#>  xml2          1.3.2      2020-04-23 [1] CRAN (R 4.1.0)
#>  yaml          2.2.1      2020-02-01 [1] CRAN (R 4.1.0)
#>
#> [1] /Library/Frameworks/R.framework/Versions/4.1/Resources/library
```

aarathybabu97 commented 2 years ago

Mentioned in #15

Discussing with Aarathy some notes on this:

@njtierney Proportion of school goers in each age group in 2020. I am not using complete(0:24) here, because the interpolated population for ages 24+ would then be excluded.

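library(conmat)
library(dplyr)
library(tidyr)

# abs_state_age_lookup is assumed to already exist in the session
# (state, age, population_interpolated); it is not created here.
abs_education_state %>%
  filter(year == 2020) %>%
  # total full-time and part-time students per state and single year of age
  group_by(state, age) %>%
  summarise(population_educated = sum(n_full_and_part_time)) %>%
  ungroup() %>%
  # add every age from 0 to 100, with zero students where none are recorded
  complete(
    state,
    age = 0:100,
    fill = list(population_educated = 0)
  ) %>%
  # bin single years of age into school age groups
  mutate(school_age_group = case_when(
    between(age, 0, 1) ~ "0-1",
    between(age, 2, 4) ~ "2-4",
    between(age, 5, 16) ~ "5-16",
    between(age, 17, 18) ~ "17-18",
    between(age, 19, 20) ~ "19-20",
    TRUE ~ "21+"
  )) %>%
  mutate(school_age_group = factor(school_age_group, levels = c(
    "0-1", "2-4", "5-16", "17-18",
    "19-20", "21+"
  ))) %>%
  # attach the interpolated population for each state and age
  left_join(abs_state_age_lookup,
            by = c(
              "state",
              "age"
            )
  ) %>%
  # totals across all states, then the share of each group in education
  group_by(school_age_group) %>%
  summarise(population_educated = sum(population_educated, na.rm = TRUE),
            population_interpolated = sum(population_interpolated, na.rm = TRUE)) %>%
  mutate(prop = population_educated / population_interpolated)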

#> # A tibble: 6 x 4
#>   school_age_group population_educated population_interpolated     prop
#>   <fct>                          <dbl>                   <dbl>    <dbl>
#> 1 0-1                                0                 622638. 0       
#> 2 2-4                             3224                 940352. 0.00343 
#> 3 5-16                         3696425                3766882. 0.981   
#> 4 17-18                         299641                 630120. 0.476   
#> 5 19-20                           5166                 644791. 0.00801 
#> 6 21+                             2518               19086747. 0.000132
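
The remark about `complete()` can be illustrated on toy data. This is a minimal base R sketch, not the code used in the package: the data frames `educated` and `lookup`, the helper `denominator_21_plus()` and all values are made up, and `merge()` stands in for `complete()` and `left_join()`. Because the left join keeps only the ages present on the left-hand side, completing ages to 0:24 leaves most older ages out of the denominator for the 21+ group.

```r
# Toy data (hypothetical values): one state, students recorded at ages 5 and 30.
educated <- data.frame(
  state = "WA",
  age = c(5, 30),
  population_educated = c(90, 1)
)

# Toy lookup: interpolated population of 100 for every age from 0 to 100.
lookup <- data.frame(
  state = "WA",
  age = 0:100,
  population_interpolated = 100
)

# Hypothetical helper: expand to an age grid, fill missing counts with 0,
# left join the lookup, and sum the interpolated population for ages 21+.
denominator_21_plus <- function(ages) {
  grid <- data.frame(state = "WA", age = ages)
  # full join, so existing rows outside the grid are kept
  filled <- merge(grid, educated, by = c("state", "age"), all = TRUE)
  filled$population_educated[is.na(filled$population_educated)] <- 0
  # left join: only ages present in `filled` pick up a population
  joined <- merge(filled, lookup, by = c("state", "age"), all.x = TRUE)
  sum(joined$population_interpolated[joined$age >= 21])
}

denominator_21_plus(0:24)   # ages 21 to 24, plus the recorded age 30
denominator_21_plus(0:100)  # every age from 21 to 100
```

With the short grid the 21+ denominator covers only five ages, so the proportion for that group would be overstated; the 0:100 grid covers all eighty.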
aarathybabu97 commented 2 years ago

Plotting the school-goer population against the interpolated population for the given age groups. The outlier (21+) is likely caused by the small number of school goers in that age group.

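library(ggplot2)

# print axis labels in full rather than in scientific notation
options(scipen = 999)

# school_prop is not defined in this thread; it is assumed to hold
# population_educated and population_interpolated per state and
# school_age_group, since the plot facets by state.
ggplot(
  school_prop,
  aes(
    x = population_educated,
    y = population_interpolated,
    color = school_age_group
  )
) +
  geom_point() +
  geom_abline() +
  theme(aspect.ratio = 1) +
  facet_wrap(~state,
             ncol = 4,
             scales = "free_x")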
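
The plot facets by state, so `school_prop` needs one row per state and age group, whereas the summary printed earlier in the thread is grouped by age group only. A minimal base R sketch of such a per-state table follows. The data frame `joined` and its values are hypothetical; only the column names follow the thread.

```r
# Hypothetical joined data: one row per state, age group and single age.
joined <- data.frame(
  state = c("WA", "WA", "WA", "VIC", "VIC", "VIC"),
  school_age_group = c("5-16", "5-16", "17-18", "5-16", "5-16", "17-18"),
  population_educated = c(90, 95, 40, 180, 190, 100),
  population_interpolated = c(100, 100, 100, 200, 200, 200)
)

# Sum both populations within each state and age group.
school_prop <- aggregate(
  cbind(population_educated, population_interpolated) ~ state + school_age_group,
  data = joined,
  FUN = sum
)

# Share of each group's interpolated population that is in education.
school_prop$prop <- school_prop$population_educated /
  school_prop$population_interpolated

school_prop
```

In the dplyr pipeline from the thread, the equivalent change would be to group by both `state` and `school_age_group` before the final `summarise()`.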