udsleeds / openinfra

Open access data for transport research: tools, modelling and simulation
https://udsleeds.github.io/openinfra/

Updated regions - population stats by geographical area #143

Open hulsiejames opened 2 years ago

hulsiejames commented 2 years ago

As discussed in #137

I've followed your workflow in the script and agree that population values at different geographical levels will be very useful for further detailed analysis like:

amount of infra per million km walked/cycled per year

I've updated the regions.R script and had a go at getting population figures by LAD and then by TA Region Name, which has mostly been a success:

LAD populations: [Screenshot from 2022-09-22 18-20-07]

TA Region populations: [Screenshot from 2022-09-22 18-01-31]

# Local Authority (LA) populations -----------------------------------------

# Groups MSOAs by LAs to obtain LA populations
LA_populations = msoas_population_joined |>
  dplyr::group_by(`LA name (2018 boundaries)`) |> 
  # For columns (All Ages --> 90+), sums MSOA pops to obtain LA pops
  dplyr::summarise(dplyr::across(`All Ages`:`90+`, sum))

# Rename to more appropriate column name
LA_populations = LA_populations %>% 
  dplyr::rename(la_name = `LA name (2018 boundaries)`)

# Plots LAs
tmap::tm_shape(LA_populations) + 
  tmap::tm_dots(col = "la_name")

sf::st_write(LA_populations, "data/LA_population_stats_by_age_UK_NI_2022.geojson")

# Using LA --> Transport regions lookup sent to me by Robin the other day, can 
# we report on region_level stats too? 

# Transport Authority (TA) Region level populations -----------------------

# Load LAD --> TA Region lookup table
lad_ta_region_lookup = readr::read_csv("data-small/lad_ta_region_lookup_atf3.csv")

# Join TA Region Names to LADs so we can group by TA region
region_populations = dplyr::left_join(lad_ta_region_lookup, LA_populations,
                                      by = c("LAD22NM" = "la_name"))
# Group by TA Region name & sum LAD populations to obtain TA Region populations
region_populations = region_populations %>% 
  dplyr::group_by(Region_name) %>% 
  dplyr::summarise(dplyr::across(`All Ages`:`90+`, sum))
sf::st_write(region_populations, "data/region_population_stats_by_age_UK_NI_2022.geojson")

NA_region_pops = region_populations %>% dplyr::filter(is.na(`All Ages`))
# A number of regions have NA - I think this is due to changes in admin boundaries over time and the fact
# that I am using 2020 population data in combination with 2022 LAD-->TA Region Name lookups. 

As mentioned above, there are a number of region names that have NA population figures. [Screenshot from 2022-09-22 18-07-45]

I do think this is the result of using two differently dated sources: I think I have used 2020 population figures (with their corresponding boundaries) together with the new 2022 LAD --> TA Region CSV you linked the other day. This should be an easy fix: just get data sources for the same year.

I'll look at this tomorrow; I've been busy today and haven't had a proper in-depth look.

But unless ONS have published all data for 2022 (I saw your tweet the other day on LADs being updated through the ONS open data portal), I may have to find an older year that has files for populations, region boundaries (both LAD and TA region) and LAD --> TA Region lookups to perform the group_bys.


Looking a bit more, it seems the NA TA region populations are due to LADs that exist in the 2022 lad_ta_region_lookup (e.g. Buckinghamshire, Dorset) but do not exist in the 2020 LA_populations dataset.

The issue arises because I have used the LA_populations dataset to group the LAs that belong to each region to derive region-level populations.

Just need same-dated datasets. Looking into it tomorrow.
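
One way to check which LAD names fail to match between the two sources is an anti-join. The snippet below is only a minimal sketch on toy tables: the column names `LAD22NM`, `Region_name`, `la_name` and `All Ages` follow the script above, but the rows, region labels and population values are made up for illustration and are not the real data.

```r
library(dplyr)

# Toy stand-ins for the real tables (rows and values are illustrative only)
lad_ta_region_lookup = tibble::tibble(
  LAD22NM = c("Leeds", "Buckinghamshire", "Dorset"),
  Region_name = c("Region A", "Region B", "Region C") # hypothetical labels
)
LA_populations = tibble::tibble(
  la_name = c("Leeds", "Aylesbury Vale", "Wycombe"),
  `All Ages` = c(100, 20, 30)
)

# LADs in the 2022 lookup that have no row in the population table:
# these are the ones that end up as NA after the left join
unmatched_lads = dplyr::anti_join(
  lad_ta_region_lookup, LA_populations,
  by = c("LAD22NM" = "la_name")
)

# The reverse check: population rows whose LA name is missing from the lookup
unmatched_pops = dplyr::anti_join(
  LA_populations, lad_ta_region_lookup,
  by = c("la_name" = "LAD22NM")
)

print(unmatched_lads$LAD22NM)
print(unmatched_pops$la_name)
```

Running both directions should show whether the mismatch is just renamed or merged authorities, which is what the differing dataset years would suggest.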

Robinlovelace commented 2 years ago

You could also use sf::st_join() to join the MSOA data (with geometries converted to points with sf::st_point_on_surface()) onto the LADs to get population estimates per LAD and per region for 2020.
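
A rough sketch of that spatial-join approach, using toy geometries rather than the real MSOA and LAD boundaries. The square polygons, the `population` column (a stand-in for `All Ages`) and the "LAD A"/"LAD B" names are assumptions made purely for illustration; the real data would also carry a CRS, which the toy objects here do not.

```r
library(sf)
library(dplyr)

# Helper to build a square polygon (hypothetical, for toy data only)
make_square = function(x0, y0, size) {
  sf::st_polygon(list(rbind(
    c(x0, y0), c(x0 + size, y0), c(x0 + size, y0 + size),
    c(x0, y0 + size), c(x0, y0)
  )))
}

# Two adjacent toy "LADs"
lads = sf::st_sf(
  LAD22NM = c("LAD A", "LAD B"),
  geometry = sf::st_sfc(make_square(0, 0, 1), make_square(1, 0, 1))
)

# Three toy "MSOAs" with made-up populations
msoas = sf::st_sf(
  population = c(10, 20, 30),
  geometry = sf::st_sfc(
    make_square(0.1, 0.1, 0.3),
    make_square(0.5, 0.5, 0.3),
    make_square(1.2, 0.2, 0.3)
  )
)

# Convert MSOA polygons to points guaranteed to lie inside each polygon
msoa_points = suppressWarnings(sf::st_point_on_surface(msoas))

# Attach the LAD each point falls in, then sum populations per LAD
msoas_with_lad = sf::st_join(msoa_points, lads)
lad_pops = msoas_with_lad |>
  sf::st_drop_geometry() |>
  dplyr::group_by(LAD22NM) |>
  dplyr::summarise(population = sum(population))

print(lad_pops)
```

Because the join is by location rather than by name, it would sidestep name mismatches between years, provided the LAD boundaries used are the ones the lookup refers to.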