
title: "Making a thematic map of New Zealand in R" author: "David Hood" date: "3/25/2018" output: html_document: keep_md: yes

Goal

This started life as a walk-through on making thematic maps of New Zealand using official boundary files in R, but got a bit distracted by the data along the way. It still fills that primary purpose, though.

I have included all the brutal realities of organising the data to get it ready to graph, so if you had well-organised data and an already set-up machine you would be skipping most of the tutorial.

sf prerequisites

Beyond typical R packages, we are using the sf package to understand geographic information files. This means installing some extra software on the computer, and the steps differ between operating systems. If you have not already installed sf, you need to read the installation instructions:

https://r-spatial.github.io/sf/
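
On most systems, once the system libraries sf depends on (GDAL, GEOS, and PROJ) are in place, the R package itself installs from CRAN in the usual way:

install.packages("sf")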

Libraries

With the prerequisites sorted, to follow along with every part of this walk-through the following libraries are needed: rvest, readxl, dplyr, purrr, stringr, tidyr, sf, ggplot2, and viridis. If you do not already have these, you will need to install them with the install.packages() command in R, for example:
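install.packages(c("rvest", "readxl", "dplyr", "purrr", "stringr",
                   "tidyr", "sf", "ggplot2", "viridis"))

This code loads the libraries once they are installed: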

library(rvest)
library(readxl)
library(dplyr)
library(purrr)
library(stringr)
library(tidyr)
library(ggplot2)
library(sf)
library(viridis)

The data

We are making a map of the number of antibacterial (antibiotic) prescriptions per capita for each District Health Board area of New Zealand. This is being done as a complete example of assembling data to match official boundaries and mapping it using that boundary information. However, it is going to take information from a number of sources.

New Zealand has 20 District Health Boards, making them a good size for displaying on a map covering the country.

The New Zealand Ministry of Health has recently (at the time of writing, March 2018) made available aggregate data about the number of prescriptions by District Health Board and year. In New Zealand, Pharmac funds effective medicines from general taxation, which creates an audit trail of what has been prescribed. This is available from:

https://minhealthnz.shinyapps.io/datapharm-beta

using the R commands:

if(!file.exists("perscriptions.zip")){
  download.file("https://minhealthnz.shinyapps.io/datapharm-beta/_w_afb19700/session/f628744ef0b03588f50be51080e7cd44/download/downloadFullData.t1?w=afb19700", destfile="perscriptions.zip")
}
unzip("perscriptions.zip")

Demographic information about each District Health Board area is available from Stats New Zealand. Rather than downloading the full census data, we are using a summary spreadsheet, available from:

http://archive.stats.govt.nz/Census/2013-census/data-tables/dhb-tables.aspx

using the R commands:

if(!file.exists("DHB_demography.xls")){
  download.file("http://archive.stats.govt.nz/~/media/Statistics/Census/2013%20Census/data-tables/dhb/district-health-board-tables.xls",
                destfile="DHB_demography.xls", mode="wb")
}

The list of medicines supported by Pharmac is available in the Pharmac Schedule:

https://www.pharmac.govt.nz/healthpros/PharmaceuticalSchedule/Schedule

This does not have a downloadable version, but we can reach out and capture the information in the online resource using the rvest package. First we need a custom function that gets the information about the links on a page and returns it as a set of data.

get_schedule_links <- function(link){
  base_url <- "https://www.pharmac.govt.nz"
  page <- read_html(paste0(base_url, link))
  # the URL and visible text of every link on the page
  links <- page %>% html_nodes("a") %>% html_attr('href')
  linktext <- page %>% html_nodes("a") %>% html_text()
  # one row per link, recording which page it was found on
  tibble(links, linktext) %>% mutate(on = link)
}
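
For example, calling it on the schedule's front page gives one row per link found there (this assumes the live site is reachable and structured as described):

get_schedule_links("/healthpros/PharmaceuticalSchedule/Schedule") %>% glimpse()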

The first stage is to get the links from the starting page. Because the site is organised hierarchically, links that are part of the schedule have the term "code" in them. To save effort, this will only run if we have not previously scavenged, and saved, the schedule.

if(!file.exists("pharmac_schedule.csv")){
  layer1 <- get_schedule_links("/healthpros/PharmaceuticalSchedule/Schedule") %>%
    filter(str_detect(links, fixed("code"))) %>%
    #two digit code for each step in the chain
    separate(links, into=c("prepend","specific_code"), sep=-3, remove=FALSE) %>%
    select(-prepend)
}

We can take the collection of links and use the map function to get the links from the page each link leads to. Because the structure is hierarchical, and this is reflected in the way the site is set up, we can identify pages deeper in the schedule: their URL is an extension of the current page's URL.

if(!file.exists("pharmac_schedule.csv")){
  layer2_all <- layer1 %>% 
    rename(link1 = links, linktext1 = linktext,
           on1=on, specific_code1 = specific_code) %>%
    mutate(nextlayer = map(link1,get_schedule_links)) %>% 
    unnest(nextlayer)
  layer2 <-  layer2_all %>%
    filter(str_detect(links, fixed("code"))) %>% 
    separate(links, into=c("prepend","specific_code"), sep=-3, remove=FALSE) %>% 
    # as it is hierarchical, for members in the chain the prepend of the link
    # is the same as the current page; this detects the spider wandering off
    filter(prepend == on) %>%
    select(-prepend)
  # test no final (drug details) links with "osq" yet
  layer2_all %>%
    filter(str_detect(links, fixed("osq"))) %>% glimpse()
  # and there are none
}

To get the next set of links, we repeat the process with the links gathered in the second stage.

if(!file.exists("pharmac_schedule.csv")){
  layer3_all <- layer2 %>% 
    rename(link2 = links, linktext2 = linktext,
           on2=on, specific_code2 = specific_code) %>%
    mutate(nextlayer = map(link2,get_schedule_links)) %>% 
    unnest(nextlayer) 
  layer3 <- layer3_all %>%
    filter(str_detect(links, fixed("code"))) %>%
    separate(links, into=c("prepend","specific_code"), sep=-3, remove=FALSE) %>%
    filter(prepend == on) %>%
    select(-prepend)
  layer3_all %>%
    filter(str_detect(links, fixed("osq"))) %>% glimpse()
  # still 0 detail pages
}

For the final step, the third collection of links is used as the base. The medication pages can be identified by "osq" in their URL. Once gathered, the collective information is saved as pharmac_schedule.csv

if(!file.exists("pharmac_schedule.csv")){
  layer4 <- layer3 %>% 
    rename(link3 = links, linktext3 = linktext,
           on3=on, specific_code3 = specific_code) %>%
    mutate(nextlayer = map(link3,get_schedule_links)) %>% 
    unnest(nextlayer) %>%
    #relevant urls change to include osq
    filter(str_detect(links, fixed("osq")))
  write.csv(layer4, file="pharmac_schedule.csv", row.names = FALSE)
}

The geographic boundaries of the District Health Boards are available from Koordinates (the go-to place for New Zealand geographic data files):

https://koordinates.com/layer/4324-nz-district-health-boards-2012/

Downloading files does need (free) registration. I chose to download the file in .kml format.

Preparing the data

For the demographic data, we read in the Excel file, changing some column names to slightly more convenient ones along the way:

raw_excel <- read_excel("DHB_demography.xls", "Table 1", skip=8) %>%
  rename(description= X__1, pop2013=`Total people__1`)

However, this data has a key problem. The information about each region is in the same column (as subheadings) as the information about which age group the data is for (as entries). We want the region information to accompany each entry, and we only want the totals for each entry. To do this we take advantage of the regular pattern in the data that the headings have no numerical entries beside them.

population <- raw_excel %>% 
  # if there is nothing in the description column, it is a blank line, so removed
  filter(!is.na(description)) %>%
  # create a new column called DHB, if there is nothing in the numeric pop2013 column,
  # put the description in the new column as it is the DHB heading, otherwise
  # leave it blank
  mutate(DHB = ifelse(is.na(pop2013), description, NA)) %>%
  # repeat the heading in the DHB column down for the blank entries below it
  fill(DHB) %>%
  # keep only the rows that have the text "Total people" in the description
  filter(str_detect(description, fixed("Total people"))) %>%
  # need to match DHB across sources, so do some individual fiddling to make the names
  # the same as in the prescription data
  mutate(DHB = case_when(
    DHB == "Midcentral" ~ "MidCentral",
    DHB == "Hutt" ~ "Hutt Valley",
    DHB == "Capital and Coast" ~ "Capital & Coast",
    TRUE ~ DHB
  )) %>%
  select(DHB, pop2013)
# don't need the raw_excel file anymore
rm(raw_excel)

For the data in the Pharmac Schedule, we want those medicines classified as being "Antibacterials" at the second stage of the hierarchical schedule.

antibact <- read.csv("pharmac_schedule.csv", stringsAsFactors = FALSE) %>%
  select(linktext1, linktext2, linktext3, linktext) %>% 
  filter(linktext2 == "Antibacterials")

Taking the prescription data, we combine it with the Pharmac Schedule to get an idea of what the medications were being prescribed for, on the understanding that the entries in the Chemical column of the prescription data match the text in the linktext column in the schedule.

prescriptions <- read.csv("fulldata/Data_ByChemical.csv", stringsAsFactors = FALSE)

combined <- prescriptions %>%
  inner_join(antibact, by=c("Chemical" = "linktext"))

From this data we want the number of prescriptions, regardless of specific medication, for each DHB and year. Then we want to blend in the population demographic data from the 2013 census:

prepared <- combined %>% 
  filter(Type == "Prescriptions") %>%
  group_by(DHB, YearDisp) %>%
  summarise(total_prescriptions = sum(NumPharms)) %>%
  # now there is one entry per DHB and year
  inner_join(population, by = "DHB") %>% 
  #get rid of the country-wide "New Zealand" entry
  filter(!is.na(YearDisp) & DHB != "New Zealand") %>% 
  # divide prescriptions by people to get the per capita
  mutate(per_cap = total_prescriptions / pop2013, 
         Year = YearDisp - 2000, # short form of year to fit on the axis better
         Island = case_when( # Note South Island vs North Island
           DHB == "Southern" ~ "S",
           DHB == "South Canterbury" ~ "S",
           DHB == "Canterbury" ~ "S",
           DHB == "West Coast" ~ "S",
           DHB == "Nelson Marlborough" ~ "S",
           TRUE ~ "N"
         )
) %>% ungroup()

Not the expected story

To check that things were working, I made a graph of the per capita / year pattern for each region:

ggplot(prepared, aes(x=Year, y=per_cap, colour=DHB)) +
  geom_point() + geom_line() + facet_wrap(~DHB, ncol=5) +
  theme_minimal() + ylab("Prescriptions per Capita") + xlab("Year (2012-2016)") + 
  ggtitle("Non-antibacterial prescriptions per capita (2013 census pop.) by DHB") +
  theme(legend.position="none")

Those who know the locations of the District Health Boards from their names will recognize that northerly DHBs are prescribing antibacterials at nearly twice the per capita rate of the west coast of the South Island, and that there is a general south-to-north gradient in prescription rates (for those who don't know the locations, see the map at the end). This is very unexpected.

To measure the unexpectedness of it, I ran a perturbation test between the North and South Island. The logic works like this: if the island a DHB sits on made no difference, then the South Island's collection of rates should look like any random selection of rates, so we can compare reality against many random selections.

This idea is broadly similar to the infer package for R and the idea "there is only one test" http://allendowney.blogspot.co.nz/2016/06/there-is-still-only-one-test.html

In this case we are not testing the size of the difference, just the unlikeliness of the arrangement (and the consistency of that arrangement in time). We can simplify the problem even more by observing that the South Island contributes 25 DHB-and-year rates (5 DHBs across 5 years), so the sum of those 25 rates is a single number that captures both the level and the consistency of the pattern.

Step 1: work out the aggregate "total of rates" for each island:

prepared %>% group_by(Island) %>% summarise(abr= sum(per_cap))
## # A tibble: 2 x 2
##   Island   abr
##   <chr>  <dbl>
## 1 N       75.8
## 2 S       19.5

We now make 1 million random draws of 25 of the rates (each draw representing a hypothetical South Island) and calculate the sum of each draw. Then we see in how many cases the sum of a random draw was as low as (or lower than) reality. Sampling 25 million values in total takes a few moments.

SI_perturb <- replicate(1000000, sum(sample(prepared$per_cap, 25, replace=FALSE)))
sum(SI_perturb <= 19.5)
## [1] 0
min(SI_perturb)
## [1] 20.75628

When I ran the code, in zero of the million trials was a random arrangement as low as the true South Island. The lowest value at random was 20.7562762, which is well above 19.4888. So we can conclude that the geographic arrangement, and its consistency over time, is not random.
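
If a single probability statement is wanted, one common conservative convention (a sketch, not part of the original analysis) is to add one to both the count and the number of draws, giving an upper bound of roughly one in a million here:

(sum(SI_perturb <= 19.5) + 1) / (length(SI_perturb) + 1)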

Making the map

Coming back to making the map, we read in the map file:

DHB_map <- st_read("kx-nz-district-health-boards-2012-KML/nz-district-health-boards-2012.kml", quiet=TRUE)
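
You can quickly inspect the identifiers in the map file to see what the data will need to match:

sort(as.character(DHB_map$Name))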

To match the data to the map, we want the data to have the same heading (in this case "Name") as the identifiers in the map, and to have the entries formatted in the same way. So we do a little tidy-up of our data. We also pick only one year, to avoid complications about how to represent multiple times in one place; since the census was in 2013, this is the most accurate year for population information.

formap <- prepared %>% filter(Year == 13) %>% rename(Name = DHB) %>%
  mutate(Name = case_when(
    Name == "Capital & Coast" ~ "Capital and Coast",
    Name == "Hutt Valley" ~ "Hutt",
    Name == "MidCentral" ~ "Midcentral",
    TRUE ~ Name
  )) %>%
  select(Name, per_cap)

Now we take a moment to think about colour schemes. As the data ranges from 0.6314234 to 1.3002559, thinking in terms of 71 steps of 0.01, with a legend labelling every 7th step (intervals of 0.07), seems reasonable. For each entry, we calculate how far along the spectrum it is, with 1 as the lowest step. Then we build a colour palette with a range we like:

formap <- formap %>%
  mutate(colour_step = round(100 * (per_cap - .63), 0) + 1)
col_palette <- viridis(71, begin = .3, end = .95)

We join the data to the map, having made sure we have an entry for each item we want mapped.

themap <- formap %>% inner_join(DHB_map, by="Name")
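
A quick way to confirm the inner join silently dropped nothing is an anti_join in each direction; both should return zero rows (a sketch; st_set_geometry() drops the geometry so dplyr treats the map as a plain data frame):

formap %>% anti_join(st_set_geometry(DHB_map, NULL), by = "Name")
st_set_geometry(DHB_map, NULL) %>% anti_join(formap, by = "Name")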

We make the map, taking advantage of New Zealand's tall, thin shape to have a large legend.

legend_steps <- rev(c(1,8,15,22,29,36,43,50,57,64,71))
legend_text <- ((legend_steps - 1) / 100) + .63

plot(st_geometry(themap$geometry), lty="blank",
     col = col_palette[themap$colour_step],
     main="Antibacterial perscriptions per capita, 2013")
legend("topleft", legend=legend_text, fill=col_palette[legend_steps], bty="n",
       title="Prescriptions per capita")