
Hansard project

This repository contains all materials relating to the Digitization of the Australian Parliamentary Debates (1998-2022). The most recent version of our database is available for download on Zenodo.

Workflow

To produce the most recently published version of our dataset, we used the workflow described below. All of the R scripts involved can be found in our code folder.
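As a rough, hypothetical sketch only (these script names are illustrative, not the actual file names; see the code folder for the real scripts), the workflow amounts to running a sequence of R scripts that fetch each sitting day's source file from the Hansard website, parse and clean it, and export the daily CSV and Parquet files:

# Hypothetical outline of the pipeline; the script names are illustrative
source("code/01-download.R")  # fetch each sitting day's source file from aph.gov.au
source("code/02-parse.R")     # parse the raw files into tidy per-day dataframes
source("code/03-clean.R")     # standardise names and add flags and identifiers
source("code/04-export.R")    # write the daily CSV and Parquet files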

Example Code

This is an example of how to load one file from our dataset into R, shown for the CSV and Parquet file formats.

library(tidyverse)
library(arrow)

# CSV: specify each column's type explicitly so flags and categorical variables parse correctly
hansard_csv <- readr::read_csv("hansard-daily-csv/2000-06-05.csv", 
                               col_types = list(name = col_character(),
                                                order = col_double(),
                                                speech_no = col_double(),
                                                page.no = col_double(),
                                                time.stamp = col_character(),
                                                name.id = col_character(),
                                                electorate = col_character(),
                                                party = col_factor(),
                                                in.gov = col_double(),
                                                first.speech = col_double(),
                                                body = col_character(),
                                                fedchamb_flag = col_factor(),
                                                question = col_factor(),
                                                answer = col_factor(),
                                                q_in_writing = col_factor(),
                                                div_flag = col_factor(),
                                                gender = col_factor(),
                                                uniqueID = col_character(),
                                                interject = col_factor(),
                                                partyfacts_id = col_double()))

# Parquet: column types are stored in the file itself, so no specification is needed
hansard_parquet <- arrow::read_parquet("hansard-daily-parquet/2000-06-05.parquet")
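If you would rather query all of the daily Parquet files at once without loading each one into memory, arrow can also treat the folder as a single dataset. A minimal sketch, assuming the directory layout shown above (the party value "ALP" is only an illustrative filter):

# Open the folder of daily Parquet files lazily; nothing is read until collect()
hansard_ds <- arrow::open_dataset("hansard-daily-parquet/", format = "parquet")

hansard_ds |> 
  filter(party == "ALP") |>  # the filter is pushed down to the file scan
  collect()                  # materialise the matching rows as a tibble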

The following code shows how to read in the full corpus in Parquet format, filter for particular dates of interest (in this case, all available Hansard data from the 1990s, which in our database means 1998 and 1999), and then split each sitting day's data into a separate tibble, stored together in a list.

hansard_corpus <- arrow::read_parquet("hansard-corpus/hansard_corpus_1998_to_2022.parquet")

hansard_1990s <- hansard_corpus |> 
  filter(str_detect(date, "^199")) |>  
  group_split(date)
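Each element of hansard_1990s is now one sitting day's tibble. As a small sketch of working with that list (purrr loads with the tidyverse; the row count is just an illustration):

# Name each tibble by its sitting date, then count the rows per day
hansard_1990s |> 
  set_names(map_chr(hansard_1990s, ~ as.character(.x$date[1]))) |> 
  map_int(nrow)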

If you wish to filter out stage directions, you can do so with the following code, which also updates the order variable to reflect the new ordering of the filtered dataframe.

# drop procedural rows, then renumber the order column to match the filtered data
hansard_csv |> 
  filter(name != "stage direction" & name != "business start") |> 
  select(-order) |> 
  rowid_to_column("order")
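The same filter can be applied to every sitting day in a list such as hansard_1990s from the earlier example; a sketch using purrr's map:

# Apply the stage-direction filter to each day's tibble and renumber order
hansard_1990s_clean <- map(
  hansard_1990s,
  \(day) day |> 
    filter(name != "stage direction" & name != "business start") |> 
    select(-order) |> 
    rowid_to_column("order")
)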

Below is an example of how to merge debate topics with the Hansard dataframe.

First, read in the all_debate_topics file, and then filter for the sitting day of interest. This corresponds to the date of the Hansard file we already read in, which is from 2000-06-05.

We then group the topics dataframe by page number and summarise the title variable into a list-column. We do this because multiple debate titles often share the same page number, multiple rows of Hansard proceedings also share the same page number, and there is no straightforward way of knowing exactly which row of the Hansard data corresponds to which debate title.

Finally, we ungroup the data and right join it onto the Hansard dataframe by page number. The multiple = "all" argument allows each row in topics to match multiple rows of the Hansard data; in other words, since multiple rows of the Hansard data share a page number, they will all join to the same row of the topics data. The same approach works with the full corpus.

topics <- arrow::read_parquet("hansard-supplementary-data/all_debate_topics.parquet")

topics |> 
  filter(date == "2000-06-05") |> 
  group_by(page.no) |> 
  summarise(title = list(title)) |> 
  ungroup() |> 
  right_join(hansard_parquet, by = "page.no", multiple = "all")
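Because title is now a list-column, you may want one row per proceeding-title pair. A sketch, assuming the joined result above has been assigned to merged (a hypothetical name); tidyr's unnest loads with the tidyverse:

# Expand the title list-column so each debate title gets its own row;
# keep_empty = TRUE retains proceedings with no matching topic
merged |> 
  unnest(title, keep_empty = TRUE)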

URLs folder

Accessing original data

As at July 2023, the Hansard website can be navigated as follows to obtain the datasets we need. Begin by going to https://www.aph.gov.au/Parliamentary_Business/Hansard. Then select "House Hansard" from the menu on the right. Click "back to 1901." Each day's content is grouped within decades, which can be navigated using the menu on the left.