This repository presents additional material related to my master thesis, The Impact of COVID-19 Restrictions on Book Consumption. Specifically, it contains the workflow for the data preparation and analysis used in my thesis.
In my thesis, I investigate how COVID-19 restrictions have affected how much people read, consumers' reading speed, the evaluation of books, and the types of books read, and how these effects vary across age groups, genders, types of readers, and nationalities. The expected relationships that have been investigated are shown below:
To investigate the impact of COVID-19 restrictions on book consumption, we use data scraped from the reading community website Goodreads. We collected 18,252,877 book reading records from 112,087 unique Goodreads users that were found via the 31 largest country-specific subgroups on Goodreads. Our dataset covers the consumption of books over a 15-year timeframe, including almost two years after the outbreak of COVID-19.
├── README.md
├── makefile
├── Verweij (2022).pdf
├── .gitignore
├── data
├── gen
| ├── temp
| └── output
└── src
    ├── analysis
    ├── data-collection
    └── data-preparation
Please follow the installation guides on http://tilburgsciencehub.com/ to set up the required software:
- R
- Python
- Make
For Python, make sure you have installed the packages below:
pip install bs4
pip install selenium
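These packages are used by the data collection scripts in `src/data-collection`. As a rough illustration of how selenium and BeautifulSoup work together for scraping a dynamically rendered Goodreads page, the minimal sketch below fetches a single user's "read" shelf and extracts book titles. The shelf URL pattern and CSS selector are assumptions for illustration only and do not necessarily match the scrapers in this repository.

```python
# Illustrative sketch only: fetch one user's "read" shelf and parse book titles.
# The shelf URL and the CSS selector are placeholders, not the exact scraper logic
# used in src/data-collection.
from bs4 import BeautifulSoup
from selenium import webdriver


def scrape_read_shelf(user_id):
    driver = webdriver.Chrome()  # requires a local Chrome/ChromeDriver setup
    try:
        # Goodreads renders parts of this page dynamically, which is why
        # selenium is used instead of plain HTTP requests.
        driver.get(f"https://www.goodreads.com/review/list/{user_id}?shelf=read")
        soup = BeautifulSoup(driver.page_source, "html.parser")
        # "td.field.title a" is a placeholder selector for the title column.
        return [a.get_text(strip=True) for a in soup.select("td.field.title a")]
    finally:
        driver.quit()


if __name__ == "__main__":
    print(scrape_read_shelf("12345"))  # hypothetical user id
```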
For R, make sure you have installed the packages below:
install.packages("tidyverse")
install.packages("googledrive")
install.packages("data.table")
install.packages("readxl")
Follow the instructions below to clone the repository, change into its directory, and run the workflow:
git clone https://github.com/[your username]/covid-19-book-consumption.git
cd yourpath/covid-19-book-consumption
make
Note: The workflow above does not include the data collection steps or the combination of the scraped data files. The reason for this is twofold. First, the data collection steps take about 3.5 months to run completely, so it would not be efficient to include them in the reproduction workflow. Second, since the source code of Goodreads is dynamic rather than static, the data scraper had to be slightly adjusted several times during the process. The scraping was therefore split into smaller chunks so that problems could be detected early and the program could be adjusted in time. As a result, the scraping software produced multiple separate files that were later combined into larger files.
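The combination step itself is conceptually simple. As a minimal sketch (assuming the separate scraper outputs are CSV files that share the same header; the file pattern and output path below are hypothetical), the chunks could be concatenated like this:

```python
# Minimal sketch: concatenate separately scraped CSV chunks into one larger file.
# Assumes all chunks share the same header; paths and file pattern are hypothetical.
import csv
import glob


def combine_chunks(pattern="data/scraped_chunk_*.csv",
                   out_path="data/reading_records.csv"):
    chunk_paths = sorted(glob.glob(pattern))
    with open(out_path, "w", newline="", encoding="utf-8") as out_file:
        writer = None
        for path in chunk_paths:
            with open(path, newline="", encoding="utf-8") as chunk:
                reader = csv.reader(chunk)
                header = next(reader)
                if writer is None:
                    writer = csv.writer(out_file)
                    writer.writerow(header)  # write the header only once
                writer.writerows(reader)


if __name__ == "__main__":
    combine_chunks()
```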
Below is an overview of the order in which these programs were run: