This repository presents additional material related to my master thesis, The Impact of COVID-19 Restrictions on Book Consumption. Specifically, it contains the workflow for the data preparation and analysis used in my thesis.
In my thesis, I investigate how COVID-19 restrictions have affected how much people read, consumers' reading speed, the evaluation of books, and the types of books read, and how these effects vary across age groups, genders, types of readers, and nationalities. The expected relationships that have been investigated are shown below:
To investigate the impact of COVID-19 restrictions on book consumption, we use data scraped from the reading community website Goodreads. We collected 18,252,877 book reading records from 112,087 unique Goodreads users that were found via the 31 largest country-specific subgroups on Goodreads. Our dataset covers the consumption of books over a 15-year timeframe, including almost two years after the outbreak of COVID-19.
├── README.md
├── makefile
├── Verweij (2022).pdf
├── .gitignore
├── data
├── gen
| ├── temp
| └── output
└── src
    ├── analysis
    ├── data-collection
    └── data-preparation
Please follow the installation guides on http://tilburgsciencehub.com/ to set up the required software:
- R
- Python
- Make
For Python, make sure you have installed the packages below:
pip install bs4
pip install selenium
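These packages are used by the data collection scripts in `src/data-collection`. As a rough illustration of how selenium and BeautifulSoup work together for scraping a dynamically rendered Goodreads page, the minimal sketch below fetches a single user's "read" shelf and extracts book titles. The shelf URL pattern and CSS selector are assumptions for illustration only and do not necessarily match the scrapers in this repository.

```python
# Illustrative sketch only: fetch one user's "read" shelf and parse book titles.
# The shelf URL and the CSS selector are placeholders, not the exact scraper logic
# used in src/data-collection.
from bs4 import BeautifulSoup
from selenium import webdriver


def scrape_read_shelf(user_id):
    driver = webdriver.Chrome()  # requires a local Chrome/ChromeDriver setup
    try:
        # Goodreads renders parts of this page dynamically, which is why
        # selenium is used instead of plain HTTP requests.
        driver.get(f"https://www.goodreads.com/review/list/{user_id}?shelf=read")
        soup = BeautifulSoup(driver.page_source, "html.parser")
        # "td.field.title a" is a placeholder selector for the title column.
        return [a.get_text(strip=True) for a in soup.select("td.field.title a")]
    finally:
        driver.quit()


if __name__ == "__main__":
    print(scrape_read_shelf("12345"))  # hypothetical user id
```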
For R, make sure you have installed the packages below:
install.packages("tidyverse")
install.packages("googledrive")
install.packages("data.table")
install.packages("readxl")
Follow the instructions below to clone the repository, change into its directory, and run the workflow:
git clone https://github.com/[your username]/covid-19-book-consumption.git
cd yourpath/covid-19-book-consumption
make
Note: The workflow above does not include the data collection steps or the combination of the scraped data files. The reason for this is twofold. First, the data collection steps take about 3.5 months to run completely, so it would not be efficient to include them in the reproduction workflow. Second, since the source code of Goodreads is dynamic rather than static, the data scraper had to be slightly adjusted several times during the process. The scraping was therefore split into smaller chunks so that problems could be detected early and the program could be adjusted in time. As a result, the scraping software produced multiple separate files that were later combined into larger files.
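The combination step itself is conceptually simple. As a minimal sketch (assuming the separate scraper outputs are CSV files that share the same header; the file pattern and output path below are hypothetical), the chunks could be concatenated like this:

```python
# Minimal sketch: concatenate separately scraped CSV chunks into one larger file.
# Assumes all chunks share the same header; paths and file pattern are hypothetical.
import csv
import glob


def combine_chunks(pattern="data/scraped_chunk_*.csv",
                   out_path="data/reading_records.csv"):
    chunk_paths = sorted(glob.glob(pattern))
    with open(out_path, "w", newline="", encoding="utf-8") as out_file:
        writer = None
        for path in chunk_paths:
            with open(path, newline="", encoding="utf-8") as chunk:
                reader = csv.reader(chunk)
                header = next(reader)
                if writer is None:
                    writer = csv.writer(out_file)
                    writer.writerow(header)  # write the header only once
                writer.writerows(reader)


if __name__ == "__main__":
    combine_chunks()
```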
Below is an overview of the order in which these programs were run: