The UCLA Law Covid-19 Behind Bars Data Project, launched in March 2020, tracks the spread and impact of COVID-19 in American carceral facilities and pushes for greater transparency and accountability around the carceral system's pandemic response.
Part of this project includes scraping state DOC websites, the federal BOP website, and county jail websites. For each of these websites, we pull data using a unique scraper.
You can find each of our scrapers in the folder `production/scrapers`. More detailed documentation can be found here for each of our scrapers. If you would like to recreate the documentation with the latest information, you can run the function `document_all_scrapers()`.
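As a rough sketch (the `source()` path below is a placeholder, since the function's location is not spelled out here), regenerating the documentation might look like:

```r
# Sketch: regenerate the per-scraper documentation.
# The source() path is a placeholder; point it at wherever
# document_all_scrapers() is defined in this repository.
source("R/document_all_scrapers.R")  # hypothetical path
document_all_scrapers()
```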
To run these scrapers, you will need to install the libraries listed at the top of the file `R/generic_scraper.R`, as well as our team's own library, `behindbarstools`. Individual scrapers may require additional libraries, which are listed in the individual scraper files themselves through explicit library calls.
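For example, installing `behindbarstools` from GitHub might look like the sketch below; the `uclalawcovid19behindbars/behindbarstools` repository path is our assumption, so check it against the team's GitHub organization.

```r
# Sketch: install the team's helper package from GitHub.
# The GitHub org/repo path is assumed; verify before running.
install.packages("remotes")
remotes::install_github("uclalawcovid19behindbars/behindbarstools")
```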
In addition, you will need:

- `tabulizer` and its dependencies. To install, run `remotes::install_github(c("ropensci/tabulizerjars", "ropensci/tabulizer"))`.
- `pandoc` on your machine. To install, run `brew install pandoc`.
- API keys for Extractable (`EXTRACTABLE_API_KEY`) and PERMACC (`PERMACC_API_KEY`); see the sketch after the SSH config entry below for one way to set these.

You will also need to add an entry like the following to your `.ssh/config`:
host {{hostname}}
HostName {{address}}
IdentityFile {{~/.ssh/some_private_key}}
User {{username}}
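One way to make the API keys visible to R is to set them as environment variables; the sketch below does this for a single session with `Sys.setenv()` using placeholder values (in practice you may prefer to put them in a `.Renviron` file so they persist across sessions).

```r
# Sketch: expose the API keys to R for the current session.
# The values are placeholders; substitute your real keys, or define
# these variables in ~/.Renviron so they persist across sessions.
Sys.setenv(
  EXTRACTABLE_API_KEY = "your-extractable-key",
  PERMACC_API_KEY = "your-permacc-key"
)
```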
This project stores its data in a submodule. When you first clone this project, you get the `data` directory, which should contain the data submodule but none of the files within it yet. Run `git submodule init` to initialize your local configuration file, and then `git submodule update` to fetch all the data. The following steps should be completed in order to ensure the scrapers run properly.
# Pull latest in the main repository
cd covid19_behind_bars_scrapers
git checkout master
git pull origin master
# Pull latest in the data submodule
cd data
git pull origin master
# Return to the main repository
cd ..
Next, run `production/pre_run.R`, which will check whether you have the appropriate packages and install any that are missing, install the latest version of our team's package `behindbarstools`, and check that you have enough Extractable credits to run the scrapers.
Rscript production/pre_run.R
Some scrapers rely on a Selenium Docker image, which mounts the local directory `/tmp/sel_dl`. This allows files downloaded through the Docker Selenium image to also appear on the host system. You can start the image by running:
mkdir /tmp/sel_dl
docker run -d -p 4445:4444 -p 5901:5900 -v /tmp/sel_dl:/home/seluser/Downloads \
selenium/standalone-firefox:latest
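Once the container is up, a scraper can reach the Selenium server on the mapped port 4445. The snippet below is a minimal sketch of connecting with the RSelenium package, not necessarily how any individual scraper is written.

```r
# Sketch: connect to the dockerized Selenium server started above.
# Host port 4445 maps to the container's Selenium port 4444.
library(RSelenium)

remDr <- remoteDriver(
  remoteServerAddr = "localhost",
  port = 4445L,
  browserName = "firefox"
)
remDr$open()
remDr$navigate("https://example.com")  # placeholder URL
# ... scrape here; files downloaded in the browser land in /tmp/sel_dl ...
remDr$close()
```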
STOP! DID YOU UPDATE THE MANUAL SCRAPER DATA YET?: Visit the manual data Google Sheet and update the sad scrapers for which we must use our own eyes.
Next, navigate to the directory where `covid19_behind_bars_scrapers.Rproj` lives, and run the following command:
Rscript production/main.R
Side Note 1: If we only want to save a record of the raw COVID data and make a carbon copy of the websites hosting COVID data for that day, without extracting the information, we can run a limited version of the scraper which only pulls and saves raw data, as shown below.
Rscript production/main.R --raw_only
Side Note 2: Sometimes a scraper will run but be unable to extract a particular value. When this happens, we occasionally want to manually change the value for a particular facility's column after the extraction has occurred. To do this, go to the scraper's file and run the individual scraper using the code at the bottom of the file. After running that scraper's `extract_from_raw` method, select the column name and facility for which you would like to manually change the data using the following method. Doing so will log the occurrence of the manual change and keep a record of all the changes we are making by hand. Note that this method will only allow you to change data which is stored in the `extract_data` slot.
scraper$manual_change(
column = "Some.Column", facility_name = "Facility name here", new_value = 9)
After calling the `manual_change` method, you will need to re-run the `validate_extract` and `save_extract` methods to update the extracted data. Updates made with the `manual_change` method are the only updates that should NOT be committed; all other code updates for a particular scraper run should be reflected in the commit for that day.
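Putting those pieces together, a hand-correction session might look like the sketch below. The `scraper` object, column name, facility name, and value are placeholders; the setup steps through `extract_from_raw` should come from the code at the bottom of the individual scraper file.

```r
# Sketch: manually override one extracted value for a single scraper.
# `scraper` stands in for the object built by the individual scraper file;
# run that file's own setup code up through extract_from_raw().
scraper$extract_from_raw()

# Override one value; the change is logged so we keep a record of hand edits.
scraper$manual_change(
    column = "Residents.Confirmed", facility_name = "Example State Prison",
    new_value = 9)

# Re-run validation and saving so the corrected extract is written out.
scraper$validate_extract()
scraper$save_extract()
```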
Run the post-run script:
Rscript production/post_run.R
Inspect diagnostics: Look at the diagnostics file, and stop if there's anything crazy happening!
Commit changes: Be sure to commit your changes to the master branch of both the `covid19_behind_bars_scrapers` repo and the `data` submodule. Note that this will require two commits.
# check the differences between the new data and the old google sheet
# if things look good then commit all changes
# FIRST commit changes in data submodule
cd data
git add -A
git commit -m "update: the/current/date run of scraper"
git push origin master
# NEXT commit changes in scraper code repo
cd ..
git add -A
git commit -m "update: the/current/date run of scraper"
git push origin master