tomwhite / covid-19-uk-data

Coronavirus (COVID-19) UK Historical Data
http://tom-e-white.com/covid-19-uk-data/
The Unlicense

COVID-19 UK Historical Data

:warning: Update: 1 August 2020. This repository is deprecated and is no longer updated. Users are encouraged to move to the official upstream data sources, which are listed below. :warning:

Data on numbers of tests, confirmed cases, and deaths for coronavirus (COVID-19) in the UK is published by the government, but it is fragmented and not always provided in consistent or machine-friendly formats. Also, in many cases only the latest numbers are available so it's not possible to look at changes over time.

This site collates the historical data and provides it in an easily consumable format (CSV), in both wide and tidy data forms.
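
To illustrate the difference between the two forms, here is a minimal sketch that reshapes a wide row (one column per indicator) into tidy rows (one row per date, country and indicator). The column names Tests, ConfirmedCases, Deaths and Value, and the figures themselves, are assumptions made for illustration and may not match the published files exactly.

```python
# Sketch only: column names and figures below are illustrative assumptions.

def wide_to_tidy(rows, id_columns=("Date", "Country")):
    """Turn wide rows (one column per indicator) into tidy rows
    (one row per date, country and indicator)."""
    tidy = []
    for row in rows:
        ids = {name: row[name] for name in id_columns}
        for column, value in row.items():
            if column in id_columns:
                continue  # identifier columns are copied, not unpivoted
            tidy.append({**ids, "Indicator": column, "Value": value})
    return tidy


# Made-up example values, not real counts.
wide_rows = [
    {"Date": "2020-03-01", "Country": "Wales",
     "Tests": 100, "ConfirmedCases": 10, "Deaths": 1},
]

for tidy_row in wide_to_tidy(wide_rows):
    print(tidy_row)
```

The wide form is convenient for spreadsheets, while the tidy form is usually easier to filter, group and load into a database.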

Ideally the data publishers will start doing this so this site becomes redundant.

Data files

The following CSV files are available (note they are no longer updated):

Interpreting the numbers (more information on this DHSC/PHE page, and the PHE dashboard about page)

Note that the totals for the UK don't necessarily equal the sum of the totals of the four nations (England, Scotland, Wales, Northern Ireland), due to differences in date reported.

You can use these files without reading the rest of this document.

There is an experimental Datasette instance hosting the data. This is useful for running simple SQL on the data, or exporting in JSON format.
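
As a rough idea of the kind of query this allows, the sketch below runs similar SQL against a local in-memory sqlite database and prints the result as JSON. The table name indicators and the key columns Date, Country and Indicator follow the csvs-to-sqlite commands in the Tools section; the Value column and the sample figures are assumptions for illustration only.

```python
import json
import sqlite3

# In-memory stand-in for the hosted data; values are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row
conn.execute(
    "CREATE TABLE indicators (Date TEXT, Country TEXT, Indicator TEXT, "
    "Value INTEGER, PRIMARY KEY (Date, Country, Indicator))"
)
conn.executemany(
    "INSERT INTO indicators VALUES (?, ?, ?, ?)",
    [
        ("2020-03-01", "Wales", "ConfirmedCases", 10),
        ("2020-03-02", "Wales", "ConfirmedCases", 15),
        ("2020-03-02", "Scotland", "ConfirmedCases", 20),
    ],
)


def latest_values(connection, indicator):
    """Return the most recent value of one indicator for each country."""
    query = """
        SELECT i.Country AS Country, i.Date AS Date, i.Value AS Value
        FROM indicators AS i
        WHERE i.Indicator = ?
          AND i.Date = (SELECT MAX(Date) FROM indicators
                        WHERE Country = i.Country
                          AND Indicator = i.Indicator)
        ORDER BY i.Country
    """
    return [dict(row) for row in connection.execute(query, (indicator,))]


print(json.dumps(latest_values(conn, "ConfirmedCases"), indent=2))
```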

News

Data sources

The following sources may include more data than described here. This summary includes only Tests, Confirmed cases and Deaths.

UK

England

Scotland

Wales

Northern Ireland

Local Authority and Health Board metadata

Related projects/datasets

Wishlist

Here are my suggestions for how to improve the data being published by public bodies.

The short version: publish everything in CSV format, and include historical data!

The reporting systems have changed a lot since the outbreak began, and overall they have improved, both in the amount of information being published, and the ease of access of machine-readable datasets. (Public Health Scotland provides all their data in XLSX and CSV format, including historical data. Public Health Wales provides an XLSX spreadsheet with historical data.)

Tools

There are command line tools for downloading, parsing, and processing the data. They rely on Python 3.

To install the tools, create a virtual environment, activate it, then install the required packages:

python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt

Daily workflow

A sqlite DB is now used to store and aggregate intermediate data. The CSV files remain the point of record.

The crawl tool checks whether the resource (webpage or data file) has already been downloaded. If it hasn't, the tool downloads it, provided it is available for the specified date (today); if it is not available, the tool exits. Once the resource is available, the tool extracts the relevant information from it and updates the sqlite database. This means that you can just rerun crawl until it finds new updates.
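
The download-once behaviour can be sketched roughly as follows. This is not the code in crawl.py: the function name fetch_once, the cache file naming scheme and the download callable are all hypothetical, and a fake downloader stands in for the real HTTP request.

```python
import tempfile
from pathlib import Path


def fetch_once(cache_dir, date, source, download):
    """Return the cached file for (date, source), downloading only if missing.

    `download` is a hypothetical callable returning the resource bytes,
    or None when nothing has been published for that date yet.
    """
    cache_dir = Path(cache_dir)
    cached = cache_dir / f"{source}-{date}.html"  # assumed naming scheme
    if cached.exists():
        return cached  # already downloaded: no second request
    content = download(date, source)
    if content is None:
        return None  # not available yet: the caller can retry later
    cache_dir.mkdir(parents=True, exist_ok=True)
    cached.write_bytes(content)
    return cached


# Demonstration with a fake downloader instead of a real HTTP request.
calls = []


def fake_download(date, source):
    calls.append((date, source))
    return b"<html>example</html>" if date == "2020-03-02" else None


with tempfile.TemporaryDirectory() as tmp:
    print(fetch_once(tmp, "2020-03-01", "Wales", fake_download))  # nothing yet
    first = fetch_once(tmp, "2020-03-02", "Wales", fake_download)
    second = fetch_once(tmp, "2020-03-02", "Wales", fake_download)
    print(first == second, len(calls))  # second call is served from the cache
```

Because a cached file short-circuits the download, running the function repeatedly is cheap, which is what makes polling for updates practical.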

The convert_sqlite_to_csvs tool will extract the data from sqlite and update the CSV files.
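
In outline, exporting a table from sqlite to CSV looks something like the sketch below. It is not the actual convert_sqlite_to_csvs implementation; the function name, the Value column and the sample figures are assumptions for illustration.

```python
import csv
import io
import sqlite3


def export_query_to_csv(connection, query, out_file):
    """Write the result of `query` to `out_file` as CSV, header row first."""
    cursor = connection.execute(query)
    writer = csv.writer(out_file, lineterminator="\n")
    # cursor.description holds one tuple per result column; item 0 is its name.
    writer.writerow([column[0] for column in cursor.description])
    writer.writerows(cursor)


# Made-up rows in an in-memory database, for demonstration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE indicators (Date TEXT, Country TEXT, Indicator TEXT, "
    "Value INTEGER, PRIMARY KEY (Date, Country, Indicator))"
)
conn.executemany(
    "INSERT INTO indicators VALUES (?, ?, ?, ?)",
    [("2020-03-01", "Wales", "Tests", 100), ("2020-03-01", "Wales", "Deaths", 1)],
)

buffer = io.StringIO()
export_query_to_csv(
    conn,
    "SELECT Date, Country, Indicator, Value FROM indicators "
    "ORDER BY Date, Country, Indicator",
    buffer,
)
print(buffer.getvalue())
```

Sorting explicitly in the query keeps the CSV output stable between runs, which makes git diffs of the data files easier to read.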

The update tool runs crawl then convert_sqlite_to_csvs, and prompts interactively to ask whether you want to commit the changes to git.

There is also a crawl_all tool (and corresponding update_all tool) that uses machine-readable sources to update all historical data for a given source. This is not available for all sources yet.

./tools/update_all.sh phw
./tools/update_all.sh phs
./tools/update.sh NI
./tools/update.sh UK
./tools/update_all.sh phe

The equivalent done manually (just for Wales):

DATE=$(date +'%Y-%m-%d')
./tools/crawl.py $DATE Wales
./tools/convert_sqlite_to_csvs.py
git add data/; git commit -am "Update for $DATE for Wales"

NI updates are being done manually since there are currently no machine-readable sources.

# edit covid-19-totals-northern-ireland.csv and add tests/cases/deaths
./tools/convert_totals_to_indicators.py
csvs-to-sqlite --replace-tables -t indicators -pk Date -pk Country -pk Indicator data/covid-19-indicators-uk.csv data/covid-19-uk.db
./tools/convert_sqlite_to_csvs.py
git commit -a # "Update for xxx for NI from https://twitter.com/healthdpt"

Updates are not always made at a consistent time of day, so the following command can be run continuously in a terminal to check for updates every 10 minutes. The -b option makes it beep if there is a new update.

watch -n 600 -b ./tools/crawl.py

Check data consistency

./tools/check_indicators.py
./tools/check_totals.py
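
The checks these scripts perform are not described here. As one example of the kind of consistency check that could apply to cumulative counts, the sketch below flags dates where a running total goes down. The function name and the sample series are hypothetical.

```python
def find_decreases(rows):
    """Return (date, previous, current) for each date where a cumulative
    count is lower than on the preceding date."""
    problems = []
    previous = None
    for date, value in sorted(rows):  # ISO dates sort chronologically
        if previous is not None and value < previous:
            problems.append((date, previous, value))
        previous = value
    return problems


# Made-up series with a deliberate drop on the third day.
series = [("2020-03-01", 10), ("2020-03-02", 15), ("2020-03-03", 12)]
print(find_decreases(series))
```

A flagged date is not necessarily an error, since publishers sometimes revise earlier figures, but it is a useful prompt to look at the source again.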

Manual overrides

Sometimes it's necessary to fix data by hand. In this case the following tools are useful:

Repopulate the sqlite database from the CSV files:

rm data/covid-19-uk.db
csvs-to-sqlite --replace-tables -t indicators -pk Date -pk Country -pk Indicator data/covid-19-indicators-uk.csv data/covid-19-uk.db
csvs-to-sqlite --replace-tables -t cases -pk Date -pk Country -pk AreaCode -pk Area data/covid-19-cases-uk.csv data/covid-19-uk.db
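
For readers without csvs-to-sqlite installed, the same idea can be approximated with the Python standard library. This is a simplified sketch, not a drop-in replacement: the function name is hypothetical, the sample CSV content is made up, and unlike csvs-to-sqlite it does not infer column types, so every value is stored as text.

```python
import csv
import io
import sqlite3


def load_csv_into_table(connection, csv_file, table, primary_key):
    """Recreate `table` from a CSV file object, with a composite primary key."""
    reader = csv.DictReader(csv_file)
    columns = reader.fieldnames
    column_sql = ", ".join(f'"{name}"' for name in columns)
    key_sql = ", ".join(f'"{name}"' for name in primary_key)
    # Dropping first mirrors the effect of --replace-tables.
    connection.execute(f'DROP TABLE IF EXISTS "{table}"')
    connection.execute(
        f'CREATE TABLE "{table}" ({column_sql}, PRIMARY KEY ({key_sql}))'
    )
    placeholders = ", ".join("?" for _ in columns)
    connection.executemany(
        f'INSERT INTO "{table}" VALUES ({placeholders})',
        ([row[name] for name in columns] for row in reader),
    )
    connection.commit()


# Made-up CSV content; the Value column is an assumption.
sample_csv = io.StringIO(
    "Date,Country,Indicator,Value\n"
    "2020-03-01,Wales,Tests,100\n"
    "2020-03-01,Wales,Deaths,1\n"
)
conn = sqlite3.connect(":memory:")
load_csv_into_table(conn, sample_csv, "indicators", ["Date", "Country", "Indicator"])
print(conn.execute("SELECT * FROM indicators ORDER BY Indicator").fetchall())
```

The composite primary key means a duplicated (Date, Country, Indicator) row in the CSV fails on load rather than silently producing two conflicting values.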