google-research / open-covid-19-data

Open source aggregation pipeline for public COVID-19 data, including hospitalization/ICU/ventilator numbers for many countries.
Apache License 2.0

Note: This project is not currently being updated. Also, it is distinct from Google's COVID-19 Open-Data project, which has similar goals, and is being actively developed and maintained by a team spanning Google Health, Google Research, Google Cloud, Google Maps, etc.

Open COVID-19 Data

Google Research's Open COVID-19 Data project is an open source pipeline that aggregates public COVID-19 data sources into a single dataset. The dataset includes time series for COVID-19 cases, deaths, tests, hospitalizations, discharges, intensive care unit (ICU) cases, ventilator cases, and government interventions, as well as Google's Community Mobility Reports and Search Trends symptoms dataset.

About

COVID-19 data is published by many distinct sources in highly heterogeneous formats. The goal of this pipeline is to accept data in many different formats and to process it into a standardized, consistent schema. A consistent schema lets researchers build models quickly, and the pipeline is designed so that engineers can add new data sources quickly.

The pipeline supports three ways of ingesting data:

For each data source, this repository has a configuration file located in src/config/sources that specifies how the pipeline should map the original data into our schema. Raw data is fetched from the data source and written into a directory within data/inputs. Exported data that has been transformed into our schema is found in the data/exports directory.

Using the data

Latest data

If you just want to use the latest data for models, visualizations, or research, we provide aggregated data files under different licenses. This lets you choose data whose license is acceptable for your use case, while respecting the original licenses of the data sources.

Attributions and Licenses

Please see the Data Sources section of this README to note the attributions and licenses for each source.

Data Schema

Locations

Every location is assigned an open_covid_region_code, which is a unique hierarchical location code that can be used to join data across tables in this repository. The full list of locations that are assigned an open_covid_region_code can be found at data/exports/locations/locations.csv. Where available, we also provide a datacommons_id and wikidata_id field for each location.
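
As a minimal sketch of such a join, assuming pandas is installed: the two small tables below are invented for illustration, and the region codes and every column name other than open_covid_region_code are hypothetical. In practice the locations table would be read from data/exports/locations/locations.csv.

```python
import pandas as pd

# Hypothetical stand-in for data/exports/locations/locations.csv.
# Only open_covid_region_code is a column name confirmed by this README;
# region_name and the code values are made up for illustration.
locations = pd.DataFrame({
    "open_covid_region_code": ["AAA", "BBB"],
    "region_name": ["Region A", "Region B"],
})

# Hypothetical time series table keyed by the same region code.
hospitalizations = pd.DataFrame({
    "open_covid_region_code": ["AAA", "AAA", "BBB"],
    "date": ["2020-08-14", "2020-08-15", "2020-08-15"],
    "hospitalized_current": [10, 12, 7],
})

# Left join keeps every time series row and attaches location metadata.
# validate="many_to_one" raises if a region code is duplicated in locations.
joined = hospitalizations.merge(
    locations,
    on="open_covid_region_code",
    how="left",
    validate="many_to_one",
)
print(joined)
```

A left join is used here so that time series rows are never dropped if a location is missing from the lookup table; the missing metadata would simply show up as empty values.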

Each open_covid_region_code has up to three levels:

Dates

All dates are mapped to ISO 8601 format during data loading, e.g. 2020-08-15.
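
The kind of mapping described above can be sketched with the Python standard library. This is an illustration only: the helper name to_iso_date and the two source formats are assumptions, and the pipeline's actual date handling may differ.

```python
from datetime import datetime


def to_iso_date(raw: str, source_format: str) -> str:
    """Parse a date string in a known source format and return it as YYYY-MM-DD."""
    return datetime.strptime(raw, source_format).date().isoformat()


# Two hypothetical source formats that map to the same ISO 8601 date.
print(to_iso_date("15/08/2020", "%d/%m/%Y"))  # 2020-08-15
print(to_iso_date("08.15.2020", "%m.%d.%Y"))  # 2020-08-15
```

Passing the source format explicitly avoids guessing between day-first and month-first conventions, which differ between data sources.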

For Data Owners

We have carefully checked the license and attribution information on each data source included in this repository, and in many cases have contacted the data owners directly to ask how they would like to be attributed.

If you are the owner of a data source included here and would like us to remove data, add or alter an attribution, or add or alter license information, please do not hesitate to email us at open-covid-19-data@google.com and we will happily consider your request.

Development

If you would like to run the pipeline locally or to contribute to the codebase, here are instructions for installation and adding new data sources.

Installation

To install Python dependencies:

pip install pandas xlrd pyyaml python3-wget

Usage

To run the entire pipeline on the data in data/inputs, run the main script:

python src/scripts/export_data.py

In addition, there are two scripts that can be run to fetch new data and write it into data/inputs.

To fetch data that can be automatically downloaded:

python src/scripts/fetch_automatic_downloads.py

To fetch data from a spreadsheet in data/inputs/scraped/spreadsheets/:

python src/scripts/fetch_scraped_data.py

Pipeline Structure

The pipeline is structured so that raw data is always fetched into data/inputs before being consumed by the rest of the pipeline. Data sources for each data type are then loaded into pandas dataframes with a standardized schema for dates, locations, and columns. These dataframes are joined into a single dataframe, which is then exported.
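
The load, standardize, join, and export flow can be sketched roughly as follows, assuming pandas is installed. This is not the repository's implementation: the function names, the inline sample data, and all column names other than open_covid_region_code are hypothetical.

```python
import io

import pandas as pd

# Join keys shared by every standardized dataframe (an assumption of this sketch).
KEYS = ["open_covid_region_code", "date"]


def standardize(raw_csv: str, column_map: dict, date_format: str) -> pd.DataFrame:
    """Load one raw source and map it onto the shared schema."""
    df = pd.read_csv(io.StringIO(raw_csv))
    # Rename source columns to schema names and keep only the mapped ones.
    df = df.rename(columns=column_map)[list(column_map.values())]
    # Normalize dates to ISO 8601 strings (YYYY-MM-DD).
    df["date"] = pd.to_datetime(df["date"], format=date_format).dt.strftime("%Y-%m-%d")
    return df


def join_all(frames: list) -> pd.DataFrame:
    """Outer-join standardized dataframes on region code and date."""
    joined = frames[0]
    for frame in frames[1:]:
        joined = joined.merge(frame, on=KEYS, how="outer")
    return joined.sort_values(KEYS).reset_index(drop=True)


# Two made-up raw inputs with different column names and date formats.
raw_a = "region,day,hosp\nAAA,14/08/2020,10\nAAA,15/08/2020,12\n"
raw_b = "code,date,icu\nAAA,2020-08-15,3\nBBB,2020-08-15,1\n"

frame_a = standardize(
    raw_a,
    {"region": "open_covid_region_code", "day": "date", "hosp": "hospitalized_current"},
    "%d/%m/%Y",
)
frame_b = standardize(
    raw_b,
    {"code": "open_covid_region_code", "date": "date", "icu": "icu_current"},
    "%Y-%m-%d",
)

joined = join_all([frame_a, frame_b])
# With no path argument, to_csv returns the CSV text instead of writing a file.
print(joined.to_csv(index=False))
```

An outer join is chosen in this sketch so that a region or date present in only one source still appears in the output, with the other source's columns left empty.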

Adding a new data source

Before adding a new data source, we go through an internal approval within Google to ensure compliance with licensing and terms. Once a data source is approved, you can add the data to the pipeline as follows:

1. Register new data types in src/config/data.yaml:
2. Add a new yaml file to src/config/sources.
3. Update docs and licenses:

Authors

This repository is created and maintained by Katie Everett, Dan Nanas, Maddy Myers (UCSD), Sumit Arora, and Ian Fischer.

Data Sources

Australia

Source name: covid19data.com.au (link)
Link to data: https://www.covid19data.com.au/hospitalisations-icu
Description: Data is scraped manually from the charts provided at the source link. Data for Australia consists of time series data for current hospitalizations, ICU and ventilator cases.
License: Creative Commons Attribution 4.0 International (link)
Last accessed: 2020-12-23

COVIDTracking

Source name: COVID-19 Tracking Project (link)
Link to data: https://github.com/COVID19Tracking/covid-tracking-data/tree/master/data
Description: Data is downloaded automatically from the source link. Data for the United States consists of time series data for current and cumulative hospitalizations.
License: Apache 2.0 (link)
Last accessed: 2020-12-14

Colombia

Original data source: GOV.CO (link)
Link to original data: https://www.datos.gov.co/Salud-y-Protecci-n-Social/Casos-positivos-de-COVID-19-en-Colombia/gt2j-8ykr/data
Data aggregated by: COVID-19 Colombia (link)
License: Creative Commons Attribution-ShareAlike 4.0 International (link)
Last accessed: 2020-12-14

Czech Republic

Source name: National Health Information System, Regional Hygiene Stations, Ministry of Health of the Czech Republic (link)
Link to data: https://onemocneni-aktualne.mzcr.cz/covid-19
Description: Data is scraped manually from the charts provided at the source link. Data for the Czech Republic consists of time series data for current ICU cases, and current and cumulative hospitalizations.
Citation:

Komenda M., Karolyi M., Bulhart V., Žofka J., Brauner T., Hak J., Jarkovský J., Mužík J., Blaha M., Kubát J., Klimeš D., Langhammer P., Daňková Š., Májek O., Bartůňková M., Dušek L. COVID‑19: Přehled aktuální situace v ČR. Onemocnění aktuálně [online]. Praha: Ministerstvo zdravotnictví ČR, 2020 [cit. 25.04.2020]. Dostupné z: https://onemocneni-aktualne.mzcr.cz/covid-19. Vývoj: společné pracoviště ÚZIS ČR a IBA LF MU. ISSN 2694-9423.

Last accessed: 2020-12-23

Denmark

Source name: Statens Serum Institute (link)
Link to data: https://www.sst.dk/da/corona/tal-og-overvaagning
Description: Data is manually scraped from charts at the source link. Data for Denmark consists of time series data for current hospitalizations and ICU cases.
Last accessed: 2020-12-23

Finland

Source name: Finnish Institute for Health and Welfare (link)
Link to data: https://thl.fi/en/web/infectious-diseases/what-s-new/coronavirus-covid-19-latest-updates
License: Creative Commons Attribution 4.0 International (link)
Last accessed: 2020-12-23

France

Source name: data.gouv.fr (link)
Link to data: https://www.data.gouv.fr/en/datasets/donnees-hospitalieres-relatives-a-lepidemie-de-covid-19/
Description: Data is scraped manually from the charts provided at the source link. Data for France consists of time series data for cumulative hospitalizations and ICU cases.
License: Open License 2.0 (link)
Last accessed: 2020-12-14

Google's COVID19 Community Mobility Reports

Source name: Google's COVID19 Community Mobility Reports (link)
Link to data: https://www.gstatic.com/covid19/mobility/Global_Mobility_Report.csv
Help Center: https://support.google.com/covid19-mobility
Description: These Community Mobility Reports aim to provide insights into what has changed in response to policies aimed at combating COVID-19. The reports chart movement trends over time by geography, across different categories of places.
Terms: In order to download or use the data or reports, you must agree to the Google Terms of Service.
License: Google Terms of Service (link)
Citation:

Google LLC "Google COVID-19 Community Mobility Reports".
https://www.google.com/covid19/mobility/ Accessed: <date>.

Last accessed: 2020-08-28

Google's COVID19 Search Trends symptoms dataset

Source name: Google's COVID19 Search Trends symptoms dataset (link)
Link to data: http://goo.gle/covid19symptomdataset
Description: The COVID-19 Search Trends symptoms dataset shows aggregated, anonymized trends in Google searches for a broad set of health symptoms, signs and conditions. The dataset provides a daily or weekly time series for each region showing the relative volume of searches for each symptom.
Terms: In order to download or use the data or reports, you must agree to the Google Terms of Service.
License: Google Terms of Service (link)
Citation:

Google LLC "Google COVID-19 Search Trends symptoms dataset".
http://goo.gle/covid19symptomdataset, Accessed: <date>.

Last accessed: None

Iceland

Source name: Directorate of Health in Iceland (Embaetti landlaeknis) (link)
Link to data: https://www.covid.is/data
Description: Data is downloaded manually from the source link. Data for Iceland consists of time series data for current ICU cases, and current and cumulative hospitalizations.
Last accessed: 2020-06-22

Ireland

Source name: Health Protection Surveillance Centre (link)
Link to data: https://www.hpsc.ie/a-z/respiratory/coronavirus/novelcoronavirus/casesinireland/epidemiologyofcovid-19inireland/
Description: Data is scraped manually from daily situation reports. Data for Ireland consists of time series data for cumulative hospitalizations.
License: Creative Commons Attribution ShareAlike 3.0 (link)
Last accessed: 2020-12-23

Italy

Source name: Dipartimento della Protezione Civile (link)
Link to data: https://github.com/pcm-dpc/COVID-19
Description: Data is downloaded automatically from the source repository. Data for Italy consists of time series data for current hospitalizations, but we can also compute cumulative hospitalizations.
License: Creative Commons Attribution 4.0 International (link)
Last accessed: 2020-12-14

Japan

Source name: Toyo Keizai Online (link)
Link to data: https://github.com/kaz-ogiwara/covid19
Copyright notice: Copyright (c) 2020 Kazuki OGIWARA / 荻原 和樹
Description: Data is downloaded automatically from the source repository. Data for Japan consists of time series data for current hospitalizations and ICU cases.
License: MIT (link)
Last accessed: 2020-08-03

Luxembourg

Source name: Luxembourg Ministry of Health (link)
Link to data: https://data.public.lu/fr/datasets/donnees-covid19/#_
Description: Data is downloaded automatically from the source link. Data for Luxembourg consists of time series data for current hospitalizations and ICU cases.
License: Creative Commons Zero 1.0 Universal (link)
Last accessed: 2020-11-23

Moldova

Source name: Ministry of Health, Labour and Social Protection (link)
Link to data: https://msmps.gov.md/ro/advanced-page-type/comunicate-de-presa
Last accessed: 2020-12-23

Netherlands

Source name: National Institute for Public Health and The Environment (link)
Link to data: https://www.rivm.nl/coronavirus-covid-19/grafieken
Description: Data is downloaded manually from the source link. Data for the Netherlands consists of time series data for current hospitalizations.
Last accessed: 2020-06-29

New Zealand

Source name: New Zealand Ministry of Health (link)
Link to data: https://www.health.govt.nz/our-work/diseases-and-conditions/covid-19-novel-coronavirus/covid-19-current-situation/covid-19-current-cases
Last accessed: 2020-12-23

Norway

Source name: Norwegian Institute of Public Health (link)
Link to data: https://www.fhi.no/en/id/infectious-diseases/coronavirus/daily-reports/daily-reports-COVID19/
Last accessed: 2020-06-22

Our World in Data

Source name: Our World in Data (link)
Link to data: https://github.com/owid/covid-19-data/tree/master/public/data
License: Creative Commons Attribution 4.0 International (link)
Citation:

Data from Our World in Data has been collected, aggregated, and documented by Diana Beltekian, Daniel Gavrilov, Charlie Giattino, Joe Hasell, Bobbie Macdonald, Edouard Mathieu, Esteban Ortiz-Ospina, Hannah Ritchie, and Max Roser.

Last accessed: 2020-12-14

Oxford Covid-19 Government Response Tracker

Source name: Oxford Covid-19 Government Response Tracker (link)
Link to data: https://github.com/OxCGRT/covid-policy-tracker/blob/master/data/OxCGRT_latest.csv
License: Creative Commons Attribution 4.0 International (link)
Citation:

Thomas Hale, Sam Webster, Anna Petherick, Toby Phillips, and Beatriz Kira. (2020). Oxford COVID-19 Government Response Tracker. Blavatnik School of Government.

Last accessed: 2020-12-14

Philippines

Source name: Philippines Department of Health (link)
Link to data: http://www.doh.gov.ph/covid19tracker
Last accessed: 2020-12-23

Spain

Source name: Ministerio de Sanidad, Consumo y Bienestar Social (link)
Link to data: https://cnecovid.isciii.es/covid19/resources/agregados.csv
Description: The data is downloaded automatically from the source link. Due to regional differences in hospitalization reporting, we do not aggregate across regions to produce country-level statistics for Spain.
Last accessed: 2020-12-14

Sweden

Source name: Public Health Agency of Sweden (link)
Link to data: https://www.arcgis.com/sharing/rest/content/items/b5e7488e117749c19881cce45db13f7e/data
Description: Data is downloaded automatically from the source link. Data for Sweden consists of time series data for current ICU cases.
Last accessed: 2020-12-14

Switzerland

Source name: Switzerland Federal Office of Public Health BAG (link)
Link to data: https://www.bag.admin.ch/bag/de/home/krankheiten/ausbrueche-epidemien-pandemien/aktuelle-ausbrueche-epidemien/novel-cov/situation-schweiz-und-international.html
Last accessed: 2020-06-29

The New York Times

Source name: The New York Times COVID-19 Data (link)
Link to data: https://github.com/nytimes/covid-19-data
License: Creative Commons Attribution-NonCommercial 4.0 International (link)
Citation:

Data from The New York Times, based on reports from state and local health agencies.

Last accessed: 2020-12-14

United Kingdom

Source name: GOV.UK (link)
Link to data: https://www.gov.uk/government/publications/
Description: Data is downloaded manually from the publications provided at the source link. Data is aggregated across regions in England and reported at the country level for England, Scotland, Wales and Northern Ireland. Data consists of time series data for current hospitalizations.
License: Open Government Licence v3.0 (link)
Last accessed: 2020-06-23