Make data available by Province_State in the JHU data

dr-itz commented 3 years ago

ECDC data reports territories separately. For example, Puerto Rico and Greenland have separate entries, while in the JHU data they are aggregated into one country. Take Denmark:

https://github.com/CSSEGISandData/COVID-19/blob/master/csse_covid_19_data/csse_covid_19_daily_reports/11-23-2020.csv

has three entries for Denmark:

Denmark with 71654 cases
Greenland with 18 cases
Faroe Islands with 500 cases

The file https://github.com/owid/covid-19-data/blob/master/public/data/jhu/total_cases.csv reports this as one entry:

Denmark with 72172 cases (i.e. the sum of all three)

In https://github.com/owid/covid-19-data/blob/master/public/data/ecdc/total_cases.csv we have

Denmark with 71654
Greenland with 18
Faeroe Islands (spelled differently, irrelevant for me) with 500

So this is the exact same data that is available in the CSSEGISandData repo as three separate values. Would it be possible to have the territories listed separately in the JHU files again, as in the ECDC files?

This seems important since those territories are not connected to the main country geographically and show very different data...

edomt commented 3 years ago

Hi @dr-itz!

That's a very good point—the JHU data is organized according to a system of Country_Region and Province_State. For now, what's planned for our November 30 change is to aggregate cases & deaths by Country_Region, meaning that Denmark will be assigned the sum of all cases & deaths for Denmark + Greenland + Faeroe Islands. This will indeed lead to some "countries" disappearing from our dataset, since the ECDC was counting Greenland and the Faeroe Islands as countries of their own.
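To make the two levels of aggregation concrete, here is a minimal sketch using the three Denmark rows quoted earlier in this thread. The column names (Province_State, Country_Region, Confirmed) and the empty Province_State for mainland Denmark are assumptions about the JHU daily-report layout, and the function names are made up for illustration; this is not OWID's actual pipeline.

```python
import csv
import io
from collections import defaultdict

# Figures quoted earlier in this thread (11-23-2020 daily report).
# Column names and the empty Province_State for the mainland are assumptions.
SAMPLE = """Province_State,Country_Region,Confirmed
,Denmark,71654
Greenland,Denmark,18
Faroe Islands,Denmark,500
"""


def total_by_country(csv_text):
    """Sum confirmed cases over all rows sharing a Country_Region."""
    totals = defaultdict(int)
    for row in csv.DictReader(io.StringIO(csv_text)):
        totals[row["Country_Region"]] += int(row["Confirmed"])
    return dict(totals)


def split_by_territory(csv_text):
    """Keep one entry per Province_State, falling back to the country name."""
    result = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        name = row["Province_State"] or row["Country_Region"]
        result[name] = int(row["Confirmed"])
    return result


country_totals = total_by_country(SAMPLE)
territory_totals = split_by_territory(SAMPLE)
print(country_totals)
print(territory_totals)
```

The first function reproduces the single summed Denmark entry, while the second keeps the three entries apart, which is the disaggregated view requested in this issue.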

On the other hand, we're currently working on making our data and charts available at the Province_State level of the JHU data. This would mean that users would be able to look at Danish data in a disaggregated manner (Denmark/Greenland/Faeroe Islands), and more generally get access to subnational data such as US states, UK nations, etc.

This is still very much a work in progress as it creates quite a few downstream challenges for us in terms of datasets, user interface, etc., but we hope to make good progress in the coming weeks.

Edouard

dr-itz commented 3 years ago

thanks for the quick reply.

US state level would be really nice indeed. I was thinking about consuming the JHU data directly, but their time series data doesn't include the US at all and the US file is way too big to consume from a browser. Having that in one file would give a reasonable file size and a much better visualization than having the US as one country.

RoyceWHowland commented 3 years ago

I'll just chime in my interest in disaggregated data at the Province_State level. I'm currently manually integrating Canadian provincial data and US state data from separate data sources. It's unfortunate to see some regional entries disappear for now due to the difference in "countries" mapped by ECDC and JHU, but getting Province_State at some point will more than make up for this.

Thanks to the OWID team for the incredible work you're doing on this!

sacundim commented 3 years ago

Good to see this is being looked into! I'll offer a word of warning: data about colonies is a minefield. For example, Puerto Rico is not counted as part of the USA's population by the US Census Bureau, but as a separate entry, and gets the same treatment in some statistical reports:

...but the USA's CDC is including it in its COVID-19 case and death counts. This in fact means that its current 4,225 cumulative cases/1M figure is wrong, because the numerator counts Puerto Rico cases but the denominator doesn't have Puerto Rico's population:

Your USA per capita figures probably have this mistake too. It overstates the USA's cumulative cases per capita by about 1.2%.
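A small sketch of how such a numerator/denominator mismatch inflates a per-capita rate. The numbers below are hypothetical round values chosen for easy arithmetic, not real statistics, and the function names are made up for illustration. The comparison assumes the consistent figure would include the territory in both the cases and the population.

```python
import math


def rate_per_million(cases, population):
    """Cumulative cases per one million people."""
    return cases / population * 1_000_000


def overstatement(cases_total, pop_main, pop_territory):
    """Relative excess of the mismatched rate over the consistent one.

    cases_total includes the territory's cases in both variants; only the
    denominator differs.
    """
    mismatched = rate_per_million(cases_total, pop_main)
    consistent = rate_per_million(cases_total, pop_main + pop_territory)
    return mismatched / consistent - 1


# Hypothetical values: a territory with 1% of the main population.
excess = overstatement(cases_total=5_050, pop_main=1_000_000, pop_territory=10_000)
print(f"{excess:.1%}")
```

Under this assumption the excess equals pop_territory / pop_main regardless of the case count, so the size of the error is driven entirely by the missing population.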

DrBazUK commented 3 years ago

Hi @edomt, Firstly a huge thanks to you all for the work in keeping this dataset available and, more importantly, STABLE.

I wonder if there's any word on how your work to reshape the JHU data to extract these dependent territories for Denmark, France, Netherlands, Norway, UK and US territories and states is going?

I can only imagine the complexity but I'm hopeful you might have something for us all early in January?

Stay strong, DrBazUK.

ghost commented 2 years ago

Hi @owid @edomt, could you provide an update on this https://github.com/owid/covid-19-data/issues/193#issuecomment-733631063? Thanks!

edomt commented 2 years ago

Hi @justinlee51

There's currently no plan to implement new charts or data explorers on our site with sub-national data — and I doubt that this'll change anytime soon. However, it doesn't mean that we couldn't work at least on an output CSV file for this repo (without making new content on ourworldindata.org from it). This file would basically be a reshaped version of the CSSE cases & deaths file, in a more usable format (similar to https://raw.githubusercontent.com/owid/covid-19-data/master/public/data/jhu/full_data.csv). Is that something you'd be interested in having?

DrBazUK commented 2 years ago

Yes, 1000%. I had started to work on reshaping the CSSE data myself to include the dependent territories (Puerto Rico, Bermuda, French Polynesia etc) but if this is something you can do to add these back into the CSV datasets it would be hugely helpful.

thanks again in advance, Baz

edomt commented 2 years ago

Hi @DrBazUK @jlwj51

JHU's subnational case and death data will now be automatically exported in a reshaped format here: https://github.com/owid/covid-19-data/blob/master/public/data/jhu/subnational_cases_deaths.zip

This data is actually extremely large, amounting to more than 150 MB uncompressed. Therefore, we make it available as a compressed CSV file. You should be able to directly feed the raw zip file (https://github.com/owid/covid-19-data/raw/master/public/data/jhu/subnational_cases_deaths.zip) to pd.read_csv (in Python) or readr::read_csv (in R) to automatically unzip and read the file, without having to code that yourself.

You can preview the file's contents here:

Importantly, we're very unlikely to make more developments or improvements around this file. If users need more data added in (such as population denominators), I would advise that they write their own scripts to do so.

DrBazUK commented 2 years ago

Thanks @edomt. I must be missing something: when I download that zip file, I get an unreadable zip.

Not sure if the error is mine in attempting to download the zip but I get a 14MB or so file that appears to be blank when attempting to open in either Windows File Explorer or 7Zip...

edomt commented 2 years ago

@DrBazUK I don't think you'll be able to extract the zip archive from a file explorer; you need to import it directly with software that knows how to decompress and read it.

Which language or software do you ultimately want to use to analyze the data?

dr-itz commented 2 years ago

double click in macOS also doesn't work, but the CLI tools work:

> zipinfo -l subnational_cases_deaths.zip
Archive:  subnational_cases_deaths.zip
Zip file size: 15089005 bytes, number of entries: 1
?rw-------  2.0 unx 121336899 b- 15088725 defN 21-Sep-28 16:04 /home/owid/covid-19-data/scripts/scripts/../../public/data/jhu/subnational_cases_deaths.zip
1 file, 121336899 bytes uncompressed, 15088725 bytes compressed:  87.6%

but there's the problem: the entry inside the archive is stored with a full path and a name ending in .zip, when it's really a .csv

edomt commented 2 years ago

Both of these are working on my end:

Python:

import pandas as pd
df = pd.read_csv("subnational_cases_deaths.zip")

R:

library(readr)
df <- read_csv("subnational_cases_deaths.zip")

DrBazUK commented 2 years ago

I worked out what the issue is:

I had to rename the downloaded file (adding a 2 suffix to the filename) before extracting it to a folder using 7-Zip on Windows 10. Windows doesn't like extracting a file with the same name as the original zip, and the file inside the archive is labeled .zip instead of .csv.
Once extracted, the folder \Downloads\subnational_cases_deaths2\home\owid\covid-19-data\scripts\scripts\public\data\jhu contains a file named subnational_cases_deaths.zip, which is actually a CSV file.
So changing the extension, as @dr-itz mentioned, allowed the file to be read.
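For anyone who would rather not rename files by hand, the same workaround can be sketched with Python's standard zipfile module: open the archive, take its only member whatever it is called, and parse it as CSV. The function name read_single_csv_member, the demo member path, and the demo columns (location, total_cases) are hypothetical; the demo runs on a tiny in-memory archive built to mimic the misleading member name, not on the real file.

```python
import csv
import io
import zipfile


def read_single_csv_member(zip_source, max_rows=None):
    """Parse the only member of a zip archive as CSV, ignoring its name."""
    with zipfile.ZipFile(zip_source) as archive:
        members = archive.namelist()
        if len(members) != 1:
            raise ValueError(f"expected one member, found {len(members)}")
        with archive.open(members[0]) as raw:
            # Wrap the binary stream so csv can read it as UTF-8 text.
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            rows = []
            for index, row in enumerate(csv.DictReader(text)):
                if max_rows is not None and index >= max_rows:
                    break
                rows.append(row)
            return rows


# Demo: an in-memory archive whose single member has a nested path
# and a misleading .zip extension (hypothetical name and columns).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
    archive.writestr("some/nested/path/data.zip", "location,total_cases\nGreenland,18\n")
buffer.seek(0)

demo_rows = read_single_csv_member(buffer)
print(demo_rows)
```

Passing a path such as "subnational_cases_deaths.zip" instead of the buffer should work the same way; max_rows is there because the uncompressed file is large and a preview is often enough.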

Thanks for the work in reshaping this data but I think for now, I'll stick with the original datasets from OWID or work out how to add in the others from JHU by another means.

owid / covid-19-data

Make data available by Province_State in the JHU data #193