Closed dr-itz closed 2 years ago
Hi @dr-itz!
That's a very good point: the JHU data is organized according to a system of Country_Region and Province_State. For now, what's planned for our November 30 change is to aggregate cases & deaths by Country_Region, meaning that Denmark will be assigned the sum of all cases & deaths for Denmark + Greenland + Faeroe Islands. This will indeed lead to some "countries" disappearing from our dataset, since the ECDC was counting Greenland and the Faeroe Islands as countries of their own.
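As a rough illustration of what aggregating by Country_Region means, here is a minimal sketch using only the Python standard library. The rows and numbers are made up for illustration, the column names are assumed to follow the JHU daily report layout, and `aggregate_by_country` is a hypothetical helper, not OWID's actual pipeline code.

```python
import csv
import io
from collections import defaultdict

# Made-up rows in the assumed layout of a JHU daily report (illustrative numbers only).
SAMPLE = """Province_State,Country_Region,Confirmed,Deaths
,Denmark,100,5
Greenland,Denmark,3,0
Faroe Islands,Denmark,7,1
,Norway,50,2
"""


def aggregate_by_country(csv_text):
    """Sum Confirmed and Deaths over all rows that share a Country_Region."""
    totals = defaultdict(lambda: {"Confirmed": 0, "Deaths": 0})
    for row in csv.DictReader(io.StringIO(csv_text)):
        country = row["Country_Region"]
        totals[country]["Confirmed"] += int(row["Confirmed"])
        totals[country]["Deaths"] += int(row["Deaths"])
    return dict(totals)


totals = aggregate_by_country(SAMPLE)
# Greenland and the Faroe Islands are folded into the single "Denmark" entry.
print(totals["Denmark"])
```

With pandas, the same idea would typically be written as `df.groupby("Country_Region")[["Confirmed", "Deaths"]].sum()`.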
On the other hand, we're currently working on making our data and charts available at the Province_State level of the JHU data. This would mean that users would be able to look at Danish data in a disaggregated manner (Denmark/Greenland/Faeroe Islands), and more generally get access to subnational data such as US states, UK nations, etc.
This is still very much a work in progress as it creates quite a few downstream challenges for us in terms of datasets, user interface, etc., but we hope to make good progress in the coming weeks.
Edouard
thanks for the quick reply.
US state level would be really nice indeed. I was thinking about directly consuming the JHU data, but their time series data doesn't include the US at all and the US file is way too big to consume from a browser. Having that in one file would give a reasonable file size and a much better visualization than having the US as one country.
I'll just chime in with my interest in disaggregated data at the Province_State level. I'm currently manually integrating Canadian provincial data and US state data from separate data sources. It's unfortunate to see some regional entries disappear for now due to the difference in "countries" mapped by ECDC and JHU, but getting Province_State at some point will more than make up for this.
Thanks to the OWID team for the incredible work you're doing on this!
Good to see this is being looked into! I'll offer a word of warning: data about colonies is a minefield. For example, Puerto Rico is not counted as part of the USA's population by the US Census Bureau, but as a separate entry, and gets the same treatment in some statistical reports:
...but the USA's CDC is including it in its COVID-19 case and death counts. This in fact means that its current 4,225 cumulative cases/1M figure is wrong, because the numerator counts Puerto Rico cases but the denominator doesn't have Puerto Rico's population:
Your USA per capita figures probably have this mistake too. It overstates the USA's cumulative cases per capita by about 1.2%.
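To make the numerator/denominator mismatch concrete, here is a small sketch of the arithmetic. The function names are hypothetical and the inputs in the usage line are round placeholder values, not real case counts or populations.

```python
def per_million(cases, population):
    """Cumulative cases per one million people."""
    return cases / population * 1_000_000


def overstatement_pct(cases_incl_territory, pop_excl_territory, pop_incl_territory):
    """Percent by which the rate is overstated when the case count includes
    a territory but the population denominator does not."""
    wrong = per_million(cases_incl_territory, pop_excl_territory)
    right = per_million(cases_incl_territory, pop_incl_territory)
    return (wrong / right - 1) * 100


# Placeholder values only: 5,000 cases, a denominator missing 10,000 of 1,000,000 people.
print(round(overstatement_pct(5_000, 990_000, 1_000_000), 2))
```

Note that the case count cancels out: the overstatement depends only on the ratio of the two population figures.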
Hi @edomt, Firstly a huge thanks to you all for the work in keeping this dataset available and, more importantly, STABLE.
I wonder if there's any word on how your work to reshape the JHU data to extract these dependent territories for Denmark, France, Netherlands, Norway, UK and US territories and states is going?
I can only imagine the complexity but I'm hopeful you might have something for us all early in January?
Stay strong, DrBazUK.
Hi @owid @edomt, could you provide an update on this https://github.com/owid/covid-19-data/issues/193#issuecomment-733631063? Thanks!
Hi @justinlee51
There's currently no plan to implement new charts or data explorers on our site with sub-national data — and I doubt that this'll change anytime soon. However, it doesn't mean that we couldn't work at least on an output CSV file for this repo (without making new content on ourworldindata.org from it). This file would basically be a reshaped version of the CSSE cases & deaths file, in a more usable format (similar to https://raw.githubusercontent.com/owid/covid-19-data/master/public/data/jhu/full_data.csv). Is that something you'd be interested in having?
Yes, 1000%. I had started to work on reshaping the CSSE data myself to include the dependent territories (Puerto Rico, Bermuda, French Polynesia etc) but if this is something you can do to add these back into the CSV datasets it would be hugely helpful.
thanks again in advance, Baz
Hi @DrBazUK @jlwj51
JHU's subnational case and death data will now be automatically exported in a reshaped format here: https://github.com/owid/covid-19-data/blob/master/public/data/jhu/subnational_cases_deaths.zip
This data is actually extremely large, amounting to more than 150 MB uncompressed. Therefore, we make it available as a compressed CSV file. You should be able to directly feed the raw zip file (https://github.com/owid/covid-19-data/raw/master/public/data/jhu/subnational_cases_deaths.zip) to pd.read_csv (in Python) or readr::read_csv (in R) to automatically unzip and read the file, without having to code that yourself.
You can preview the file's contents here:
Importantly, we're very unlikely to make more developments or improvements around this file. If users need more data added in (such as population denominators), I would advise that they write their own scripts to do so.
Thanks @edomt. I must be missing something: when I download that zip file, I get an unreadable zip.
Not sure if the error is mine in attempting to download the zip but I get a 14MB or so file that appears to be blank when attempting to open in either Windows File Explorer or 7Zip...
@DrBazUK I don't think you'll be able to extract the zip archive from a file explorer. You need to import it directly with software that knows how to decompress and read it.
Which language or software do you ultimately want to use to analyze the data?
double click in macOS also doesn't work, but the CLI tools work:
> zipinfo -l subnational_cases_deaths.zip
Archive: subnational_cases_deaths.zip
Zip file size: 15089005 bytes, number of entries: 1
?rw------- 2.0 unx 121336899 b- 15088725 defN 21-Sep-28 16:04 /home/owid/covid-19-data/scripts/scripts/../../public/data/jhu/subnational_cases_deaths.zip
1 file, 121336899 bytes uncompressed, 15088725 bytes compressed: 87.6%
but there's the problem: the entry inside the archive is stored with a full path and a name ending in .zip, but it's really a .csv
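For anyone who wants a plain .csv on disk rather than reading the archive directly, one possible workaround is to ignore the stored name and write the archive's single entry to a file name of your choosing. This is a sketch using only the Python standard library; `extract_single_csv` is a hypothetical helper and the file names in the usage comment are assumptions.

```python
import shutil
import zipfile
from pathlib import Path


def extract_single_csv(zip_path, out_path):
    """Copy the only entry of zip_path to out_path, ignoring its stored name."""
    with zipfile.ZipFile(zip_path) as archive:
        members = archive.infolist()
        if len(members) != 1:
            raise ValueError(f"expected exactly 1 entry, found {len(members)}")
        # Stream the entry to disk so the whole CSV is never held in memory.
        with archive.open(members[0]) as src, open(out_path, "wb") as dst:
            shutil.copyfileobj(src, dst)
    return Path(out_path)


# Example usage (assumes the zip has already been downloaded next to the script):
# extract_single_csv("subnational_cases_deaths.zip", "subnational_cases_deaths.csv")
```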
Both of these are working on my end:
Python:
import pandas as pd
df = pd.read_csv("subnational_cases_deaths.zip")
R:
library(readr)
df <- read_csv("subnational_cases_deaths.zip")
I worked out what the issue is:
Thanks for the work in reshaping this data but I think for now, I'll stick with the original datasets from OWID or work out how to add in the others from JHU by another means.
ECDC data reports territories separately. E.g. Puerto Rico or Greenland have separate entries, while in the JHU data they are aggregated into one country. E.g. for Denmark:
https://github.com/CSSEGISandData/COVID-19/blob/master/csse_covid_19_data/csse_covid_19_daily_reports/11-23-2020.csv
has three entries for Denmark:
The file https://github.com/owid/covid-19-data/blob/master/public/data/jhu/total_cases.csv reports this as one entry:
In https://github.com/owid/covid-19-data/blob/master/public/data/ecdc/total_cases.csv we have
So this is the exact same data that is available in the CSSEGISandData repo as three different values. Would it be possible to have the territories in the JHU files separately, like in the ECDC files, again?
This seems important since those territories are not connected to the main country geographically and show very different data...