open-covid-19 / data

Daily time-series epidemiology and hospitalization data for all countries, state/province data for 50+ countries and county/municipality data for CO, FR, NL, PH, UK and US. Covariates for all available regions include demographics, mobility reports, government interventions, weather and more.
https://open-covid-19.github.io/explorer
Apache License 2.0

There's no coordinates information for France #23

Closed quixote79 closed 4 years ago

quixote79 commented 4 years ago

First of all, I would like to thank you for providing the data. I think it's the best quality of all.

There's no coordinates information for France. Please check.

And there is an area in China with the same code. Shaanxi(SN), Shanxi(SN). I'd appreciate it if you could check this as well.

It would be even better if you could let us know the update time by data source.

owahltinez commented 4 years ago

Thanks for the kind words and for flagging these issues.

> There's no coordinates information for France. Please check.

I just added France's provinces, but didn't have time to fill in the metadata, which is a manual process.

> And there is an area in China with the same code. Shaanxi(SN), Shanxi(SN)

I had not noticed the error with Sha(a?)nxi; indeed there is a duplicate code, but they are distinct provinces.

Both errors with France and China metadata should be easy enough to fix!

> It would be even better if you could let us know the update time by data source.

This could be interpreted in a number of ways; I'm assuming you are interested in the "freshness" of the data:

  1. The time at which the event described in the record occurred -- almost impossible to know for some sources, some have a delay of 1+ day and often even longer
  2. The time at which a data source reported a record -- some report a daily snapshot at e.g. 10 AM CET, but some regularly report the entire history including fixes to old data which means that old records might get recent update times
  3. The time at which a record was last updated in data.csv -- maybe useful for auditing purposes, but doesn't tell us much about when the data was actually collected and whether it's current
  4. The last time that data.csv was updated -- also useful for auditing purposes, but seems like a bad proxy metric to measure the "freshness" of the data

To make things more complicated, there are a number of issues that have to be accounted for:

  1. Some records may be updated due to new or fixed metadata (like Shaanxi region code) but the time at which the source reported the record is unchanged -- should the update time be updated then?
  2. Some data sources used to have a large reporting delay but now are much faster, for example Spain -- we currently deliberately add days to the Date column to make sure we keep consistency in the data and to match the ECDC report, should the update time be in the future for an instance like this?
  3. Some sources do not report when a record was added or updated, for example FR regional data -- should we use a different definition of update time for those cases?

Can you help me understand how you are trying to use an "UpdateTime" column, if we were to add one?

quixote79 commented 4 years ago

I think the "freshness" of the data is sufficient. I mean, I want to know when you update. :)

Mahks commented 4 years ago

I too would like to know when the data has been updated.

I am developing a web graphing tool and do not want to keep downloading data if it has not changed.

Currently I am using PHP and get_headers to check "Content-Length" and only load the file if that has changed.

It would be ok for my purpose to have the "UpdateTime" in the headers, but those without access to the file headers do not benefit. A separate file named 'UpdateTime.txt' would be good.
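The size check described above could be sketched in JavaScript roughly as follows (the comment itself uses PHP's get_headers). This is only a sketch: `shouldDownload` and `checkForUpdate` are hypothetical names, and it assumes the server answers HEAD requests with a Content-Length header. Keep in mind that a size comparison will miss any edit that leaves the byte count unchanged.

```javascript
// Decide whether to re-download based on the Content-Length header.
// `previousLength` is the size stored after the last download (null if none).
function shouldDownload(previousLength, contentLengthHeader) {
  // Without a usable header we cannot tell, so download to be safe.
  if (contentLengthHeader === null || contentLengthHeader === undefined) return true;
  const currentLength = Number.parseInt(contentLengthHeader, 10);
  if (Number.isNaN(currentLength)) return true;
  return currentLength !== previousLength;
}

// Hypothetical wrapper: issues a HEAD request so the body is not transferred.
// Assumes a runtime with a global fetch (modern browsers, Node 18+).
async function checkForUpdate(url, previousLength) {
  const response = await fetch(url, { method: 'HEAD' });
  return shouldDownload(previousLength, response.headers.get('content-length'));
}
```

The decision logic is kept separate from the network call so it can be exercised without hitting the server.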

owahltinez commented 4 years ago

@Mahks you can get the last date that a commit was pushed to the repo using GitHub's API, see this for reference.

@quixote79 let me see if I can add a column "LastUpdated" which will represent when I last fetched the data for that particular row. So, for most rows, it will be updated every time since I'm retrieving historical data daily hoping to catch any potential corrections.

owahltinez commented 4 years ago

@quixote79 I looked into this, but I ran into a major problem. Currently, even though we are loading all historical data for almost all countries, only a few rows get updated so you can visually inspect the changes. For example, look at this commit and click "Load diff".

By adding a LastUpdated column, almost every single row gets updated each time. Then the diff of the file would become useless and it's much harder to keep track of potentially bad changes to the whole file. I would have to make major changes and merge the current data with the previous days' data every time, which defeats the purpose of retrieving historical data.

As a proxy, you can look at when the repo was last updated to tell the approximate "freshness" of the data. Currently, only 2 types of data are not being updated every time the repo gets updated: Spain and Italy country-level data for dates prior to March.

To see when the last update was done to the repo, you can use GitHub's API which outputs a JSON file: https://api.github.com/repos/open-covid-19/data/commits. The date will be the "freshness" time:

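$.getJSON('https://api.github.com/repos/open-covid-19/data/commits', data => {
    // The first entry is the most recent commit; its committer date is the "freshness" time
    const lastUpdated = data[0].commit.committer.date;
    console.log(lastUpdated);
});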

Hopefully this is sufficient for your purposes, let me know if you have any issues with that idea.
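For anyone not using jQuery, the same idea could be sketched with fetch. This is an illustration rather than anything shipped with the repo: `latestCommitDate`, `isNewer` and `repoHasUpdate` are hypothetical names, and `lastSeen` is assumed to be a date string the caller stored after its previous download.

```javascript
// Extract the committer date of the most recent commit from the API response.
// The commits endpoint lists commits newest first, so the first entry is the latest.
function latestCommitDate(commits) {
  if (!Array.isArray(commits) || commits.length === 0) return null;
  return commits[0].commit.committer.date;
}

// True when the repo has a commit newer than the one we last saw.
function isNewer(latest, lastSeen) {
  if (latest === null) return false;
  if (lastSeen === null) return true;
  return new Date(latest).getTime() > new Date(lastSeen).getTime();
}

// Assumes a runtime with a global fetch (modern browsers, Node 18+).
async function repoHasUpdate(lastSeen) {
  const response = await fetch('https://api.github.com/repos/open-covid-19/data/commits');
  if (!response.ok) throw new Error(`GitHub API returned ${response.status}`);
  return isNewer(latestCommitDate(await response.json()), lastSeen);
}
```

Unauthenticated requests to the GitHub API are rate limited, so a check like this is better run on a schedule than on every page view.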

Mahks commented 4 years ago

@owahltinez Thanks for the stackoverflow reference.

Do you want credit as the data source / compiler on my page? If so what would you like? Name, link etc.

owahltinez commented 4 years ago

There is no need for citation / source, but if you want to add one you can say: Open COVID-19 Dataset and link to https://github.com/open-covid-19/data.

If you care to share a link to your page, I can add it to the README file in this repo.

Mahks commented 4 years ago

Ok, thanks;

Link to site : http://www.starlords3k.com/covid19.php

owahltinez commented 4 years ago

@Mahks I'm completely blown away by your site, it is very cool!

I added a call-out to the README, please take a look and let me know if you are not happy with it.

Mahks commented 4 years ago

Thanks for the call-out, glad you like it.

I am looking to add an option to display dates of major events, like when a lockdown began. Do you know of any data resource for that?

I would also like to find a data source for country statistics like level of medical care, doctors per population etc.

owahltinez commented 4 years ago

> I would also like to find a data source for country statistics like level of medical care, doctors per population etc.

That was discussed briefly in this other issue. I think the best source of information for that should be Wikipedia, but the information is quite sparse. A few examples:

I don't think we should add too many columns of metadata to the output data CSV file here, but I wouldn't be opposed to adding the columns to the metadata CSV file as long as they are not expected to change often (like number of hospital beds or physicians).

For something that is expected to change often and has a date associated with it, like government measures or construction of field hospitals, I would love to find a reliable source of information that we can automatically scrape and add to the output data CSV file. Ideally that would be on Wikipedia, which is both easy to scrape and accessible to everyone.

quixote79 commented 4 years ago

Please review my poor-quality work.

https://kepler.gl/demo/map?mapUrl=https://dl.dropboxusercontent.com/s/lrb24g5cc1c15ja/COVID-19_Dataset.json

Mahks commented 4 years ago

Sorry I did not mean to suggest you add that other data to your file. I wanted it for my own use.

There is this API: https://apps.who.int/gho/athena/public_docs/examples.html#ex2 Lots of data, but it does not seem to be very current.

I wondered why you had all the country data repeated in your csv file. Why not have 2 files, one for country, one for data per date? For my site I parse your file into JSON format : http://www.starlords3k.com/covid19_data.json Was only 200Kb until I added the worldwide numbers.

owahltinez commented 4 years ago

> Please review my poor-quality work

@quixote79 I think that looks great! I don't think I've seen any other implementations using MapBox, everyone else appears to be using ArcGIS. Do you want a call-out in the README of this repo using that link?

> Sorry I did not mean to suggest you add that other data to your file. I wanted it for my own use.

@Mahks well, I think it would be useful to everyone else. My goal for this repo is to serve as a source of well-maintained, high-quality data related to COVID-19.

> There is this api : https://apps.who.int/gho/athena/public_docs/examples.html#ex2 Lots of data, but does not seem to be real current.

Thanks for the link. I'll take a look but I'm not very hopeful since I've been disappointed with the quality of the data coming from WHO.

> I wondered why you had all the country data repeated in your csv file. Why not have 2 files, one for country, one for data per date? For my site I parse your file into JSON format : http://www.starlords3k.com/covid19_data.json Was only 200Kb until I added the worldwide numbers.

Yes, the current solution is not optimal. I think a lot of people are building maps and the latitude/longitude + population did not make the final output unreasonably large but I would not want to add any more fields. The main issue is that performing joins with non-tabulated formats like JSON is a pain, and some of the map tools people are using simply can't do that: you provide a JSON file and then use a GUI to create a map.

What do you think about this idea:

It still would not be like the JSON file you linked, because you are pivoting the data away from the record format and using country codes as keys. I would not recommend creating your own JSON files since it will add yet another hop with additional delay -- there are currently 1-2 hops between the data coming from an authoritative source until it gets published in this repo, depending on the data source. I don't yet have it fully automated because some sources have not been 100% stable, which means that there could be multiple hours of delay right now. Once I can trust the data sources a little bit more, I hope to automatically update the data hourly.
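To make the distinction concrete, pivoting away from the record format into a structure keyed by country code might look something like the sketch below. The function name `pivotByCountry` and the field names `code`, `date`, `cases` and `dead` are hypothetical and only serve the illustration; they are not the column names of the published files.

```javascript
// Turn a flat list of records (one per country per date) into an object
// keyed first by country code and then by date.
function pivotByCountry(records) {
  // A prototype-less object avoids clashes with inherited property names.
  const pivoted = Object.create(null);
  for (const { code, date, cases, dead } of records) {
    pivoted[code] ??= {};
    pivoted[code][date] = { cases, dead };
  }
  return pivoted;
}
```

The pivoted form is compact and convenient for lookups, but it can no longer be treated as a table, which is the trade-off described above.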

Mahks commented 4 years ago

I don't think you have to provide JSON format, as anyone needing that could easily convert from csv. The JSON file above was created just to demonstrate the structure.

I use your csv file and convert with PHP into JSON to send to the browser. So there is no "another hop with additional delay".

What I really meant I guess was, why not have 2 csv files? One with the country data and another with the date data. I don't know if that would be problematic for other uses. Does the data have to be in one file for some reason?

country_data.csv: AF,Afghanistan,,,33.93911,67.709953,38041754 (i.e. code,name,code,name,lat,long,pop)

date_data.csv: 2019-12-31,1,0 (i.e. date,cases,dead)

It does not matter for me as I can parse the data regardless. I mostly wonder why the data is repeated; the file is 400% larger as a result.
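The two-file layout suggested here could be derived from the combined records along these lines. It is a sketch under assumptions: `splitRecords` is a hypothetical name, the short field names follow the example in this comment, region fields are left out for brevity, and each dated row is assumed to carry the country code so it can be linked back to its country.

```javascript
// Split denormalized records into a per-country table and a per-date table.
function splitRecords(records) {
  const countries = new Map();
  const daily = [];
  for (const { code, name, lat, long, pop, date, cases, dead } of records) {
    // Static country fields are stored once, the first time a code is seen.
    if (!countries.has(code)) countries.set(code, { code, name, lat, long, pop });
    // Dated rows keep only the code plus the values that change over time.
    daily.push({ code, date, cases, dead });
  }
  return { countries: [...countries.values()], daily };
}
```

The size saving comes from the static fields no longer being repeated on every dated row.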

quixote79 commented 4 years ago

> @quixote79 I think that looks great! I don't think I've seen any other implementations using MapBox, everyone else appears to be using ArcGIS. Do you want a call-out in the README of this repo using that link?

Yes, please :)

owahltinez commented 4 years ago

@quixote79 I added a call-out to your map in the README, please take a look!

owahltinez commented 4 years ago

@Mahks

> I don't think you have to provide JSON format, as anyone needing that could easily convert from csv.

I don't think it's strictly a need, but the cost to do it is very low and reading JSON is much easier than reading CSV from JavaScript applications that have no server-side component.

> I use your csv file and convert with PHP into JSON to send to the browser. So there is no "another hop with additional delay".

Ah, I see. I thought you were converting to JSON on a schedule. That's an interesting architecture, and I'm guessing that you are caching the resulting JSON file (which is why you were asking about last updated time).

> What I really meant I guess was, why not have 2 csv files? One with the country data and another with the date data. I don't know if that would be problematic for other uses. Does the data have to be in one file for some reason?

I will provide what you are asking for in the form of data_minimal.csv and data_minimal.json, I think it's a good idea for those that care about performance; even though data.json is currently 2.6MB, which is manageable, it's only going to get bigger. data.csv is still under 1MB which is much better, but as you point out it's simply not very efficient.

Unfortunately, having the latitude/longitude and population data is necessary for those who want to use a map tool that only accepts a single data source, and who may not be skilled programmers with their own server to re-process the CSV / JSON files provided here. By providing both data.csv and data_minimal.csv we should be able to make everyone happy without incurring too much of a maintenance burden.

owahltinez commented 4 years ago

@Mahks please take a look at the new metadata (https://open-covid-19.github.io/data/metadata.csv) and data_minimal (https://open-covid-19.github.io/data/data_minimal.csv) CSV files, I think they should correspond to what you were suggesting. To join the datasets, simply use the column Key -- use an outer join since each key appears multiple times in the data_minimal dataset, once for each date where data is available.
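Joining the two datasets on Key could be sketched as below, assuming both files have already been parsed into arrays of objects. `joinOnKey` is a hypothetical helper, and apart from Key and Date the column names in the usage note are placeholders. Strictly speaking this is a left join from the dated rows: every data_minimal row is kept, and rows without matching metadata simply carry no extra fields.

```javascript
// Attach the metadata for each Key to every dated row that shares that Key.
function joinOnKey(metadata, dataMinimal) {
  // Index the metadata once so each lookup is constant time.
  const byKey = new Map(metadata.map(row => [row.Key, row]));
  // Metadata fields are spread first so values from the data row win on any name clash.
  return dataMinimal.map(row => ({ ...(byKey.get(row.Key) || {}), ...row }));
}
```

For example, `joinOnKey([{ Key: 'AF', Latitude: '33.93911' }], [{ Key: 'AF', Date: '2020-03-31' }])` yields a single row carrying Key, Latitude and Date.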

Mahks commented 4 years ago

I think your link to data_minimal.csv is pointing to the wrong file...

metadata.csv looks great

Mahks commented 4 years ago

You changed the current data format!!!

It killed my site :(

owahltinez commented 4 years ago

Uh oh. What are you seeing differently? I'm looking into it right now

Edit: as far as I can tell, the format looks the same. The only difference is that I have added a column called Key which should be a non-breaking change. Were you assuming the order of the columns by any chance?
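A parser that looks columns up by header name rather than by position would not be affected by a new column such as Key. The sketch below is deliberately naive: `parseCsvByHeader` is a hypothetical helper, it splits on commas and so assumes no quoted fields containing commas, and apart from Key and Date the column names in the usage note are placeholders.

```javascript
// Parse CSV text into objects keyed by the names found in the header row.
// Naive on purpose: assumes no quoted fields and no commas inside values.
function parseCsvByHeader(text) {
  const lines = text.split(/\r?\n/).filter(line => line.length > 0);
  if (lines.length === 0) return [];
  const header = lines[0].split(',');
  return lines.slice(1).map(line => {
    const cells = line.split(',');
    // Missing trailing cells become empty strings instead of undefined.
    return Object.fromEntries(header.map((name, i) => [name, cells[i] ?? '']));
  });
}
```

With this approach `row.Date` resolves to the same value whether the header is `Date,Confirmed` or `Key,Date,Confirmed`.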

Mahks commented 4 years ago

https://raw.githubusercontent.com/open-covid-19/data/master/output/data.csv

Arrghh

Mahks commented 4 years ago

That is the file I use to get the data. I can't access the other with PHP (so far as I have found)

owahltinez commented 4 years ago

OK, I understand the problem. You are using the raw github output. Please use instead this link: https://open-covid-19.github.io/data/data.csv

Mahks commented 4 years ago

Do you use Discord? https://discord.gg/TxUkaqy

owahltinez commented 4 years ago

Closing this issue, since all the problems listed here have been resolved. Feel free to open new issues to provide feedback or ask questions!