open-covid-19 / data

Daily time-series epidemiology and hospitalization data for all countries, state/province data for 50+ countries and county/municipality data for CO, FR, NL, PH, UK and US. Covariates for all available regions include demographics, mobility reports, government interventions, weather and more.
https://open-covid-19.github.io/explorer
Apache License 2.0

Potentially Useful Data for Machine Learning #20

Closed by wilschmidtt 4 years ago

wilschmidtt commented 4 years ago

I have used the code and data provided in this repository to create my own pipeline, which I am using to train various ML models. The pipeline outputs data in a format similar to the data in this repo, with a few tweaks that make it easier to use for ML.

I deleted all the entries for the US, Spain, and China that didn't include a region name, and I added all the available data for these three countries that does include a region name. I added populations for all rows, along with a new 'PercentConfirmed' column (number of confirmed cases / population) and a 'SafetyMeasures' column that is meant to approximate the date on which each location started implementing shelter-in-place orders. In the 'SafetyMeasures' column, 0 translates to 'no' and 1 translates to 'yes'. All rows start at 0, and when the 'PercentConfirmed' column exceeds 0.002% (this threshold can easily be adjusted if necessary), the column changes to 1.

The 'SafetyMeasures' column is very useful for ML because the models are badly thrown off by a sudden decrease in new cases, as in China, which has hovered around 80,000 confirmed cases for the past two weeks despite a rapid increase before that. If the model knows the date that safety measures were put into place, it can anticipate this leveling out of new cases. One last column that I added was 'Days Since 2019-12-31', which helps the ML models better interpret each date.
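For concreteness, here is a minimal pandas sketch of how those three derived columns could be computed (the column names and the helper itself are illustrative, not the actual pipeline code):

```python
import pandas as pd

def add_ml_columns(df: pd.DataFrame, threshold: float = 0.002) -> pd.DataFrame:
    """Add the derived columns described above.

    Assumes 'Date', 'Confirmed', and 'Population' columns; these names
    and this helper are illustrative, not the pipeline's actual schema.
    """
    df = df.copy()
    # Share of the population confirmed infected, as a percentage.
    df["PercentConfirmed"] = df["Confirmed"] / df["Population"] * 100
    # 0 until PercentConfirmed exceeds the (adjustable) 0.002% threshold, 1 after.
    # Confirmed counts are cumulative, so the flag never flips back to 0.
    df["SafetyMeasures"] = (df["PercentConfirmed"] > threshold).astype(int)
    # Days elapsed since 2019-12-31, giving models a numeric view of each date.
    df["DaysSince"] = (pd.to_datetime(df["Date"]) - pd.Timestamp("2019-12-31")).dt.days
    return df
```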

I included a screenshot of what the data looks like, and I wanted to ask whether it would be useful to anyone if I uploaded this data to the repo. I run the pipeline twice a day (8:00 a.m. and 8:00 p.m. PST), and the data is accurate as of this morning.

As I said, the data is optimized for ML, so it could be of use to those looking to do the same.

(screenshot: coronavirus_data)

owahltinez commented 4 years ago

Hey William, thanks for sharing -- this is pretty cool! I think that adding all the columns you propose might make the main dataset a bit bloated, but I'd love to add some of them if we can find a reliable source. Specifically, I'd like to get a better understanding of where you got the SafetyMeasures data from. If we can get a reliable source for that, we could add columns to the dataset for international_travel, local_travel, and shelter_in_place.

If you want, you can open a PR that edits the relevant metadata_*.csv files and fills in the Population and SafetyMeasures columns. Unless I missed something, the other columns you mentioned can be inferred from the data itself.

wilschmidtt commented 4 years ago

The SafetyMeasures column wasn't fetched from any online source. I looked for a site that reports this information but couldn't find anything useful. I simply populated the column by dividing the number of confirmed cases by the population, and when the confirmed cases exceeded 0.002% of the population, I changed the SafetyMeasures column from 0 to 1. This method is a bit arbitrary, so I can see why it might not be the best feature to include. I chose 0.002% by observing at what point different locations started to take action; from what I saw, that came right around the point where 0.002% of a location's population had been infected by the virus.

I agree that international_travel, local_travel, and shelter_in_place would all be much more reliable features. The only problem is that I am not sure where such data would be available.

I will open a PR to edit the metadata populations in the meantime.

dataf3l commented 4 years ago

@wilschmidtt, I suggest that in addition to safety_measures, one includes a safety_measures_start_date, so that when new countries adopt measures the model is still useful, given that so many countries have different measures.

Also, we can make a poll where we ask individuals from all countries to participate and provide information so that we can fill out this data easily. If you write a Google Forms poll, I can send it to friends in Nepal, India, the US, Colombia, Chile, Mexico, Australia, Peru, Belgium, and France, and I can also translate the poll into Spanish to share it with people from the Latin American region.

All you need is one nerd per country and you are set; this person can become an information source. The poll should also ask people for the source of their data.

If the problem is data collection, I think we can find the people to help.

Just send the questions in English, and I'll send back the data in CSV or whatever format you want.

Remember, the fewer the questions, the more data points.

dataf3l commented 4 years ago

Actually, I just noticed there is a date on the dataset, so never mind, my suggestion doesn't make sense.

owahltinez commented 4 years ago

@dataf3l I think your idea is still valid; we can put the safety measures in their own CSV table and then merge them in during the data processing stage. In my opinion, the biggest difficulty would be keeping it up to date, since measures are changing very fast across different countries.
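A rough sketch of what that merge could look like, assuming a hypothetical safety_measures.csv keyed by region with a MeasuresStartDate column (both the file and the column names are made up for illustration):

```python
import pandas as pd

# Main time series, one row per region per day.
data = pd.read_csv("data.csv", parse_dates=["Date"])
# Small hand-maintained table: one row per region with its measures start date.
measures = pd.read_csv("safety_measures.csv", parse_dates=["MeasuresStartDate"])

# Left join keeps regions without a known start date (MeasuresStartDate is NaT).
merged = data.merge(measures, on="Key", how="left")

# 1 from the start date onward, 0 before it; comparisons against NaT are False.
merged["SafetyMeasures"] = (merged["Date"] >= merged["MeasuresStartDate"]).astype(int)
```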

wilschmidtt commented 4 years ago

@dataf3l this could still be a good idea. Like I said, the 'SafetyMeasures' column is chosen pretty arbitrarily at this point. I couldn't find a good source of data indicating when each location started issuing quarantines; I had to search all over the web, and each bit of information I found was exclusive to one location, so trying to fill it in for every location would take far too long.

From what I observed, right around 0.002% confirmed is when governments started to feel the pressure and issue warnings to the public. I tried to use this to infer the date on which preventative measures were put into place, but if there were actual sources that could verify this date, that would be even better.

wilschmidtt commented 4 years ago

@dataf3l there is also the problem of keeping it up to date. The nice thing about the 0.002% threshold is that it automates the process and doesn't require any manual manipulation of the data.

dataf3l commented 4 years ago

I think that's interesting, what about renaming the column HasPassed2PercentSoWeGuesstimateMeasureHaveBeenTakenButHaveNoRealDataSoIt'sJustAGuess :p

dataf3l commented 4 years ago

I'm merely joking; I see that having no data is clearly an issue. Having up-to-date data will also be an issue.

wilschmidtt commented 4 years ago

@dataf3l this is a decent suggestion. But I was thinking something more along the lines of ArbitrarilyChosen2PercentBecauseImTooLazyToFindRealSourcesAndUpdateTheDataEachDaySoThisIsAllWeGot

dataf3l commented 4 years ago

here is what the dataset could look like:

CO: 2020-03-19: https://en.wikipedia.org/wiki/2020_coronavirus_pandemic_in_Colombia
PE: 2020-03-22: https://en.wikipedia.org/wiki/2020_coronavirus_pandemic_in_Bolivia
BR: ????: https://en.wikipedia.org/wiki/2020_coronavirus_pandemic_in_Brazil
CL: 2020-03-22: https://en.wikipedia.org/wiki/2020_coronavirus_pandemic_in_Chile

here is where I got the data from:

Other countries: https://en.wikipedia.org/wiki/2020_coronavirus_pandemic_in_South_America#Argentina
Other continents: https://en.wikipedia.org/wiki/2019%E2%80%9320_coronavirus_pandemic_by_country_and_territory

I think as people spend more time on it, it is likely that we'll be able to improve the dataset. Let's make this happen.

If you make a Google Forms doc, I'll send it around :)

owahltinez commented 4 years ago

@dataf3l thank you for those links. That makes me wonder whether a better approach would be to propose creating a new table on the Wikipedia page rather than trying to collect that data in this repo. That way, the data would be available to a lot more people, and we could still scrape it from Wikipedia ourselves.
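As a rough illustration of that scraping step (the exact page and table index are assumptions; a real scraper would need to locate the right table):

```python
import pandas as pd

# pandas.read_html returns every <table> on the page as a DataFrame
# (requires lxml or html5lib to be installed).
url = ("https://en.wikipedia.org/wiki/"
       "2019%E2%80%9320_coronavirus_pandemic_by_country_and_territory")
tables = pd.read_html(url)

# Placeholder index: which table holds the measures data depends on the
# article's layout and would need to be verified by inspection.
measures = tables[0]
measures.to_csv("safety_measures.csv", index=False)
```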

Personally, I would prefer to keep the efforts in this repo focused on (automated) data aggregation rather than the creation of crowd-sourced data -- even though crowd-sourced data was the original intent of this repo!

dataf3l commented 4 years ago

Should mankind make an app to track movements and self-report symptoms, so that people can avoid paths taken by people with symptoms?

owahltinez commented 4 years ago

FYI, I have added mobility and government measures datasets, which are relevant to this discussion.