open-covid-19 / data

Daily time-series epidemiology and hospitalization data for all countries, state/province data for 50+ countries and county/municipality data for CO, FR, NL, PH, UK and US. Covariates for all available regions include demographics, mobility reports, government interventions, weather and more.
https://open-covid-19.github.io/explorer
Apache License 2.0

Is the data considered to be transactional? #176

Closed OmarJay1 closed 4 years ago

OmarJay1 commented 4 years ago

I'm curious about the issue of negative new cases and deaths. Is it assumed that older data never gets changed, and is only lowered by issuing a daily negative value?

Thanks.

owahltinez commented 4 years ago

Great question. Unfortunately, we don't have a formal definition that would answer it, so let me explain what we do instead:

  1. If a data source provides both new_ and total_ values, we record both
  2. If a data source provides only new_ or only total_ values, we record what is provided and estimate the missing one
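As an illustration only, one plausible way to estimate the missing series is to difference the cumulative counts or to take a running sum of the daily counts. The thread does not say how the project actually does it, so the function names and the sample numbers below are assumptions. The sketch also shows how a downward revision of a total_ value turns into a negative new_ value.

```python
from itertools import accumulate
from typing import List, Optional


def new_from_total(total: List[int]) -> List[Optional[int]]:
    """Daily differences of a cumulative series (hypothetical helper).

    The first day has no predecessor, so its daily value is unknown (None).
    """
    if not total:
        return []
    return [None] + [curr - prev for prev, curr in zip(total, total[1:])]


def total_from_new(new: List[int]) -> List[int]:
    """Running sum of a daily series (hypothetical helper)."""
    return list(accumulate(new))


# Made-up values: the source revised its cumulative count downward on day 3.
total_deceased = [100, 110, 105, 120]
new_deceased = new_from_total(total_deceased)   # [None, 10, -5, 15]

# Going the other way, from daily counts to a cumulative series.
new_confirmed = [3, 4, 0, 2]
total_confirmed = total_from_new(new_confirmed)  # [3, 7, 7, 9]
```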

Data sources frequently change values in the past. Sometimes it's a minor change (e.g. double counting one case on date X) but it can affect all the dates after that change if, for example, the cumulative counts are altered.

There are also cases where data sources decide to change how to count things. For example, Spain decided to change the requirements to consider COVID-19 the cause for a fatality and that resulted in a fairly large negative value for one particular date.

OmarJay1 commented 4 years ago

Thanks. I've been toying with how to smooth out negative increases. I know they're technically what's been officially reported. However, the negative values are confusing on charts that show daily differences. Also, the notion that a negative number of people were diagnosed or died doesn't make physical sense.

I'm thinking of:

  1. Identifying negative events
  2. Setting them to 0
  3. Allocating the decreases proportionally across the previous 30 days.
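A minimal sketch of these three steps, assuming the daily values are a plain list ordered by date. The function name, the handling of a decrease larger than the window total, and the sample numbers are all assumptions made for illustration, not code from either repository.

```python
from typing import List, Tuple


def reallocate_negatives(new: List[float], window: int = 30) -> Tuple[List[float], float]:
    """Zero out negative daily values and spread each decrease over prior days.

    Each decrease is removed from the preceding `window` days in proportion to
    their values. Returns the adjusted series plus any amount that could not be
    allocated because the window did not contain enough counts.
    """
    adjusted = [float(v) for v in new]
    unallocated = 0.0
    for i in range(len(adjusted)):
        if adjusted[i] >= 0:
            continue
        deficit = -adjusted[i]          # step 1: a negative event
        adjusted[i] = 0.0               # step 2: set it to 0
        start = max(0, i - window)
        window_sum = sum(adjusted[start:i])
        if window_sum <= 0:
            unallocated += deficit      # nothing earlier to take it from
            continue
        removed = min(deficit, window_sum)
        scale = 1.0 - removed / window_sum
        for j in range(start, i):       # step 3: proportional allocation
            adjusted[j] *= scale
        unallocated += deficit - removed
    return adjusted, unallocated


# Made-up example: a correction of -8 follows three days totalling 40.
series, leftover = reallocate_negatives([10, 20, 10, -8, 5])
# series is approximately [8, 16, 8, 0, 5] and the overall sum (37) is preserved.
```

Because earlier days are processed first, every value in the window is already non-negative when a later decrease is allocated. The adjusted values are generally fractional, so they would need rounding before being presented as counts.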

I've been meaning to post some code and neural nets I've been working on in my repository. I'll post that code when it's done.

Thanks.

owahltinez commented 4 years ago

You can allocate the "missing" values proportionally as you suggest, but I would be wary of potential effects on analysis since outbreaks tend to happen in batches.

If you want to run your modeling on new_ values, you can just ignore the negative values (or set them to zero).

If you want to run your modeling on total_ values, that's probably a more robust metric, since some issues with the data are masked (it doesn't matter if data is missing for one day, or if that day's value is zero). But then you have to deal with a variable that is supposed to be monotonically non-decreasing but sometimes decreases.

One way you can adjust these values to keep them monotonic is to compute your own adjusted version like this:

  1. Remove all new_ values < 0
  2. Compute the cumulative sum of new_

The only problem then is that the total will not be the same as what's reported elsewhere, but for the purposes of modeling that shouldn't matter.
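The two steps above can be sketched as follows, assuming the new_ values are a plain list ordered by date. Negative values are replaced with zero rather than dropped so that every date keeps its place in the series, which gives the same cumulative sum. The function name and the sample numbers are made up for illustration.

```python
from itertools import accumulate
from typing import List


def monotonic_total(new: List[int]) -> List[int]:
    """Cumulative sum of the daily values after discarding negative ones."""
    clipped = [max(v, 0) for v in new]   # step 1: ignore new_ values < 0
    return list(accumulate(clipped))     # step 2: cumulative sum


new_confirmed = [10, 20, -8, 5]          # made-up daily values
adjusted_total = monotonic_total(new_confirmed)    # [10, 30, 30, 35]
reported_total = list(accumulate(new_confirmed))   # [10, 30, 22, 27]
```

The adjusted series never decreases, and in this example it ends 8 higher than the total implied by the raw values, which is the discrepancy mentioned above.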

OmarJay1 commented 4 years ago

I think that reallocating negatives is legitimate. It just needs to be disclosed. I've gone back to importing the raw data. I'm going to have to make different versions for certain graphs. Right now I just give an explanation for negative daily values. Thanks.

https://omnimodel.com/graph1?key=ES

https://omnimodel.com/graph1?key=ES_CM

owahltinez commented 4 years ago

I'll close this out for now. Let me know if you have any more questions about the data.

FYI we have moved the files to Google Cloud Storage because we are running into the limits of GitHub Pages. The new endpoint for the files is https://storage.cloud.google.com/covid19-open-data/v2/latest/main.csv (we will update the documentation shortly)

Please be aware that we renamed the master table to main :-)