Closed OmarJay1 closed 4 years ago
Great question. Unfortunately, we don't have a formal definition that could answer it, so let me explain what we do instead:
- If the source reports both `new_` and `total_`, then we record both values
- If the source reports only `new_` or `total_`, then we record it and estimate the missing one

Data sources frequently change values in the past. Sometimes it's a minor change (e.g. double counting one case on date X), but it can affect all the dates after that change if, for example, the cumulative counts are altered.
There are also cases where data sources decide to change how they count things. For example, Spain changed the requirements for attributing a fatality to COVID-19, and that resulted in a fairly large negative value for one particular date.
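To make the "record one, estimate the other" idea concrete, here is a minimal sketch in plain Python. The thread doesn't spell out how the project actually does the estimation, so differencing and cumulative summing are my assumptions, and the function names and sample numbers are made up for illustration. It also shows how a downward revision in the cumulative count turns into a negative daily value.

```python
from itertools import accumulate

def new_from_total(total):
    """Daily values implied by a cumulative series (day 1 keeps its total)."""
    if not total:
        return []
    return [total[0]] + [b - a for a, b in zip(total, total[1:])]

def total_from_new(new):
    """Cumulative series implied by a series of daily values."""
    return list(accumulate(new))

# Hypothetical cumulative counts: the source revises its total downward on day 4.
total_reported = [10, 15, 22, 18, 25]
new_estimated = new_from_total(total_reported)
print(new_estimated)  # [10, 5, 7, -4, 7]
```

The two helpers are inverses of each other, so whichever series the source publishes, the other one can be rebuilt from it.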
Thanks. I've been toying with how to smooth out negative increases. I know they're technically what's been officially reported. However, the negative values are confusing on charts that show daily differences. Also, the notion that a negative number of people were diagnosed or died is scientifically inaccurate.
I'm thinking of:
I've been meaning to post some code and neural nets I've been working on in my repository. I'll post that code when it's done.
Thanks.
You can allocate the "missing" values proportionally as you suggest, but I would be wary of potential effects on analysis since outbreaks tend to happen in batches.
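For reference, one possible reading of "allocate proportionally" is sketched below. This is only my interpretation, not something the dataset does: each negative day is set to zero and the earlier days are scaled down by the same factor so the overall sum stays unchanged. The function name and the sample values are hypothetical.

```python
def reallocate_negatives(new):
    """Absorb each negative daily value into the earlier days, pro rata."""
    adjusted = [float(v) for v in new]
    for i, value in enumerate(adjusted):
        if value >= 0:
            continue
        prior_sum = sum(adjusted[:i])
        # Skip corrections larger than everything reported so far;
        # those days are left untouched.
        if prior_sum <= 0 or prior_sum + value < 0:
            continue
        scale = (prior_sum + value) / prior_sum
        for j in range(i):
            adjusted[j] *= scale  # shrink earlier days proportionally
        adjusted[i] = 0.0
    return adjusted

# Hypothetical daily values with one negative correction on day 4.
smoothed = reallocate_negatives([10, 5, 7, -4, 7])
```

Note that the result is fractional and every day before the correction changes, which is exactly the kind of side effect on analysis mentioned above.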
If you want to run your modeling on `new_` values, you can just ignore the negative values (or set them to zero).
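A quick sketch of the two options, using made-up daily values (the variable names are just for illustration):

```python
# Hypothetical daily values with one negative correction.
new_reported = [10, 5, 7, -4, 7]

# Option 1: ignore the negative days entirely (the series gets shorter,
# so dates must be filtered the same way to stay aligned).
ignored = [v for v in new_reported if v >= 0]

# Option 2: keep every day but set negative values to zero.
zeroed = [max(v, 0) for v in new_reported]

print(ignored)  # [10, 5, 7, 7]
print(zeroed)   # [10, 5, 7, 0, 7]
```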
If you want to run your modeling on `total_` values, that's probably a more robust metric, since some issues with the data can be masked (it doesn't matter if one day there is missing data, or if the data is zero). But then you run into the problem of dealing with a variable that is supposed to be monotonic but sometimes decreases.
One way you can adjust these values to keep them monotonic is to compute your own adjusted version like this:
- Set all `new_` values < 0 to zero
- Recompute the total as the cumulative sum of the adjusted `new_` values
The only problem then is that the total will not be the same as what's reported elsewhere, but for the purposes of modeling that shouldn't matter.
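A minimal sketch of that adjustment in plain Python, assuming the adjusted total is rebuilt as a running sum of the non-negative daily values. The function name and numbers are hypothetical.

```python
from itertools import accumulate

def adjusted_total(new):
    """Monotonic cumulative series built from daily values, negatives zeroed."""
    clipped = [max(v, 0) for v in new]
    return list(accumulate(clipped))

# Hypothetical daily values; the raw running total would end at 25.
new_reported = [10, 5, 7, -4, 7]
print(adjusted_total(new_reported))  # [10, 15, 22, 22, 29]
```

The adjusted series never decreases, but it ends at 29 instead of the reported 25, which is the mismatch with other sources described above.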
I think that reallocating negatives is legitimate. It just needs to be disclosed. I've gone back to importing the raw data. I'm going to have to make different versions for certain graphs. Right now I just give an explanation for negative daily values. Thanks.
I'll close this out for now. Let me know if you have any more questions about the data.
FYI we have moved the files to Google Cloud Storage because we are running into the limits of GitHub Pages. The new endpoint for the files is https://storage.cloud.google.com/covid19-open-data/v2/latest/main.csv (we will update the documentation shortly)
Please be aware that we renamed the `master` table to `main` :-)
I'm curious about the issue of negative new cases and deaths. Is it assumed that older data never gets changed, and is only lowered by issuing a daily negative value?
Thanks.