jgehrcke / covid-19-germany-gae

COVID-19 statistics for Germany. For states and counties. With time series data. Daily updates. Official RKI numbers.
MIT License

Potential error in calculation of daily deaths #666

Open sarahckramer opened 3 years ago

sarahckramer commented 3 years ago

Hi there!

First of all, I want to say thank you for all of the work you've done in making these data available in a more user-friendly format.

I have also been exploring/using the RKI COVID-19 data in my work, and wanted to alert you to a possible error in your calculation of daily deaths (perhaps related to issue #227). For example, "deaths_rki_by_state.csv" has the first deaths occurring on March 4, even though the first deaths from COVID in Germany did not occur until the 9th.

I'm not sure if this is why, but I know that I originally made the error of assuming that the field "Meldedatum" referred to the date on which the death was reported, and I wonder if you have been making the same assumption. Confusingly, the "Meldedatum" field refers to the date on which the case was first reported, and the only way to tell when a death occurred is using the "Datenstand" field on the day the death first appears in the dataset - see comments here: https://www.arcgis.com/home/item.html?id=f10774f1c63e40168479a1feb6c7ca74, for example (translated from German): "The Meldedatum is always the date on which the case was made known to the responsible health authority. If the case tests positive on 1 May 2020 and is made known to the health authority on 1 May 2020, then the Meldedatum is 1 May 2020. If the case subsequently dies, e.g. on 7 May 2020, the Meldedatum is not changed and still refers to 1 May 2020." Could this be what is causing the discrepancies? (Apologies if you're already aware of the somewhat weird way Meldedatum is defined.)

Best, Sarah

jgehrcke commented 3 years ago

Hey! Thanks for the detailed feedback Sarah!

First of all, I want to say thank you for all of the work you've done in making these data available in a more user-friendly format.

Thank you for the kind words!

Confusingly, the "Meldedatum" field refers to the date on which the case was first reported

There certainly was a time when I didn't know about this, but I have been aware of it for a couple of months now :). It's buried in the README of this repository, where I write:

Note: there is a systematic difference between the RKI data-based death rate curve and the Risklayer-based death rate curve. Both curves are wrong, and yet both curves are legit. The incidents of death that we learn about today may have happened days or weeks in the past. Neither curve attempts to show the exact time of death (sadly! :-)) The RKI curve, in fact, is based on the point in time when each corresponding COVID-19 case that led to death was registered in the first place ("Meldedatum" of the corresponding case). The Risklayer data set to my knowledge pretends as if the incidents of death we learn about today happened yesterday. While this is not true, the resulting curve is a little more intuitive.
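To make the difference between the two conventions in that note concrete, here is a minimal sketch. The records are made up, and the function names are purely illustrative (they are not taken from this repository's code): each record pairs the Meldedatum of a case with the day on which the death became known.

```python
from collections import Counter
from datetime import date, timedelta

# Hypothetical records: (Meldedatum of the case, day the death became known).
deaths = [
    (date(2020, 5, 1), date(2020, 5, 8)),
    (date(2020, 5, 1), date(2020, 5, 12)),
    (date(2020, 5, 3), date(2020, 5, 8)),
]


def by_meldedatum(records):
    """Count each death on the day its case was first registered."""
    return Counter(meldedatum for meldedatum, _ in records)


def by_day_before_known(records):
    """Count each death on the day before it became known."""
    return Counter(known - timedelta(days=1) for _, known in records)


rki_style = by_meldedatum(deaths)
risklayer_style = by_day_before_known(deaths)

# Print both counts side by side for every day that appears in either curve.
for day in sorted(set(rki_style) | set(risklayer_style)):
    print(day, rki_style[day], risklayer_style[day])
```

Both curves sum to the same total, but they place the deaths on different days, and neither day is the actual date of death.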

Could this be what is causing the discrepancies?

This is unrelated to #227, if that's what you're referring to here :). But no worries.

(Apologies if you're already aware of the somewhat weird way Meldedatum is defined.)

No worries at all! Truly appreciate every pair of eyeballs on all this, and people joining the discussion :)

the only way to tell when a death occurred is using the "Datenstand" field on the day the death first appears in the dataset

This is indeed an interesting approach, but then again this field could carry a wrong/misleading date, too. You probably agree that it's a little unintuitive (and sad) that the reporting chain, and ultimately the RKI, does not simply and explicitly track the 'day of death'. And maybe they even track this, but the ArcGIS databases don't expose it. Or we are too stupid to find it :-).
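For reference, the "Datenstand" idea boils down to diffing consecutive daily snapshots. The following is only a rough sketch under stated assumptions: the snapshot numbers are invented, and the mapping from Datenstand to cumulative deaths is a hypothetical structure, not something the ArcGIS service returns in this form.

```python
from datetime import date

# Hypothetical snapshots: Datenstand -> cumulative deaths in that snapshot.
snapshots = {
    date(2020, 5, 6): 10,
    date(2020, 5, 7): 13,
    date(2020, 5, 8): 13,
    date(2020, 5, 9): 17,
}


def newly_reported(cumulative_by_datenstand):
    """Return {Datenstand: deaths that first appear in that snapshot}.

    The earliest snapshot has no predecessor and is therefore omitted.
    """
    days = sorted(cumulative_by_datenstand)
    return {
        curr: cumulative_by_datenstand[curr] - cumulative_by_datenstand[prev]
        for prev, curr in zip(days, days[1:])
    }


new_deaths = newly_reported(snapshots)
```

Note that this yields the day on which a death first shows up in the data, which is still not the day of death, and retroactive corrections in the source data could produce negative differences.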

As I wrote in the quoted paragraph above, from my point of view the Risklayer data-based evolution of deaths appears to be somewhat more useful.

In any case, please please report back when you find more interesting details, and especially when you find ways to remove ambiguity / get more clarity about the date of death based on RKI data.

jgehrcke commented 3 years ago

see comments here: https://www.arcgis.com/home/item.html?id=f10774f1c63e40168479a1feb6c7ca74

By the way: super exciting discussion in that comment thread. Thanks for linking to that.

jgehrcke commented 3 years ago

It's linked in the README, but for clarity I want to add that this repository uses the ArcGIS feature server called RKI_COVID19_Sums (not the RKI_COVID19 one), which has its metadata explained here: https://www.arcgis.com/home/item.html?id=9644cad183f042e79fb6ad00eadc4ecf. (The same considerations hold true for the relationship between Meldedatum and actual time of death.)

sarahckramer commented 3 years ago

Whoops, I should have read the readme a bit more closely! Thank you for the clarification and for your kind response.

It seems like the RKI is willing to give individual researchers access to at least Bundesland-level deaths by week of death for private use, but it's a shame the information isn't more easily/publicly accessible, since the data would be really useful for a lot of research purposes. I unfortunately don't know of any better way of processing the death data than what you've described for the Risklayer data, but I will definitely let you know if I learn anything new!