Open mathiasflick opened 3 years ago
Hey Mathias!
(which would normally keep continuity)...
normally, yes. Haha.
I have seen a small number of 'data corruptions' on their side, that they fixed within a small number of days. I did not always document these incidents, but here is one example: https://github.com/jgehrcke/covid-19-germany-gae/issues/608
For now let's maybe wait for a little bit and see if this disappears.
Happy to take a closer look otherwise.
'Data mangling' in the real world is hard, as we can see here. They have a pretty robust data pipeline by now, I think, but I am sure there is still potential for human error every day, like fat-fingering something in a spreadsheet. We'll see; of course the problem might be something else entirely.
Thank you for the report, as always!
Problem still persists today:
$ curl -sO https://raw.githubusercontent.com/jgehrcke/covid-19-germany-gae/master/cases-rki-by-ags.csv && \
python -c 'import pandas as pd; df=pd.read_csv("cases-rki-by-ags.csv",index_col=["time_iso8601"]); print(df["8126"][-20:])'
time_iso8601
2021-03-31T17:00:00+0000 4050
2021-04-01T17:00:00+0000 4094
2021-04-02T17:00:00+0000 4170
2021-04-03T17:00:00+0000 4178
2021-04-04T17:00:00+0000 4180
2021-04-05T17:00:00+0000 4181
2021-04-06T17:00:00+0000 4190
2021-04-07T17:00:00+0000 4257
2021-04-08T17:00:00+0000 4323
2021-04-09T17:00:00+0000 4358
2021-04-10T17:00:00+0000 4420
2021-04-11T17:00:00+0000 4431
2021-04-12T17:00:00+0000 4442
2021-04-13T17:00:00+0000 4211
2021-04-14T17:00:00+0000 4271
2021-04-15T17:00:00+0000 4315
2021-04-16T17:00:00+0000 4356
2021-04-17T17:00:00+0000 4413
2021-04-18T17:00:00+0000 4425
2021-04-19T17:00:00+0000 4430
Name: 8126, dtype: int64
Non-monotonic between these two:
2021-04-12T17:00:00+0000 4442
2021-04-13T17:00:00+0000 4211
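For anyone who wants to spot such a step without eyeballing the column, here is a minimal sketch. It only uses the values printed above; the helper name `find_drops` is made up for this example.

```python
# Cumulative case counts for AGS 8126, copied from the output above
# (2021-04-10 to 2021-04-15).
series = [
    ("2021-04-10", 4420),
    ("2021-04-11", 4431),
    ("2021-04-12", 4442),
    ("2021-04-13", 4211),
    ("2021-04-14", 4271),
    ("2021-04-15", 4315),
]


def find_drops(points):
    """Return (date, delta) for every step where a cumulative count decreases."""
    drops = []
    # Pair each data point with its successor and compare the two counts.
    for (_, prev), (date, cur) in zip(points, points[1:]):
        if cur < prev:
            drops.append((date, cur - prev))
    return drops


print(find_drops(series))
```

A cumulative count should never decrease from one day to the next, so any entry returned here marks a spot worth checking against the primary source.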
OK. Found the artifact in the primary source, too:
On the BW state website for COVID-19 there is a link to this spreadsheet:
I downloaded and inspected it just now (April 20, 2021, 16:00 local time) and found:
There, the non-monotonicity is between April 16 and April 17.
Well. The actual primary source seems to be https://www.corona-im-hok.de/ and there is a link to this ArcGIS system: https://lra-hok.maps.arcgis.com/apps/dashboards/d770da3ea38643bbbd662a3e05bad5a9
When I play with that dashboard, look up the cumulative case count curve, and zoom in to the date range of interest, I do not see the artifact:
The red line seems to be what we're looking for. It has the following data points:
April 12: 4508
April 13: 4568
April 14: 4628
April 15: 4670
April 16: 4711
April 17: 4755
Did leave a note with the BW state via contact form. I doubt this will even reach the right desk. : )
The data for the Hohenlohekreis in https://sozialministerium.baden-wuerttemberg.de/fileadmin/redaktion/m-sm/intern/downloads/Downloads_Gesundheitsschutz/Tabelle_Coronavirus-Faelle-BW.xlsx appears to be wrong around April 16.
April 16: 4634
April 17: 4398
The cumulative value must not decrease going forward in time. That is surely a typo, isn't it?
The district itself reports (per the data on https://www.corona-im-hok.de/):
April 12: 4508
April 13: 4568
April 14: 4628
April 15: 4670
April 16: 4711
April 17: 4755
More discussion of the problem: https://github.com/jgehrcke/covid-19-germany-gae/issues/906#issuecomment-823305444
Very good findings, indeed! I am a little concerned that RKI apparently derives its figures for the total count and the 7-day incidence not from a single timeline but from at least two, which are not necessarily consistent with each other. Otherwise their 7-day incidence would be corrupted (like the one in my, and your :-), (geo)graphical representation).
This is what the RKI dashboard infobox says:
7-day incidence plausible according to your trustworthy source. Total number of cases in line with the (most likely wrong) timeline of total cases!
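To illustrate why a drop in the cumulative timeline would corrupt a 7-day incidence derived from that same timeline, here is a rough sketch. It assumes the incidence is simply the difference of the cumulative count over 7 days, scaled to 100,000 inhabitants; whether RKI computes it this way is an assumption, and `population` is a placeholder parameter that I deliberately do not fill with a real number.

```python
def seven_day_new_cases(cumulative):
    """For each day i >= 7, return cumulative[i] - cumulative[i - 7]."""
    return [cumulative[i] - cumulative[i - 7] for i in range(7, len(cumulative))]


def seven_day_incidence(cumulative, population):
    """7-day new cases per 100,000 inhabitants (population is a placeholder)."""
    return [n * 100_000 / population for n in seven_day_new_cases(cumulative)]


# AGS 8126, 2021-04-05 to 2021-04-19, copied from the CSV output earlier
# in this thread.
cum = [4181, 4190, 4257, 4323, 4358, 4420, 4431,
       4442, 4211, 4271, 4315, 4356, 4413, 4425, 4430]

# One value per day from 2021-04-12 to 2021-04-19.
print(seven_day_new_cases(cum))
```

With the drop on 2021-04-13 inside the window, several of the 7-day sums turn negative, which cannot be a real incidence. A plausible dashboard incidence therefore suggests it is not computed from this timeline.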
Greetings from Cologne Mathias
Did leave a note with the BW state via contact form. I doubt this will even reach the right desk. : )
I got a reply that my message was forwarded to the appropriate department :)
Curiosity is every engineer’s best asset ... Motivated by this discussion, I have built a little script to systematically check the respective CSV for non-monotonicity. This is what I found: there are 88 issues for 75 out of 401 entities (excluding "Berlin details" and 'sum_cases'). Most of them are minor (e.g. a deviation of -1), but setting the threshold to 10 (giving all deviations of 11 and more), this is the result (sorry for the bad formatting, though):
AGS    Name                         Date        Deviation
5116   Mönchengladbach, Stadt       2021-03-14  -32
5754   Gütersloh                    2021-03-14  -165
8111   Stuttgart, Stadtkreis        2020-05-03  -13
8126   Hohenlohekreis               2021-04-13  -229
8226   Rhein-Neckar-Kreis           2021-03-08  -24
9362   Regensburg                   2020-12-18  -13
9563   Fürth                        2021-03-14  -25
9573   Fürth                        2021-03-14  -28
9678   Schweinfurt                  2021-03-14  -87
10041  Regionalverband Saarbrücken  2021-03-12  -67
12065  Oberhavel                    2020-05-03  -12
15082  Anhalt-Bitterfeld            2021-03-14  -16
16053  Jena, Stadt                  2021-04-13  -11
16053  Jena, Stadt                  2021-04-16  -22
16075  Saale-Orla-Kreis             2021-03-14  -48
Using a threshold of 10 there are 15 issues for 14 out of 401 entities! Besides the fact that our "Hohenlohe case" is indeed the most significant one, very obviously something happened on 2021-03-14 (with seven out of the 15 issues found with threshold 10 ...).
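For reference, a check like this can be sketched in a few lines. This is not my actual script, just a minimal stdlib-only version of the idea; the function name `find_nonmonotonic` and the `skip` parameter are made up here, and the exact names of the Berlin detail columns would have to be added to `skip` by hand. The sample rows are the 8126 values from the CSV output earlier in this thread.

```python
import csv
import io


def find_nonmonotonic(rows, time_col="time_iso8601", threshold=0,
                      skip=("sum_cases",)):
    """Scan every column of a cumulative-count table for decreases.

    rows: iterable of dicts, e.g. a csv.DictReader.
    Returns (column, timestamp, delta) for each day-over-day decrease
    that is larger than `threshold`, sorted by column, then timestamp.
    """
    issues = []
    prev = {}  # last seen value per column
    for row in rows:
        ts = row[time_col]
        for col, raw in row.items():
            if col == time_col or col in skip or not raw:
                continue
            value = int(float(raw))
            if col in prev and prev[col] - value > threshold:
                issues.append((col, ts, value - prev[col]))
            prev[col] = value
    # Numeric columns (AGS codes) first, in numeric order.
    issues.sort(key=lambda i: (int(i[0]) if i[0].isdigit() else float("inf"), i[1]))
    return issues


SAMPLE = """time_iso8601,8126
2021-04-11T17:00:00+0000,4431
2021-04-12T17:00:00+0000,4442
2021-04-13T17:00:00+0000,4211
2021-04-14T17:00:00+0000,4271
"""

issues = find_nonmonotonic(csv.DictReader(io.StringIO(SAMPLE)), threshold=10)
for col, ts, delta in issues:
    print(col, ts[:10], delta)
```

To run it on the real file, pass a `csv.DictReader` opened on `cases-rki-by-ags.csv` instead of the in-memory sample. With `threshold=10` only decreases of 11 or more are reported, which matches the filtering used for the list above.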
Greetings from Cologne Mathias
Some changes during the last few days (unfortunately not for the better ...)! Output of my little script using a threshold of 10; new/modified issues are highlighted in bold:
AGS    Name                         Date        Deviation
5116   Mönchengladbach, Stadt       2021-03-14  -32
5754   Gütersloh                    2021-03-14  -165
8111   Stuttgart, Stadtkreis        2020-05-03  -13
8126   Hohenlohekreis               2021-04-13  -229
8226   Rhein-Neckar-Kreis           2021-03-08  -24
9362   Regensburg                   2020-12-18  -13
9471   Bamberg                      2021-05-03  -12
9563   Fürth                        2021-03-14  -25
9564   Nürnberg                     2021-05-03  -124
9573   Fürth                        2021-03-14  -28
9662   Schweinfurt                  2021-05-03  -16
9678   Schweinfurt                  2021-03-14  -87
9678   Schweinfurt                  2021-05-03  -16
10041  Regionalverband Saarbrücken  2021-03-12  -67
12065  Oberhavel                    2020-05-03  -12
15082  Anhalt-Bitterfeld            2021-03-14  -16
16053  Jena, Stadt                  2021-04-13  -51
16075  Saale-Orla-Kreis             2021-03-14  -48
Using a threshold of 10 there are 18 issues for 17 out of 401 entities! The deviations add up to 978 cases in total.
Since Sunday (2021-05-09) there are 168 additional cases (with a threshold of 10); the sum was 810 cases as of Sunday (2021-05-09).
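The totals can be re-checked directly from the list above. A small sketch follows; the rows are copied from that list. That the 168 additional cases correspond to the four entries dated 2021-05-03 is my reading of the numbers, not something the script output states.

```python
# (AGS, date, deviation) rows copied from the threshold-10 output above.
issues = [
    ("5116", "2021-03-14", -32),
    ("5754", "2021-03-14", -165),
    ("8111", "2020-05-03", -13),
    ("8126", "2021-04-13", -229),
    ("8226", "2021-03-08", -24),
    ("9362", "2020-12-18", -13),
    ("9471", "2021-05-03", -12),
    ("9563", "2021-03-14", -25),
    ("9564", "2021-05-03", -124),
    ("9573", "2021-03-14", -28),
    ("9662", "2021-05-03", -16),
    ("9678", "2021-03-14", -87),
    ("9678", "2021-05-03", -16),
    ("10041", "2021-03-12", -67),
    ("12065", "2020-05-03", -12),
    ("15082", "2021-03-14", -16),
    ("16053", "2021-04-13", -51),
    ("16075", "2021-03-14", -48),
]

n_issues = len(issues)
n_entities = len({ags for ags, _, _ in issues})
total = -sum(dev for _, _, dev in issues)
# Assumption: the entries dated 2021-05-03 are the additional ones.
recent = -sum(dev for _, date, dev in issues if date == "2021-05-03")

print(n_issues, n_entities)            # 18 17
print(total, recent, total - recent)   # 978 168 810
```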
Greetings from Cologne Mathias
Cases for 2021-04-13: 4502
Cases for 2021-04-14: 4263 (239 less than day before!)
Cases for 2021-04-15: 4306
...
This is not in line with the usual 'corrective action' taken by RKI (which would normally keep continuity)... Or am I missing something?
Greetings from Cologne Mathias