jgehrcke / covid-19-germany-gae

COVID-19 statistics for Germany. For states and counties. With time series data. Daily updates. Official RKI numbers.
MIT License

No RKI updates since 4 days ... (Landkreis 16056 disappeared from RKI data set) #1748

Open mathiasflick opened 2 years ago

mathiasflick commented 2 years ago

There have been no updates to the RKI files for four days now (as of 2021-10-04, 20:45 local time). Is there a problem with changes to the input data provided by RKI? If so, how can I help?

Greetings from Cologne Mathias

jgehrcke commented 2 years ago

Thank you @mathiasflick for the report.

I had a quick look at the logs and found:

Traceback (most recent call last):
  File "tools/build-rki-csvs.py", line 499, in <module>
    main()
  File "tools/build-rki-csvs.py", line 52, in main
    df_by_lk, df_berlin_cases_sum, df_berlin_deaths_sum = fetch_and_clean_data()
  File "tools/build-rki-csvs.py", line 176, in fetch_and_clean_data
    assert lacking_wrt_ref == set([11000, 3152])
AssertionError

Looks like the set of Amtliche Gemeindeschlüssel (AGS) in the RKI data set changed once again. In the past, that has always been caused by a human error somewhere in the pipeline. The code might be overly strict. I might be able to precisely understand and fix this tomorrow. Hopefully.
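
A less strict variant of such a check could report which keys are newly missing instead of aborting on any difference. The following is only a sketch under assumptions: `diff_ags_sets`, `KNOWN_LACKING`, and the miniature input sets are hypothetical and not taken from tools/build-rki-csvs.py; only the set {11000, 3152} comes from the traceback above.

```python
# Hypothetical sketch, not the code in tools/build-rki-csvs.py.
# This set mirrors the one in the failing assertion.
KNOWN_LACKING = {11000, 3152}


def diff_ags_sets(reference_ags, observed_ags, known_lacking=KNOWN_LACKING):
    """Return (newly_missing, unexpected) AGS sets.

    newly_missing: in the reference set, absent from the data, and not
                   already known to be lacking.
    unexpected:    present in the data but absent from the reference set.
    """
    reference_ags = set(reference_ags)
    observed_ags = set(observed_ags)
    lacking_wrt_ref = reference_ags - observed_ags
    newly_missing = lacking_wrt_ref - set(known_lacking)
    unexpected = observed_ags - reference_ags
    return newly_missing, unexpected


# Made-up miniature example: 16056 vanished from the observed data.
reference = {11000, 3152, 16056, 16063}
observed = {16063}
newly_missing, unexpected = diff_ags_sets(reference, observed)
print(newly_missing)  # {16056}
print(unexpected)     # set()
```

Returning the differences rather than asserting on them would let the caller decide whether to log a warning and continue or to stop the pipeline.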

jgehrcke commented 2 years ago

Data for this Landkreis recently went missing:

  "16056": {
    "name": "SK Eisenach",
    "state": "Thüringen",
    "lat": 50.9833,
    "lon": 10.3167,
    "population": 42250
  },

jgehrcke commented 2 years ago

I may want to remove the lacking_wrt_ref check and update csv-epsilon-merge.py so that the base set may contain more columns than the extension set, and then forward-fill those columns.
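
The forward-fill idea can be sketched in plain Python as follows. This is an illustration under assumptions, not the actual csv-epsilon-merge.py: the function name `merge_with_forward_fill`, the row-as-dict representation, and the "day 1" / "day 2" labels are hypothetical. The case counts are the ones quoted elsewhere in this thread.

```python
# Hypothetical sketch of the idea, not the actual csv-epsilon-merge.py.
# Each row is a dict keyed by column name.


def merge_with_forward_fill(base_columns, base_rows, ext_columns, ext_rows):
    """Append ext_rows to base_rows, forward-filling base-only columns."""
    if not base_rows:
        raise ValueError("base set must contain at least one row")
    unknown = [c for c in ext_columns if c not in base_columns]
    if unknown:
        raise ValueError(f"extension has columns unknown to base: {unknown}")

    merged = [dict(row) for row in base_rows]
    last = dict(merged[-1])  # last known value per column
    for ext_row in ext_rows:
        new_row = {}
        for col in base_columns:
            # Take the fresh value if the extension has one; otherwise
            # carry the previous value forward.
            new_row[col] = ext_row.get(col, last[col])
        merged.append(new_row)
        last = new_row
    return merged


base_columns = ["time", "16056", "16063"]
base_rows = [{"time": "day 1", "16056": 1975, "16063": 8579}]
ext_columns = ["time", "16063"]  # column 16056 disappeared upstream
ext_rows = [{"time": "day 2", "16063": 10572}]

merged = merge_with_forward_fill(base_columns, base_rows, ext_columns, ext_rows)
print(merged[-1])  # {'time': 'day 2', '16056': 1975, '16063': 10572}
```

The sketch still rejects extension columns that the base set does not know about, so only the "column disappeared" direction is relaxed.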

jgehrcke commented 2 years ago

On vacation. Didn't get to this yet. Sorry about that :/

jgehrcke commented 2 years ago

I have addressed this in #1827.

jgehrcke commented 2 years ago

I have looked at the data more closely to better understand what happened. The fact that 16056 disappeared from the RKI data set made me 'hope' that reporting for this Landkreis was merged with another Landkreis.

Indeed, there is a pretty suspicious jump in the case number for Landkreis 16063 at the time when the case count for Landkreis 16056 stopped changing:

Screenshot from 2021-10-20 13-44-07

That jump is specifically from 8579 to 10572:

>>> 10572 - 8579
1993

The last reported case count value for Landkreis 16056 was 1975.

I think we can safely conclude that on September 12, reporting for Landkreise 16056 and 16063 was merged, and reported together under AGS 16063.
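
The arithmetic behind that conclusion can be spelled out as below. The numbers are the ones quoted in this thread; the helper `looks_like_merge` and its 5 % tolerance are a hypothetical heuristic added for illustration, not something the project uses.

```python
# Values quoted in this thread.
jump_16063 = 10572 - 8579  # increase in the 16063 column
last_16056 = 1975          # last reported value for 16056

# Part of the jump not explained by the absorbed Landkreis; plausibly
# ordinary new cases in the combined area.
unexplained = jump_16063 - last_16056
print(jump_16063, unexplained)  # 1993 18


def looks_like_merge(jump, absorbed_total, tolerance=0.05):
    """Hypothetical heuristic: the jump roughly equals the absorbed total."""
    return abs(jump - absorbed_total) <= tolerance * absorbed_total


print(looks_like_merge(jump_16063, last_16056))  # True
```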

jgehrcke commented 2 years ago

With the solution from #1827 I have now retained Landkreis 16056 in the CSV files, simply carrying forward the last known value (1975). That is not quite correct: the value should drop to 0 so that the sum over all Landkreise evolves more correctly. Given the relatively small number, though, I think I will just leave this as-is. Feedback appreciated.
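
To make the size of that error concrete, here is a minimal sketch using the values from this thread. The two-column "total" is only illustrative and stands in for the sum over all Landkreise.

```python
# Values from this thread; the two-column total is only illustrative.
last_16056 = 1975     # last value reported separately for 16056
merged_16063 = 10572  # 16063 after it absorbed the 16056 cases

total_forward_filled = last_16056 + merged_16063  # 16056 kept at 1975
total_zeroed = 0 + merged_16063                   # 16056 dropped to 0

# The forward-filled variant counts the 16056 cases twice.
print(total_forward_filled - total_zeroed)  # 1975
```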

jgehrcke commented 2 years ago

I have just looked at the columns 16056 and 16063 in the RL data set. They have seemingly been synced a while ago: they contain the same values for the entire time range of interest (that is, the sum is also wrong).

jgehrcke commented 2 years ago

The two Landkreise in question:

  "16056": {
    "name": "SK Eisenach",
    "state": "Thüringen",
  "16063": {
    "name": "LK Wartburgkreis",
    "state": "Thüringen",

on a map: Screenshot from 2021-10-20 13-57-44

(from https://www.bik-gmbh.de/download/Gebietsreform_Thueringen_zum_GS1906.pdf)

jgehrcke commented 2 years ago

So, I think it's fair to say that case numbers for Eisenach (kreisfreie Stadt) are reported as part of Wartburgkreis, which might make sense geographically and organizationally.

mathiasflick commented 2 years ago

Some research regarding local reporting of corona-related indicators (e.g. for Eisenach and Wartburgkreis) clearly supports your assumption - although I was not able to find any kind of official confirmation. Probably it is a politically motivated move in order to get "better" (i.e. lower) numbers by averaging the high one out ... But that is just my personal opinion! Anyway - this kind of "summarization" does create problems with the processing of data in dependent systems - leaving zero values and/or grey areas, e.g. in the RKI dashboard:

Screenshot 2021-10-23 at 15-23-28 RKI COVID-19 Germany

By the way, the zero for Ludwigslust-Parchim is caused by a hacking incident - they are currently not able to deliver data ... Source: https://www.kreis-lup.de/corona/

Greetings from Cologne Mathias

jgehrcke commented 2 years ago

Thank you Mathias for the additional insight! Huh. :)

jgehrcke commented 2 years ago

RL did drop the data columns for Landkreis 16056 and that required further patches -- done in https://github.com/jgehrcke/covid-19-germany-gae/pull/1842.

The RL and RKI heatmaps now both show 16056 and 16063 using the data from 16063.

mathiasflick commented 2 years ago

Perfect! Thank you so much for your work! Now I need to start my own upstream patching ... Greetings from Cologne Mathias

mathiasflick commented 2 years ago

After a little bit of research I probably found the reason for the unexpected change: According to information provided by the state of Thüringen, Eisenach was officially made part of the Wartburgkreis (effective as of 2021-07-01). Source: https://statistik.thueringen.de/datenbank/gemauswahl.asp
A problem remaining for me (I just do not remember ...) is, where we get the population from (ags.json) and whether the change is already incorporated there (important for 7di computation) and when officially updated maps (shapefiles) will be available. Thank you again and greetings from Cologne Mathias

jgehrcke commented 2 years ago

"A problem remaining for me (I just do not remember ...) is, where we get the population from (ags.json) and whether the change is already incorporated there"

Hey Mathias. Ouch. Thank you for that reminder. I will have to double-check, but it's likely that the 7di numbers have been a little off for 16063 because I didn't think this through before. Thank you!
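
Why the population figure matters can be sketched as follows, assuming "7di" means the 7-day incidence (new cases over the last 7 days per 100,000 inhabitants). The population of 16056 is taken from the ags.json excerpt quoted earlier in this thread; the population used for 16063 and the 7-day case count are hypothetical placeholders, not real figures.

```python
def seven_day_incidence(new_cases_7d, population):
    """New cases over the last 7 days per 100,000 inhabitants."""
    return new_cases_7d / population * 100_000


POP_16056 = 42_250       # from the ags.json excerpt quoted in this thread
POP_16063_OLD = 120_000  # HYPOTHETICAL placeholder, not the real figure
NEW_CASES_7D = 100       # HYPOTHETICAL combined 7-day case count

# Denominator still excludes the Eisenach population.
stale = seven_day_incidence(NEW_CASES_7D, POP_16063_OLD)
# Denominator covers the merged area.
updated = seven_day_incidence(NEW_CASES_7D, POP_16063_OLD + POP_16056)
print(round(stale, 1), round(updated, 1))  # 83.3 61.6
```

With combined case counts but an outdated population in the denominator, the incidence comes out too high; how large the effect really is depends on the actual population figures.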

Keeping track of this topic here: https://github.com/opstrace/opstrace/issues/1472