Closed psteinb closed 4 years ago
This is expected behaviour. This is what these entries look like in the CSV:
3 | Niedersachsen | Region Hannover | A60-A79 | M | -1 | -1 | 404828 | 2020-03-19T00:00:00.000Z | 3241 | 27.03.2020 00:00 | -1 | -1
8 | Baden-Württemberg | LK Ludwigsburg | A80+ | W | -1 | -1 | 412279 | 2020-03-22T00:00:00.000Z | 8118 | 27.03.2020 00:00 | -1 | -1
In the columns `AnzahlFall` and `AnzahlTodesfall` we have `-1`, and that equals NA as far as I know. Is this wrong, or am I missing something?
Good point. I just checked again in the downloaded file. Indeed, the `-1` is there. Then I have to check why this doesn't show up in my validation checks.
Any idea what that means?
What are you validating and how?
Please don't read too much into this; I'm referring to a simple check:
library(dplyr)
# Abort if any row has a missing death count.
if (nrow(df %>% filter(is.na(AnzahlTodesfall))) > 0) {
  stop("...")
}
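A check like the one above only looks for `NA`, so it cannot see the `-1` entries discussed in this thread. Here is a minimal base-R sketch that flags both, assuming a data frame with the column names from the CSV. `find_suspicious_counts` is a hypothetical helper, not part of any library, and the toy values are made up for illustration:

```r
# Hypothetical helper: return rows whose count columns are NA or negative.
find_suspicious_counts <- function(df, cols = c("AnzahlFall", "AnzahlTodesfall")) {
  # One logical vector per column (TRUE where NA or < 0), combined with OR.
  flagged <- Reduce(`|`, lapply(cols, function(col) {
    is.na(df[[col]]) | df[[col]] < 0
  }))
  df[flagged, , drop = FALSE]
}

# Toy data modelled on the rows quoted in this thread (values are illustrative).
df <- data.frame(
  ObjectId        = c(404828, 404722, 1, 2),
  AnzahlFall      = c(-1, -1, 3, 5),
  AnzahlTodesfall = c(-1, 0, NA, 0)
)

suspicious <- find_suspicious_counts(df)
if (nrow(suspicious) > 0) {
  message(sprintf("%d rows contain NA or negative counts", nrow(suspicious)))
}
```

Whether a negative count is actually an error is a separate question; this only makes such rows visible.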
Are you in contact with RKI by any chance to report these rows with `-1`?
These -1 rows don't have to be a mistake. My understanding is that they indicate explicit null-observations, which is fine.
I don't understand. Then why not use `0`? And why are these `-1` in both columns:
| Bundesland | Landkreis | Altersgruppe | AnzahlFall | AnzahlTodesfall | ObjectId | Meldedatum | NeuerFall | NeuerTodesfall |
| ----------------- | --------------- | ------------ | ---------- | --------------- | -------- | ------------------------- | --------- | -------------- |
| Niedersachsen | Region Hannover | A60-A79 | -1 | -1 | 404,828 | 2020-03-19 00:00:00+00:00 | -1 | -1 |
| Baden-Württemberg | LK Ludwigsburg | A80+ | -1 | -1 | 412,279 | 2020-03-22 00:00:00+00:00 | -1 | -1 |
And there are more if you look at rows with AnzahlFall < 0:
$ csvgrep -c6 -m '-1' RKI_COVID19.csv|csvcut -c2-4,6-8,12-13|csvlook
| Bundesland | Landkreis | Altersgruppe | AnzahlFall | AnzahlTodesfall | ObjectId | NeuerFall | NeuerTodesfall |
| ------------------- | ------------------------------------ | ------------ | ---------- | --------------- | -------- | --------- | -------------- |
| Niedersachsen | Region Hannover | A00-A04 | -1 | 0 | 404,722 | -1 | -9 |
| Niedersachsen | Region Hannover | A60-A79 | -1 | -1 | 404,828 | -1 | -1 |
| Nordrhein-Westfalen | SK Bonn | A60-A79 | -1 | 0 | 406,917 | -1 | -9 |
| Nordrhein-Westfalen | LK Heinsberg | A60-A79 | -1 | 0 | 407,578 | -1 | -9 |
| Schleswig-Holstein | LK Rendsburg-Eckernförde | A60-A79 | -1 | 0 | 404,020 | -1 | -9 |
| Hamburg | SK Hamburg | A15-A34 | -1 | 0 | 404,244 | -1 | -9 |
| Hamburg | SK Hamburg | A15-A34 | -1 | 0 | 404,275 | -1 | -9 |
| Hamburg | SK Hamburg | A35-A59 | -1 | 0 | 404,305 | -1 | -9 |
| Hamburg | SK Hamburg | A35-A59 | -1 | 0 | 404,337 | -1 | -9 |
| Nordrhein-Westfalen | SK Münster | A35-A59 | -1 | 0 | 407,953 | -1 | -9 |
| Nordrhein-Westfalen | LK Borken | A80+ | -1 | 0 | 408,109 | -1 | -9 |
| Nordrhein-Westfalen | LK Coesfeld | unbekannt | -1 | 0 | 408,189 | -1 | -9 |
| Hessen | LK Waldeck-Frankenberg | A35-A59 | -1 | 0 | 410,447 | -1 | -9 |
| Baden-Württemberg | LK Breisgau-Hochschwarzwald | A80+ | -1 | 0 | 413,797 | -1 | -9 |
| Baden-Württemberg | LK Breisgau-Hochschwarzwald | unbekannt | -1 | 0 | 413,800 | -1 | -9 |
| Baden-Württemberg | LK Tübingen | A35-A59 | -1 | 0 | 414,407 | -1 | -9 |
| Niedersachsen | LK Osnabrück | A60-A79 | -1 | 0 | 405,814 | -1 | -9 |
| Bremen | SK Bremen | A35-A59 | -1 | 0 | 405,945 | -1 | -9 |
| Nordrhein-Westfalen | SK Düsseldorf | A35-A59 | -1 | 0 | 406,043 | -1 | -9 |
| Nordrhein-Westfalen | LK Minden-Lübbecke | A15-A34 | -1 | 0 | 408,693 | -1 | -9 |
| Hessen | SK Offenbach | A15-A34 | -1 | 0 | 409,453 | -1 | -9 |
| Hessen | LK Bergstraße | unbekannt | -1 | 0 | 409,560 | -1 | -9 |
| Bayern | LK Bad Tölz-Wolfratshausen | A15-A34 | -1 | 0 | 415,265 | -1 | -9 |
| Bayern | LK Ansbach | A60-A79 | -1 | 0 | 418,019 | -1 | -9 |
| Rheinland-Pfalz | SK Ludwigshafen | A15-A34 | -1 | 0 | 411,169 | -1 | -9 |
| Sachsen | SK Dresden | A35-A59 | -1 | 0 | 421,246 | -1 | -9 |
| Berlin | SK Berlin Mitte | unbekannt | -1 | 0 | 419,659 | -1 | -9 |
| Berlin | SK Berlin Friedrichshain-Kreuzberg | A15-A34 | -1 | 0 | 419,678 | -1 | -9 |
| Berlin | SK Berlin Friedrichshain-Kreuzberg | A35-A59 | -1 | 0 | 419,725 | -1 | -9 |
| Berlin | SK Berlin Friedrichshain-Kreuzberg | A35-A59 | -1 | 0 | 419,727 | -1 | -9 |
| Berlin | SK Berlin Pankow | A15-A34 | -1 | 0 | 419,753 | -1 | -9 |
| Berlin | SK Berlin Charlottenburg-Wilmersdorf | A60-A79 | -1 | 0 | 419,897 | -1 | -9 |
| Baden-Württemberg | LK Ludwigsburg | A00-A04 | -1 | 0 | 412,151 | -1 | -9 |
| Baden-Württemberg | LK Ludwigsburg | A05-A14 | -1 | 0 | 412,154 | -1 | -9 |
| Baden-Württemberg | LK Ludwigsburg | A15-A34 | -1 | 0 | 412,179 | -1 | -9 |
| Baden-Württemberg | LK Ludwigsburg | A80+ | -1 | -1 | 412,279 | -1 | -1 |
| Berlin | SK Berlin Mitte | A15-A34 | -1 | 0 | 419,590 | -1 | -9 |
Looks like a mistake to me.
Hm. Indeed - what's the difference between 0 and -1?
I don't want to take up the limited resources of the RKI to figure this out though. I'm sure they get a lot of emails right now.
Is this issue limited to older dates at the beginning of the epidemic?
Nope.
$ csvgrep -c6 -m '-1' RKI_COVID19.csv|csvcut -c2-3,6-9,12-13|csvlook
| Bundesland | Landkreis | AnzahlFall | AnzahlTodesfall | ObjectId | Meldedatum | NeuerFall | NeuerTodesfall |
| ------------------- | ------------------------------------ | ---------- | --------------- | -------- | ------------------------- | --------- | -------------- |
| Niedersachsen | Region Hannover | -1 | 0 | 404,722 | 2020-03-24 00:00:00+00:00 | -1 | -9 |
| Niedersachsen | Region Hannover | -1 | -1 | 404,828 | 2020-03-19 00:00:00+00:00 | -1 | -1 |
| Nordrhein-Westfalen | SK Bonn | -1 | 0 | 406,917 | 2020-03-20 00:00:00+00:00 | -1 | -9 |
| Nordrhein-Westfalen | LK Heinsberg | -1 | 0 | 407,578 | 2020-03-22 00:00:00+00:00 | -1 | -9 |
| Schleswig-Holstein | LK Rendsburg-Eckernförde | -1 | 0 | 404,020 | 2020-03-16 00:00:00+00:00 | -1 | -9 |
| Hamburg | SK Hamburg | -1 | 0 | 404,244 | 2020-03-18 00:00:00+00:00 | -1 | -9 |
| Hamburg | SK Hamburg | -1 | 0 | 404,275 | 2020-03-20 00:00:00+00:00 | -1 | -9 |
| Hamburg | SK Hamburg | -1 | 0 | 404,305 | 2020-03-20 00:00:00+00:00 | -1 | -9 |
| Hamburg | SK Hamburg | -1 | 0 | 404,337 | 2020-03-21 00:00:00+00:00 | -1 | -9 |
| Nordrhein-Westfalen | SK Münster | -1 | 0 | 407,953 | 2020-03-24 00:00:00+00:00 | -1 | -9 |
| Nordrhein-Westfalen | LK Borken | -1 | 0 | 408,109 | 2020-03-21 00:00:00+00:00 | -1 | -9 |
| Nordrhein-Westfalen | LK Coesfeld | -1 | 0 | 408,189 | 2020-03-24 00:00:00+00:00 | -1 | -9 |
| Hessen | LK Waldeck-Frankenberg | -1 | 0 | 410,447 | 2020-03-20 00:00:00+00:00 | -1 | -9 |
| Baden-Württemberg | LK Breisgau-Hochschwarzwald | -1 | 0 | 413,797 | 2020-03-23 00:00:00+00:00 | -1 | -9 |
| Baden-Württemberg | LK Breisgau-Hochschwarzwald | -1 | 0 | 413,800 | 2020-03-24 00:00:00+00:00 | -1 | -9 |
| Baden-Württemberg | LK Tübingen | -1 | 0 | 414,407 | 2020-03-23 00:00:00+00:00 | -1 | -9 |
| Niedersachsen | LK Osnabrück | -1 | 0 | 405,814 | 2020-03-20 00:00:00+00:00 | -1 | -9 |
| Bremen | SK Bremen | -1 | 0 | 405,945 | 2020-03-01 00:00:00+00:00 | -1 | -9 |
| Nordrhein-Westfalen | SK Düsseldorf | -1 | 0 | 406,043 | 2020-03-23 00:00:00+00:00 | -1 | -9 |
| Nordrhein-Westfalen | LK Minden-Lübbecke | -1 | 0 | 408,693 | 2020-03-15 00:00:00+00:00 | -1 | -9 |
| Hessen | SK Offenbach | -1 | 0 | 409,453 | 2020-03-25 00:00:00+00:00 | -1 | -9 |
| Hessen | LK Bergstraße | -1 | 0 | 409,560 | 2020-03-25 00:00:00+00:00 | -1 | -9 |
| Bayern | LK Bad Tölz-Wolfratshausen | -1 | 0 | 415,265 | 2020-03-20 00:00:00+00:00 | -1 | -9 |
| Bayern | LK Ansbach | -1 | 0 | 418,019 | 2020-03-25 00:00:00+00:00 | -1 | -9 |
| Rheinland-Pfalz | SK Ludwigshafen | -1 | 0 | 411,169 | 2020-03-23 00:00:00+00:00 | -1 | -9 |
| Sachsen | SK Dresden | -1 | 0 | 421,246 | 2020-03-23 00:00:00+00:00 | -1 | -9 |
| Berlin | SK Berlin Mitte | -1 | 0 | 419,659 | 2020-03-19 00:00:00+00:00 | -1 | -9 |
| Berlin | SK Berlin Friedrichshain-Kreuzberg | -1 | 0 | 419,678 | 2020-03-25 00:00:00+00:00 | -1 | -9 |
| Berlin | SK Berlin Friedrichshain-Kreuzberg | -1 | 0 | 419,725 | 2020-03-24 00:00:00+00:00 | -1 | -9 |
| Berlin | SK Berlin Friedrichshain-Kreuzberg | -1 | 0 | 419,727 | 2020-03-25 00:00:00+00:00 | -1 | -9 |
| Berlin | SK Berlin Pankow | -1 | 0 | 419,753 | 2020-03-25 00:00:00+00:00 | -1 | -9 |
| Berlin | SK Berlin Charlottenburg-Wilmersdorf | -1 | 0 | 419,897 | 2020-03-24 00:00:00+00:00 | -1 | -9 |
| Baden-Württemberg | LK Ludwigsburg | -1 | 0 | 412,151 | 2020-03-22 00:00:00+00:00 | -1 | -9 |
| Baden-Württemberg | LK Ludwigsburg | -1 | 0 | 412,154 | 2020-03-17 00:00:00+00:00 | -1 | -9 |
| Baden-Württemberg | LK Ludwigsburg | -1 | 0 | 412,179 | 2020-03-13 00:00:00+00:00 | -1 | -9 |
| Baden-Württemberg | LK Ludwigsburg | -1 | -1 | 412,279 | 2020-03-22 00:00:00+00:00 | -1 | -1 |
| Berlin | SK Berlin Mitte | -1 | 0 | 419,590 | 2020-03-24 00:00:00+00:00 | -1 | -9 |
Regarding taking up resources of the RKI: If the data is wrong, it needs to be fixed. Guess what happens if that data is used to model the effect of political decisions.
Ok - fair enough. You write an email then? We're getting the data from here, but it's not immediately clear from this page who to contact.
Yep, already did so at 6:15pm. ;-)
Did you get an answer, @psteinb?
Nope, not yet. I'll wait until Wednesday.
Adding to the picture: https://danielgerber.eu/2020/03/22/corona-zahlen-in-sachsen/ The RKI data appears to differ from the local data anyhow. Not sure why; this could be the German federal system at play.
I see. Thank you.
Yes - there seems to be some confusion about the numbers right now. For the moment we can only trust the RKI data. I don't want to add 400 functions to cover every Landkreis separately (@mlange-42).
I wrote a message to the BKG, which is (according to the arcgis page) partly responsible for the data as well. See also https://www.bkg.bund.de/DE/Ueber-das-BKG/Dienstleistungszentrum/OpenData/OpenData.html
OK, the BKG said they are not responsible and suggested sending a message to ESRI. I have to find out where to address an email there. At this point, I am a bit lost between all these federal institutions.
Bottom of the arcgis.com page mentioned above leads to this URL: https://www.esri.de/ueber-uns/kontakt Why didn’t I see that earlier?
Well - let's see where this leads. Thank you for getting to the bottom of this!
ESRI commented on this today. They mentioned that all columns are documented online in the metadata view:
https://www.arcgis.com/home/item.html?id=dd4580c810204019a7b8eb3e0b329dd6
As of today, this page discusses negative values for the columns `NeuerFall` and `NeuerTodesfall` only (translated from German):
NeuerFall:
- `0`: the case is contained in the publication for the current day and in the one for the previous day
- `1`: the case is contained only in the current publication
- `-1`: the case is contained only in the previous day's publication
and NeuerTodesfall:
- `0`: the case is a death in both the current publication and the previous day's publication
- `1`: the case is a death in the current publication, but not in the previous day's publication
- `-1`: the case is not a death in the current publication, but was a death in the previous day's publication
- `-9`: the case is a death in neither the current publication nor the previous day's publication
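To make the flag values easier to read while exploring the data, the codes can be mapped to short labels. This is a small base-R sketch: the English labels are my own paraphrase of the documentation quoted above, and `label_flag` is a hypothetical helper, not part of any library:

```r
# Labels paraphrased from the column documentation; the wording is an assumption.
neuer_fall_labels <- c(
  "0"  = "in current and previous publication",
  "1"  = "only in current publication",
  "-1" = "only in previous publication"
)
neuer_todesfall_labels <- c(
  "0"  = "death in current and previous publication",
  "1"  = "death only in current publication",
  "-1" = "death only in previous publication",
  "-9" = "death in neither publication"
)

# Hypothetical helper: look up one label per code; unknown codes become NA.
label_flag <- function(codes, labels) {
  unname(labels[as.character(codes)])
}

label_flag(c(-1, 0, 1), neuer_fall_labels)
label_flag(c(-9, -1), neuer_todesfall_labels)
```

Read this way, the `-9` in `NeuerTodesfall` on most of the rows listed earlier simply means "no death, neither today nor yesterday".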
While studying this, I also had to learn how to come up with the total number of diagnosed people in Germany (translated from German):
This gives: the number of cases in the current publication as sum(AnzahlFall) where NeuerFall in (0,1); the delta to the previous day as sum(AnzahlFall) where NeuerFall in (-1,1)
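The two sums quoted above can be sketched in base R as follows. `summarise_cases` is a hypothetical helper, and the toy data frame is made up for illustration, not real RKI numbers:

```r
# Hypothetical helper implementing the two sums from the documentation quote.
summarise_cases <- function(df) {
  list(
    # Cases in the current publication: rows flagged 0 or 1.
    total = sum(df$AnzahlFall[df$NeuerFall %in% c(0, 1)]),
    # Change against the previous day: rows flagged -1 or 1.
    delta = sum(df$AnzahlFall[df$NeuerFall %in% c(-1, 1)])
  )
}

# Toy data, not real RKI numbers.
df <- data.frame(
  AnzahlFall = c(5, 2, -1, 3),
  NeuerFall  = c(0, 1, -1, 0)
)

res <- summarise_cases(df)
res$total  # 5 + 2 + 3 = 10
res$delta  # 2 + (-1) = 1
```

Note that a plain `sum(df$AnzahlFall)` over all rows gives 9 here, which is not the current total, so the `NeuerFall` filter matters.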
OK - together with the following information all of this starts to make more sense. I will try to patch our function immediately.
- AnzahlFall: number of cases in the respective group
- AnzahlTodesfall: number of deaths in the respective group
Negative values arise when, for example, a correction to previous days is made (certain cases may still be corrected after the fact). The negative values are therefore necessary to arrive at a correct total.
Please see #24
I already merged, because the result seems to fit now, but please check.
I think this can be closed now.
Hey, I am using this wonderful library in a project of mine. I effectively use it to update the RKI numbers for some of my studies of interest. As of today, I get `NA` values in the column `AnzahlTodesfall` when downloading the most recent numbers.
Here is my code
And here is my sessionInfo:
I discovered it when rereading the `csv` into an `.Rmd` file.
NAs.zip
Here are the 2 rows that contain `NA`s.
Fun fact: the direct download of the data from the RKI website does not have this issue. :/