nevrome / covid19germany

R package - Load, visualise and analyse daily updated data on the COVID-19 outbreak in Germany

NA in `AnzahlTodesfall` today #18

Closed psteinb closed 4 years ago

psteinb commented 4 years ago

Hey, I am using this wonderful library in a project of mine. I effectively use it to update the RKI numbers for some of my studies of interest. As of today, I get NA values in the column AnzahlTodesfall when downloading the most recent numbers.

Here is my code

#from https://github.com/nevrome/covid19germany
library(covid19germany)

#from cran
library(optparse)
library(readr)
library(dplyr)

## DEFINING COMMAND LINE INTERFACE
option_list <- list(
  make_option(c('-o','--output'),
              default='RKI_COVID19.csv',
              help='output file RKI data [default %default]')
)
# build the parser once so that print_help() shows the options defined above
parser <- OptionParser(option_list=option_list)
opts <- parse_args(parser)

if (is.null(opts$output)){
  print_help(parser)
  stop("At least one argument must be supplied (output file).\n", call.=FALSE)
}

df = covid19germany::get_RKI_timeseries()
glimpse(df)
write.csv(df,opts$output)

And here is my sessionInfo:

> sessionInfo()
R version 3.6.3 (2020-02-29)
Platform: x86_64-redhat-linux-gnu (64-bit)
Running under: Fedora 31 (Workstation Edition)

Matrix products: default
BLAS/LAPACK: /usr/lib64/R/lib/libRblas.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
[1] compiler_3.6.3

I discovered it when rereading the csv into an .Rmd file.

NAs.zip

Here are the 2 rows that contain NAs.

##       X IdBundesland        Bundesland       Landkreis Altersgruppe Geschlecht
## 1   194            3     Niedersachsen Region Hannover      A60-A79          M
## 2 15645            8 Baden-Württemberg  LK Ludwigsburg         A80+          W
##   AnzahlFall AnzahlTodesfall ObjectId Meldedatum IdLandkreis Datenstand
## 1         NA              NA   404828       <NA>        3241 2020-03-27
## 2         NA              NA   412279       <NA>        8118 2020-03-27
##   NeuerFall NeuerTodesfall Meldedatum_Epoch_days Meldedatum_Epoch
## 1        NA             NA                    NA               NA
## 2        NA             NA                    NA               NA

Fun fact: the direct download of the data from the RKI website does not have this issue. :/

nevrome commented 4 years ago

This is expected behaviour. This is what these entries look like in the csv:

3 | Niedersachsen | Region Hannover | A60-A79 | M | -1 | -1 | 404828 | 2020-03-19T00:00:00.000Z | 3241 | 27.03.2020 00:00 | -1 | -1

8 | Baden-Württemberg | LK Ludwigsburg | A80+ | W | -1 | -1 | 412279 | 2020-03-22T00:00:00.000Z | 8118 | 27.03.2020 00:00 | -1 | -1

In the columns AnzahlFall and AnzahlTodesfall we have -1 and that equals afaik NA. Is this wrong or am I missing something?

psteinb commented 4 years ago

Good point. I just checked again in the downloaded file. Indeed, the -1 is there. Then I have to check why this doesn't pop up in my validation checks.

Any idea what that means?

nevrome commented 4 years ago

What are you validating and how?

psteinb commented 4 years ago

Please don't interpret this too much, I'm referring to a simple:

if (nrow(df %>% filter(is.na(AnzahlTodesfall))) > 0) {
  stop("...")
}
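A check like this only catches values that are already NA. As a minimal sketch, the same idea can be applied to the raw counts so that negative values are flagged as well. It assumes a data frame with the RKI column names used in this thread; the helper name `find_negative_counts`, the toy rows and the ObjectId `1` are made up for illustration and are not part of covid19germany.

```r
# Hypothetical helper (not part of covid19germany): return the rows whose
# case or death counts are negative or missing.
find_negative_counts <- function(df) {
  bad <- is.na(df$AnzahlFall) | df$AnzahlFall < 0 |
    is.na(df$AnzahlTodesfall) | df$AnzahlTodesfall < 0
  df[bad, , drop = FALSE]
}

# Toy rows modelled on the values quoted in this thread, not real RKI data.
toy <- data.frame(
  ObjectId        = c(404722, 404828, 1),
  AnzahlFall      = c(-1, -1, 3),
  AnzahlTodesfall = c(0, -1, 0)
)

suspicious <- find_negative_counts(toy)
if (nrow(suspicious) > 0) {
  message(nrow(suspicious), " row(s) with negative or missing counts")
}
```

Flagging with `message()` rather than `stop()` keeps the script running; whether such rows should abort the run depends on what the negative values turn out to mean.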

Are you in contact with RKI by any chance to report these rows with -1?

nevrome commented 4 years ago

These -1 rows don't have to be a mistake. My understanding is that they indicate explicit null-observations, which is fine.

psteinb commented 4 years ago

I don't understand, then why not use 0? And why are these -1 in both columns:

| Bundesland        | Landkreis       | Altersgruppe | AnzahlFall | AnzahlTodesfall | ObjectId |                Meldedatum | NeuerFall | NeuerTodesfall |
| ----------------- | --------------- | ------------ | ---------- | --------------- | -------- | ------------------------- | --------- | -------------- |
| Niedersachsen     | Region Hannover | A60-A79      |         -1 |              -1 |  404,828 | 2020-03-19 00:00:00+00:00 |        -1 |             -1 |
| Baden-Württemberg | LK Ludwigsburg  | A80+         |         -1 |              -1 |  412,279 | 2020-03-22 00:00:00+00:00 |        -1 |             -1 |

And there is more if you look at rows with AnzahlFall < 0

$ csvgrep -c6 -m '-1' RKI_COVID19.csv|csvcut -c2-4,6-8,12-13|csvlook
| Bundesland          | Landkreis                            | Altersgruppe | AnzahlFall | AnzahlTodesfall | ObjectId | NeuerFall | NeuerTodesfall |
| ------------------- | ------------------------------------ | ------------ | ---------- | --------------- | -------- | --------- | -------------- |
| Niedersachsen       | Region Hannover                      | A00-A04      |         -1 |               0 |  404,722 |        -1 |             -9 |
| Niedersachsen       | Region Hannover                      | A60-A79      |         -1 |              -1 |  404,828 |        -1 |             -1 |
| Nordrhein-Westfalen | SK Bonn                              | A60-A79      |         -1 |               0 |  406,917 |        -1 |             -9 |
| Nordrhein-Westfalen | LK Heinsberg                         | A60-A79      |         -1 |               0 |  407,578 |        -1 |             -9 |
| Schleswig-Holstein  | LK Rendsburg-Eckernförde             | A60-A79      |         -1 |               0 |  404,020 |        -1 |             -9 |
| Hamburg             | SK Hamburg                           | A15-A34      |         -1 |               0 |  404,244 |        -1 |             -9 |
| Hamburg             | SK Hamburg                           | A15-A34      |         -1 |               0 |  404,275 |        -1 |             -9 |
| Hamburg             | SK Hamburg                           | A35-A59      |         -1 |               0 |  404,305 |        -1 |             -9 |
| Hamburg             | SK Hamburg                           | A35-A59      |         -1 |               0 |  404,337 |        -1 |             -9 |
| Nordrhein-Westfalen | SK Münster                           | A35-A59      |         -1 |               0 |  407,953 |        -1 |             -9 |
| Nordrhein-Westfalen | LK Borken                            | A80+         |         -1 |               0 |  408,109 |        -1 |             -9 |
| Nordrhein-Westfalen | LK Coesfeld                          | unbekannt    |         -1 |               0 |  408,189 |        -1 |             -9 |
| Hessen              | LK Waldeck-Frankenberg               | A35-A59      |         -1 |               0 |  410,447 |        -1 |             -9 |
| Baden-Württemberg   | LK Breisgau-Hochschwarzwald          | A80+         |         -1 |               0 |  413,797 |        -1 |             -9 |
| Baden-Württemberg   | LK Breisgau-Hochschwarzwald          | unbekannt    |         -1 |               0 |  413,800 |        -1 |             -9 |
| Baden-Württemberg   | LK Tübingen                          | A35-A59      |         -1 |               0 |  414,407 |        -1 |             -9 |
| Niedersachsen       | LK Osnabrück                         | A60-A79      |         -1 |               0 |  405,814 |        -1 |             -9 |
| Bremen              | SK Bremen                            | A35-A59      |         -1 |               0 |  405,945 |        -1 |             -9 |
| Nordrhein-Westfalen | SK Düsseldorf                        | A35-A59      |         -1 |               0 |  406,043 |        -1 |             -9 |
| Nordrhein-Westfalen | LK Minden-Lübbecke                   | A15-A34      |         -1 |               0 |  408,693 |        -1 |             -9 |
| Hessen              | SK Offenbach                         | A15-A34      |         -1 |               0 |  409,453 |        -1 |             -9 |
| Hessen              | LK Bergstraße                        | unbekannt    |         -1 |               0 |  409,560 |        -1 |             -9 |
| Bayern              | LK Bad Tölz-Wolfratshausen           | A15-A34      |         -1 |               0 |  415,265 |        -1 |             -9 |
| Bayern              | LK Ansbach                           | A60-A79      |         -1 |               0 |  418,019 |        -1 |             -9 |
| Rheinland-Pfalz     | SK Ludwigshafen                      | A15-A34      |         -1 |               0 |  411,169 |        -1 |             -9 |
| Sachsen             | SK Dresden                           | A35-A59      |         -1 |               0 |  421,246 |        -1 |             -9 |
| Berlin              | SK Berlin Mitte                      | unbekannt    |         -1 |               0 |  419,659 |        -1 |             -9 |
| Berlin              | SK Berlin Friedrichshain-Kreuzberg   | A15-A34      |         -1 |               0 |  419,678 |        -1 |             -9 |
| Berlin              | SK Berlin Friedrichshain-Kreuzberg   | A35-A59      |         -1 |               0 |  419,725 |        -1 |             -9 |
| Berlin              | SK Berlin Friedrichshain-Kreuzberg   | A35-A59      |         -1 |               0 |  419,727 |        -1 |             -9 |
| Berlin              | SK Berlin Pankow                     | A15-A34      |         -1 |               0 |  419,753 |        -1 |             -9 |
| Berlin              | SK Berlin Charlottenburg-Wilmersdorf | A60-A79      |         -1 |               0 |  419,897 |        -1 |             -9 |
| Baden-Württemberg   | LK Ludwigsburg                       | A00-A04      |         -1 |               0 |  412,151 |        -1 |             -9 |
| Baden-Württemberg   | LK Ludwigsburg                       | A05-A14      |         -1 |               0 |  412,154 |        -1 |             -9 |
| Baden-Württemberg   | LK Ludwigsburg                       | A15-A34      |         -1 |               0 |  412,179 |        -1 |             -9 |
| Baden-Württemberg   | LK Ludwigsburg                       | A80+         |         -1 |              -1 |  412,279 |        -1 |             -1 |
| Berlin              | SK Berlin Mitte                      | A15-A34      |         -1 |               0 |  419,590 |        -1 |             -9 |

Looks like a mistake to me.

nevrome commented 4 years ago

Hm. Indeed - what's the difference between 0 and -1?

I don't want to take up the limited resources of the RKI to figure this out though. I'm sure they get a lot of emails right now.

Is this issue limited to older dates at the beginning of the epidemic?

psteinb commented 4 years ago

Nope.

$ csvgrep -c6 -m '-1' RKI_COVID19.csv|csvcut -c2-3,6-9,12-13|csvlook
| Bundesland          | Landkreis                            | AnzahlFall | AnzahlTodesfall | ObjectId |                Meldedatum | NeuerFall | NeuerTodesfall |
| ------------------- | ------------------------------------ | ---------- | --------------- | -------- | ------------------------- | --------- | -------------- |
| Niedersachsen       | Region Hannover                      |         -1 |               0 |  404,722 | 2020-03-24 00:00:00+00:00 |        -1 |             -9 |
| Niedersachsen       | Region Hannover                      |         -1 |              -1 |  404,828 | 2020-03-19 00:00:00+00:00 |        -1 |             -1 |
| Nordrhein-Westfalen | SK Bonn                              |         -1 |               0 |  406,917 | 2020-03-20 00:00:00+00:00 |        -1 |             -9 |
| Nordrhein-Westfalen | LK Heinsberg                         |         -1 |               0 |  407,578 | 2020-03-22 00:00:00+00:00 |        -1 |             -9 |
| Schleswig-Holstein  | LK Rendsburg-Eckernförde             |         -1 |               0 |  404,020 | 2020-03-16 00:00:00+00:00 |        -1 |             -9 |
| Hamburg             | SK Hamburg                           |         -1 |               0 |  404,244 | 2020-03-18 00:00:00+00:00 |        -1 |             -9 |
| Hamburg             | SK Hamburg                           |         -1 |               0 |  404,275 | 2020-03-20 00:00:00+00:00 |        -1 |             -9 |
| Hamburg             | SK Hamburg                           |         -1 |               0 |  404,305 | 2020-03-20 00:00:00+00:00 |        -1 |             -9 |
| Hamburg             | SK Hamburg                           |         -1 |               0 |  404,337 | 2020-03-21 00:00:00+00:00 |        -1 |             -9 |
| Nordrhein-Westfalen | SK Münster                           |         -1 |               0 |  407,953 | 2020-03-24 00:00:00+00:00 |        -1 |             -9 |
| Nordrhein-Westfalen | LK Borken                            |         -1 |               0 |  408,109 | 2020-03-21 00:00:00+00:00 |        -1 |             -9 |
| Nordrhein-Westfalen | LK Coesfeld                          |         -1 |               0 |  408,189 | 2020-03-24 00:00:00+00:00 |        -1 |             -9 |
| Hessen              | LK Waldeck-Frankenberg               |         -1 |               0 |  410,447 | 2020-03-20 00:00:00+00:00 |        -1 |             -9 |
| Baden-Württemberg   | LK Breisgau-Hochschwarzwald          |         -1 |               0 |  413,797 | 2020-03-23 00:00:00+00:00 |        -1 |             -9 |
| Baden-Württemberg   | LK Breisgau-Hochschwarzwald          |         -1 |               0 |  413,800 | 2020-03-24 00:00:00+00:00 |        -1 |             -9 |
| Baden-Württemberg   | LK Tübingen                          |         -1 |               0 |  414,407 | 2020-03-23 00:00:00+00:00 |        -1 |             -9 |
| Niedersachsen       | LK Osnabrück                         |         -1 |               0 |  405,814 | 2020-03-20 00:00:00+00:00 |        -1 |             -9 |
| Bremen              | SK Bremen                            |         -1 |               0 |  405,945 | 2020-03-01 00:00:00+00:00 |        -1 |             -9 |
| Nordrhein-Westfalen | SK Düsseldorf                        |         -1 |               0 |  406,043 | 2020-03-23 00:00:00+00:00 |        -1 |             -9 |
| Nordrhein-Westfalen | LK Minden-Lübbecke                   |         -1 |               0 |  408,693 | 2020-03-15 00:00:00+00:00 |        -1 |             -9 |
| Hessen              | SK Offenbach                         |         -1 |               0 |  409,453 | 2020-03-25 00:00:00+00:00 |        -1 |             -9 |
| Hessen              | LK Bergstraße                        |         -1 |               0 |  409,560 | 2020-03-25 00:00:00+00:00 |        -1 |             -9 |
| Bayern              | LK Bad Tölz-Wolfratshausen           |         -1 |               0 |  415,265 | 2020-03-20 00:00:00+00:00 |        -1 |             -9 |
| Bayern              | LK Ansbach                           |         -1 |               0 |  418,019 | 2020-03-25 00:00:00+00:00 |        -1 |             -9 |
| Rheinland-Pfalz     | SK Ludwigshafen                      |         -1 |               0 |  411,169 | 2020-03-23 00:00:00+00:00 |        -1 |             -9 |
| Sachsen             | SK Dresden                           |         -1 |               0 |  421,246 | 2020-03-23 00:00:00+00:00 |        -1 |             -9 |
| Berlin              | SK Berlin Mitte                      |         -1 |               0 |  419,659 | 2020-03-19 00:00:00+00:00 |        -1 |             -9 |
| Berlin              | SK Berlin Friedrichshain-Kreuzberg   |         -1 |               0 |  419,678 | 2020-03-25 00:00:00+00:00 |        -1 |             -9 |
| Berlin              | SK Berlin Friedrichshain-Kreuzberg   |         -1 |               0 |  419,725 | 2020-03-24 00:00:00+00:00 |        -1 |             -9 |
| Berlin              | SK Berlin Friedrichshain-Kreuzberg   |         -1 |               0 |  419,727 | 2020-03-25 00:00:00+00:00 |        -1 |             -9 |
| Berlin              | SK Berlin Pankow                     |         -1 |               0 |  419,753 | 2020-03-25 00:00:00+00:00 |        -1 |             -9 |
| Berlin              | SK Berlin Charlottenburg-Wilmersdorf |         -1 |               0 |  419,897 | 2020-03-24 00:00:00+00:00 |        -1 |             -9 |
| Baden-Württemberg   | LK Ludwigsburg                       |         -1 |               0 |  412,151 | 2020-03-22 00:00:00+00:00 |        -1 |             -9 |
| Baden-Württemberg   | LK Ludwigsburg                       |         -1 |               0 |  412,154 | 2020-03-17 00:00:00+00:00 |        -1 |             -9 |
| Baden-Württemberg   | LK Ludwigsburg                       |         -1 |               0 |  412,179 | 2020-03-13 00:00:00+00:00 |        -1 |             -9 |
| Baden-Württemberg   | LK Ludwigsburg                       |         -1 |              -1 |  412,279 | 2020-03-22 00:00:00+00:00 |        -1 |             -1 |
| Berlin              | SK Berlin Mitte                      |         -1 |               0 |  419,590 | 2020-03-24 00:00:00+00:00 |        -1 |             -9 |

psteinb commented 4 years ago

Regarding taking up resources of the RKI: If the data is wrong, it needs to be fixed. Guess what happens if that data is used to model the effect of political decisions.

nevrome commented 4 years ago

Ok - fair enough. You write an email then? We're getting the data from here, but it's not immediately clear from this page who to contact.

psteinb commented 4 years ago

Yap, already did so at 6:15pm. ;-)

nevrome commented 4 years ago

Did you get an answer, @psteinb?

psteinb commented 4 years ago

Nope, not yet. I'll wait until Wednesday.

Adding to the picture: https://danielgerber.eu/2020/03/22/corona-zahlen-in-sachsen/ The RKI data appears to differ from the local data anyhow. Not sure why; this could be the German federal system at play.

nevrome commented 4 years ago

I see. Thank you.

Yes - there seems to be some confusion about the numbers right now. For the moment we can only trust the RKI data. I don't want to add 400 functions to cover every Landkreis separately (@mlange-42).

psteinb commented 4 years ago

I wrote a message to the BKG, which is (according to the arcgis page) partly responsible for the data as well. See also https://www.bkg.bund.de/DE/Ueber-das-BKG/Dienstleistungszentrum/OpenData/OpenData.html

psteinb commented 4 years ago

OK, BKG said they are not responsible and suggested sending a message to ESRI. I have to find out whom to address there. At this point, I am a bit lost between all these federal institutions.

psteinb commented 4 years ago

Bottom of the arcgis.com page mentioned above leads to this URL: https://www.esri.de/ueber-uns/kontakt Why didn’t I see that earlier?

nevrome commented 4 years ago

Well - let's see where this leads. Thank you for getting to the bottom of this!

psteinb commented 4 years ago

ESRI has commented today about this. They mentioned that all columns are documented online in the meta data view. https://www.arcgis.com/home/item.html?id=dd4580c810204019a7b8eb3e0b329dd6 As of today, this page discusses negative values for columns NeuerFall and NeuerTodesfall only:

NeuerFall:

0: case is included in the publication for the current day and in the one for the previous day
1: case is included only in the current publication
-1: case is included only in the previous day's publication

and

NeuerTodesfall:

0: case is a death both in the publication for the current day and in the one for the previous day
1: case is a death in the current publication, but not in the previous day's publication
-1: case is not a death in the current publication, but was a death in the previous day's publication
-9: case is a death neither in the current publication nor in the previous day's publication

While studying this, I also had to learn how to come up with the total amount of diagnosed people in Germany:

this yields: number of cases in the current publication as sum(AnzahlFall) where NeuerFall in (0,1); delta to the previous day as sum(AnzahlFall) where NeuerFall in (-1,1)
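The rule quoted above can be sketched in base R as follows. The three rows are toy values chosen for illustration, not real RKI data: one group present in both publications, one new case, and one case that was withdrawn since the previous day.

```r
# Toy rows, not real RKI data.
rki <- data.frame(
  AnzahlFall = c(2, 1, -1),
  NeuerFall  = c(0, 1, -1)
)

# Cases in the current publication: rows flagged 0 or 1.
total_cases <- sum(rki$AnzahlFall[rki$NeuerFall %in% c(0, 1)])

# Change relative to the previous day: rows flagged -1 or 1.
delta_cases <- sum(rki$AnzahlFall[rki$NeuerFall %in% c(-1, 1)])

total_cases  # 3
delta_cases  # 0
```

In this toy example the previous day's total was also 3 (the two unchanged cases plus the one later withdrawn), so a delta of 0 is consistent: one case was added and one removed.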

nevrome commented 4 years ago

OK - together with the following information all of this starts to make more sense. I will try to patch our function immediately.

  • AnzahlFall: number of cases in the respective group
  • AnzahlTodesfall: number of deaths in the respective group

Negative values arise when, for example, a correction to previous days is made (certain cases may still be corrected after the fact). The negative values are therefore necessary to arrive at a correct total.
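To illustrate why this matters for the NA values at the top of this thread, here is a small sketch with toy numbers (not real RKI data). It assumes that negative counts get recoded to NA before summing, which is only a guess at what produced those NAs and not a description of the package code.

```r
# Toy values: two new cases and one correction that withdraws a case.
anzahl_fall <- c(1, 1, -1)
neuer_fall  <- c(1, 1, -1)

# Rows that enter the day-to-day delta according to the RKI rule.
in_delta <- neuer_fall %in% c(-1, 1)

# Keeping the negative value: two added, one withdrawn.
delta_raw <- sum(anzahl_fall[in_delta])

# Assumed recoding of negative counts to NA, then summing with na.rm = TRUE.
anzahl_na <- ifelse(anzahl_fall < 0, NA, anzahl_fall)
delta_na  <- sum(anzahl_na[in_delta], na.rm = TRUE)

delta_raw  # 1
delta_na   # 2
```

Dropping the correction overstates the increase by exactly the size of the withdrawn count, which is why the negative rows have to stay in the data.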

nevrome commented 4 years ago

Please see #24

I already merged, because the result seems to fit now, but please check.

psteinb commented 4 years ago

I think this can be closed now.