nevrome / covid19germany

R package - Load, visualise and analyse daily updated data on the COVID-19 outbreak in Germany

Error: Column `ObjectId` not found in `.data` #28

Closed damson-mj closed 4 years ago

damson-mj commented 4 years ago

I get the following error:

covid19germany::get_RKI_timeseries()
Downloading file...
Warning: 123396 parsing failures.
row col        expected  actual              file
  1 Meldedatum date like 2020/03/14 00:00:00 'https://opendata.arcgis.com/datasets/dd4580c810204019a7b8eb3e0b329dd6_0.csv'
  2 Meldedatum date like 2020/03/19 00:00:00 'https://opendata.arcgis.com/datasets/dd4580c810204019a7b8eb3e0b329dd6_0.csv'
  3 Meldedatum date like 2020/03/19 00:00:00 'https://opendata.arcgis.com/datasets/dd4580c810204019a7b8eb3e0b329dd6_0.csv'
  4 Meldedatum date like 2020/03/21 00:00:00 'https://opendata.arcgis.com/datasets/dd4580c810204019a7b8eb3e0b329dd6_0.csv'
  5 Meldedatum date like 2020/03/27 00:00:00 'https://opendata.arcgis.com/datasets/dd4580c810204019a7b8eb3e0b329dd6_0.csv'
... .......... .......... ................... .............................................................................
See problems(...) for more details.

Error: Column `ObjectId` not found in `.data`
Run `rlang::last_error()` to see where the error occurred.
In addition: Warning message:
The following named parsers don't match the column names: ObjectId

nevrome commented 4 years ago

The structure of the dataset changed again. I tried to fix it quickly for you @damson-mj - should work now again. I will have to add the new columns as well.
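For readers who hit similar breakage when an upstream file changes shape, here is a minimal sketch of checking for required columns before any further processing. It is not the fix that was applied in the package; `check_columns` is a hypothetical helper, and the sample CSV (including the `FID` column) is made up for illustration.

```r
# Hypothetical helper, not part of covid19germany: fail early with a clear
# message when an upstream CSV no longer contains the columns we rely on.
check_columns <- function(df, required) {
  missing <- setdiff(required, names(df))
  if (length(missing) > 0) {
    stop("Missing expected column(s): ", paste(missing, collapse = ", "))
  }
  invisible(df)
}

# Made-up sample standing in for the downloaded file; it lacks ObjectId.
csv_text <- "FID,Bundesland,Meldedatum\n1,Bayern,2020/03/14 00:00:00"
raw <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# Capture the error message instead of aborting, so the result can be inspected.
result <- tryCatch(
  check_columns(raw, c("ObjectId", "Meldedatum")),
  error = function(e) conditionMessage(e)
)
print(result)
```

A check like this turns a confusing downstream failure into a message that names the missing column directly.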

nevrome commented 4 years ago

So the discussion here makes me believe that it may be better to give this some time to settle. I will take a closer look at the new columns tomorrow. PRs are welcome as always btw.

damson-mj commented 4 years ago

Hi @nevrome, thank you so much! Another thing I noticed when using the data: the active cases per Bundesland are all rising, although the official RKI figures for Germany as a whole seem to have been falling for weeks now. Do you have an explanation for that?

nevrome commented 4 years ago

This data does not have a column for the active cases. So you calculate that by subtracting the total recovered from the total ill, right? Could you share the code that led you to this conclusion? If there is an error, it's either because your code is wrong, our code is wrong, or the dataset changed in unpredictable ways.

damson-mj commented 4 years ago

Yeah, and I also subtracted the number of dead. Here is my code:

library(ggplot2)

data <- covid19germany::get_RKI_timeseries()
Bundeslanddaten <- covid19germany::group_RKI_timeseries(data, Bundesland)

# Active cases = cumulative confirmed cases minus recovered minus dead
Bundeslanddaten$Active <- Bundeslanddaten$CumNumberTestedIll - Bundeslanddaten$CumNumberRecovered - Bundeslanddaten$CumNumberDead

# One line per Bundesland; the colour mapping is inherited by both geoms
ggplot(Bundeslanddaten, aes(x = Date, y = Active, group = Bundesland, colour = Bundesland)) +
  geom_line() +
  geom_point()
nevrome commented 4 years ago

So when I run this

library(magrittr)
rki <- covid19germany::get_RKI_timeseries()
Bundeslanddaten <- covid19germany::group_RKI_timeseries(rki, Bundesland)
Bundeslanddaten %>% 
  dplyr::group_by(Bundesland) %>% 
  dplyr::slice(c(dplyr::n())) %>% 
  dplyr::select(Bundesland, CumNumberTestedIll, CumNumberDead)

I get this table:

# A tibble: 16 x 3
# Groups:   Bundesland [16]
   Bundesland             CumNumberTestedIll CumNumberDead
   <chr>                               <dbl>         <dbl>
 1 Baden-Württemberg                   31609          1353
 2 Bayern                              42080          1799
 3 Berlin                               5827           147
 4 Brandenburg                          2831           113
 5 Bremen                                827            29
 6 Hamburg                              4562           155
 7 Hessen                               8304           353
 8 Mecklenburg-Vorpommern                690            17
 9 Niedersachsen                       10067           416
10 Nordrhein-Westfalen                 32687          1219
11 Rheinland-Pfalz                      6029           166
12 Saarland                             2552           131
13 Sachsen                              4561           156
14 Sachsen-Anhalt                       1549            43
15 Schleswig-Holstein                   2690           106
16 Thüringen                            2254            85

which is equal to the data on the RKI-Dashboard:

image

So with what are you comparing this data and where could the deviation come from?

damson-mj commented 4 years ago

I think there is something wrong with the CumNumberRecovered series.

If I add up the numbers for all Bundesländer for 2020-04-06, for example, I get a total of 95762 recovered in Germany. The correct number is under 30000 according to media reports. (I also checked the numbers for the ill and dead; they seem to be correct for this date.)

nevrome commented 4 years ago

Hm. So this code here gives back the correct total number 123545 for today, I think:

Bundeslanddaten %>% dplyr::group_by(Bundesland) %>% dplyr::slice(c(dplyr::n())) %$% sum(CumNumberRecovered)

image

Can you share your code to reproduce the wrong number? Sorry - I think I'm a little slow today

damson-mj commented 4 years ago

ok, here it is:

data<-covid19germany::get_RKI_timeseries()

Bundeslanddaten <- covid19germany::group_RKI_timeseries(data, Bundesland)

# Column 8 is selected by position; compare dates as Date objects with a four-digit year
colSums(subset(Bundeslanddaten, as.Date(Bundeslanddaten$Date) == as.Date("2020-04-06"))[8])

colSums(subset(Bundeslanddaten, as.Date(Bundeslanddaten$Date) == as.Date("2020-04-29"))[8])

This gives me a number of 123545 recovered for yesterday (which is correct) and a number of 95762 recovered for April 6th, which is wrong according to this source: https://www.welt.de/vermischtes/article207628089/RKI-zu-Corona-Die-Uebersterblichkeit-steigt-in-Deutschland.html (it should be 28700)

nevrome commented 4 years ago

I see the issue now, I believe. CumNumberRecovered for a particular day counts the people whose cases date from up to that day and who have recovered since, whereas welt.de plots the number of people who had recovered by that day. Is there a data transformation to get the latter from the former? Does this make sense? I should take a look at this with fresh eyes tomorrow, but right now I think there is no error in the package code.

data <- covid19germany::get_RKI_timeseries()

hu <- data %>%
  dplyr::group_by(
    .data[["Date"]]
  ) %>% 
  dplyr::summarise(
    NumberNewTestedIll = sum(NumberNewTestedIll, na.rm = TRUE),
    NumberNewDead = sum(NumberNewDead, na.rm = TRUE),
    NumberNewRecovered = sum(NumberNewRecovered, na.rm = TRUE)
  ) %>%
  dplyr::mutate(
    cum_testedill = cumsum(NumberNewTestedIll),
    cum_dead = cumsum(NumberNewDead),
    cum_recovered = cumsum(NumberNewRecovered)
  ) %>%
  tidyr::pivot_longer(cols = c("cum_testedill", "cum_dead", "cum_recovered"))

library(ggplot2)
hu %>%
  ggplot() +
  geom_line(aes(Date, value, color = name))

image
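The distinction described in this comment can be illustrated with a toy example. This is only a sketch under the assumption that recoveries are attributed to the date a case was reported rather than to the date the person recovered; the data frame and its column names (`report_date`, `recovery_date`) are invented and are not columns of the RKI dataset.

```r
# Toy data (invented for illustration): one row per case.
cases <- data.frame(
  report_date   = as.Date(c("2020-04-01", "2020-04-01", "2020-04-02", "2020-04-03")),
  recovery_date = as.Date(c("2020-04-15", "2020-04-20", "2020-04-16", NA))
)
day <- as.Date("2020-04-02")

# Attribution by reporting date: cases reported up to `day` that are marked
# as recovered in the current snapshot, regardless of when they recovered.
by_report_date <- sum(cases$report_date <= day & !is.na(cases$recovery_date))

# Attribution by recovery date: cases that had actually recovered by `day`.
by_recovery_date <- sum(cases$recovery_date <= day, na.rm = TRUE)

print(c(by_report_date = by_report_date, by_recovery_date = by_recovery_date))
```

Under the first attribution, the count for an early date keeps growing as later snapshots mark more of those cases as recovered, which would explain a historical figure far above what was reported in the media at the time. Getting from the first count to the second needs a per-case recovery date, which this sketch simply assumes is available.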

nevrome commented 4 years ago

For the record: I also added the new column StartOfDiseaseDate now. The new column Altersgruppe2 does not contain info yet.

stschiff commented 4 years ago

This is totally awesome. I wanted to ask about that given that the RKI uses exactly that for modelling. Thanks for including it now!