damson-mj closed this issue 4 years ago
The structure of the dataset changed again. I tried to fix it quickly for you @damson-mj - it should work again now. I will have to add the new columns as well.
So the discussion here makes me believe that it may be better to give this some time to settle. I will take a closer look at the new columns tomorrow. PRs are welcome as always btw.
Hi @nevrome, thank you so much! Another thing I noticed when using the data: the active cases per Bundesland are all rising, although the official RKI data for the whole of Germany seems to have been falling for weeks now. Do you have an explanation for that?
This data does not have a column for the active cases. So you calculate that by subtracting the total recovered from the total ill, right? Could you share the code that led you to this conclusion? If there is an error, it's either because your code is wrong, our code is wrong, or the dataset changed in unpredictable ways.
Yeah, and I also subtracted the number of dead. Here is my code:
library(ggplot2)

data <- covid19germany::get_RKI_timeseries()
Bundeslanddaten <- covid19germany::group_RKI_timeseries(data, Bundesland)
# Active cases = cumulative cases - cumulative recovered - cumulative deaths
Bundeslanddaten$Active <- Bundeslanddaten$CumNumberTestedIll - Bundeslanddaten$CumNumberRecovered - Bundeslanddaten$CumNumberDead
ggplot(Bundeslanddaten, aes(x = Date, y = Active, group = Bundesland, colour = Bundesland)) +
  geom_line() +
  geom_point()
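For reference, here is the same subtraction on a tiny made-up table, so the arithmetic can be checked by hand. This is only a sketch: the numbers and the two placeholder states "A" and "B" are invented, and only the column names are taken from the snippet above.

```r
# Toy data: invented numbers, column names as used in the snippet above
toy <- data.frame(
  Bundesland = c("A", "A", "B", "B"),
  Date = as.Date(c("2020-04-05", "2020-04-06", "2020-04-05", "2020-04-06")),
  CumNumberTestedIll = c(100, 120, 50, 55),
  CumNumberRecovered = c(40, 70, 10, 30),
  CumNumberDead = c(2, 3, 1, 1)
)

# Active cases = cumulative cases - cumulative recovered - cumulative deaths
toy$Active <- toy$CumNumberTestedIll - toy$CumNumberRecovered - toy$CumNumberDead
print(toy[, c("Bundesland", "Date", "Active")])
```

Any error in one of the three cumulative columns carries straight through to Active, which is why the individual series are worth checking one at a time.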
So when I run this:
library(magrittr)
rki <- covid19germany::get_RKI_timeseries()
Bundeslanddaten <- covid19germany::group_RKI_timeseries(rki, Bundesland)
Bundeslanddaten %>%
dplyr::group_by(Bundesland) %>%
dplyr::slice(c(dplyr::n())) %>%
dplyr::select(Bundesland, CumNumberTestedIll, CumNumberDead)
I get this table:
# A tibble: 16 x 3
# Groups:   Bundesland [16]
   Bundesland             CumNumberTestedIll CumNumberDead
   <chr>                               <dbl>         <dbl>
 1 Baden-Württemberg                   31609          1353
 2 Bayern                              42080          1799
 3 Berlin                               5827           147
 4 Brandenburg                          2831           113
 5 Bremen                                827            29
 6 Hamburg                              4562           155
 7 Hessen                               8304           353
 8 Mecklenburg-Vorpommern                690            17
 9 Niedersachsen                       10067           416
10 Nordrhein-Westfalen                 32687          1219
11 Rheinland-Pfalz                      6029           166
12 Saarland                             2552           131
13 Sachsen                              4561           156
14 Sachsen-Anhalt                       1549            43
15 Schleswig-Holstein                   2690           106
16 Thüringen                            2254            85
which matches the data on the RKI dashboard.
So what are you comparing this data with, and where could the deviation come from?
I think there is something wrong with the CumNumberRecovered series.
If I add the numbers for all Bundesländer for 20-04-06 for example I get a total of 95762 recovered in Germany. The correct number is under 30000 according to media reports. (I also checked the numbers for the ill and dead - they seem to be correct for this date).
Hm. So this code here gives back the correct total number, 123545 for today, I think:
Bundeslanddaten %>% dplyr::group_by(Bundesland) %>% dplyr::slice(c(dplyr::n())) %$% sum(CumNumberRecovered)
Can you share your code to reproduce the wrong number? Sorry - I think I'm a little slow today
ok, here it is:
data <- covid19germany::get_RKI_timeseries()
Bundeslanddaten <- covid19germany::group_RKI_timeseries(data, Bundesland)
# Sum CumNumberRecovered (column 8) over all Bundesländer for a single day;
# the dates are compared as "YYYY-MM-DD" strings
colSums(subset(Bundeslanddaten, format(Bundeslanddaten$Date, "%Y-%m-%d") == "2020-04-06")["CumNumberRecovered"])
colSums(subset(Bundeslanddaten, format(Bundeslanddaten$Date, "%Y-%m-%d") == "2020-04-29")["CumNumberRecovered"])
This gives me a number of 123545 recovered for yesterday (which is correct) and a number of 95762 recovered for April 6th, which is wrong according to this source: https://www.welt.de/vermischtes/article207628089/RKI-zu-Corona-Die-Uebersterblichkeit-steigt-in-Deutschland.html (it should be 28700)
I see the issue now, I believe. CumNumberRecovered is the number of people from a particular day (that is, counted under the date their case is recorded) who have recovered since, whereas welt.de plots the number of people who had recovered on a particular day. Is there a data transformation to get the latter from the former? Does this make sense? I should take a look at this with fresh eyes tomorrow, but right now I think there is no error in the package code.
library(magrittr)
library(ggplot2)

data <- covid19germany::get_RKI_timeseries()

# Sum the daily counts over all of Germany, then accumulate them over time
hu <- data %>%
  dplyr::group_by(Date) %>%
  dplyr::summarise(
    NumberNewTestedIll = sum(NumberNewTestedIll, na.rm = TRUE),
    NumberNewDead = sum(NumberNewDead, na.rm = TRUE),
    NumberNewRecovered = sum(NumberNewRecovered, na.rm = TRUE)
  ) %>%
  dplyr::mutate(
    cum_testedill = cumsum(NumberNewTestedIll),
    cum_dead = cumsum(NumberNewDead),
    cum_recovered = cumsum(NumberNewRecovered)
  ) %>%
  tidyr::pivot_longer(cols = c("cum_testedill", "cum_dead", "cum_recovered"))

# One line per cumulative series
hu %>%
  ggplot() +
  geom_line(aes(Date, value, color = name))
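To make the distinction concrete, here is a small sketch with four made-up cases. It assumes that a recovery is booked under the day the case was reported, which is my reading of the data and not something confirmed in this thread. The data frame and its columns (reported, recovered) are invented for illustration.

```r
# Four invented cases: when each was reported and when the person recovered
cases <- data.frame(
  reported  = as.Date(c("2020-04-01", "2020-04-01", "2020-04-02", "2020-04-03")),
  recovered = as.Date(c("2020-04-15", "2020-04-16", "2020-04-17", "2020-04-20"))
)
days <- seq(as.Date("2020-04-01"), as.Date("2020-04-20"), by = "day")

# Cumulative recovered when each recovery is booked under the report date
cum_by_report <- vapply(
  seq_along(days), function(i) sum(cases$reported <= days[i]), integer(1)
)
# Cumulative recovered when each recovery is booked under the recovery date
cum_by_recovery <- vapply(
  seq_along(days), function(i) sum(cases$recovered <= days[i]), integer(1)
)

print(data.frame(day = days, cum_by_report, cum_by_recovery))
```

Both series end at the same total, but the report-date series reaches it much earlier. That would fit the pattern above: the value for the latest day agrees with the media figure, while the value for April 6th is far higher.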
For the record: I also added the new column StartOfDiseaseDate now. The new column Altersgruppe2 does not contain info yet.
This is totally awesome. I wanted to ask about that given that the RKI uses exactly that for modelling. Thanks for including it now!
I get the following error:
covid19germany::get_RKI_timeseries()
Downloading file...
Warning: 123396 parsing failures.
row col        expected  actual              file
  1 Meldedatum date like 2020/03/14 00:00:00 'https://opendata.arcgis.com/datasets/dd4580c810204019a7b8eb3e0b329dd6_0.csv'
  2 Meldedatum date like 2020/03/19 00:00:00 'https://opendata.arcgis.com/datasets/dd4580c810204019a7b8eb3e0b329dd6_0.csv'
  3 Meldedatum date like 2020/03/19 00:00:00 'https://opendata.arcgis.com/datasets/dd4580c810204019a7b8eb3e0b329dd6_0.csv'
  4 Meldedatum date like 2020/03/21 00:00:00 'https://opendata.arcgis.com/datasets/dd4580c810204019a7b8eb3e0b329dd6_0.csv'
  5 Meldedatum date like 2020/03/27 00:00:00 'https://opendata.arcgis.com/datasets/dd4580c810204019a7b8eb3e0b329dd6_0.csv'
... .......... ......... ................... .............................................................................
See problems(...) for more details.

Error: Column `ObjectId` not found in `.data`
Run `rlang::last_error()` to see where the error occurred.
In addition: Warning message:
The following named parsers don't match the column names: ObjectId
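A quick way to narrow down this kind of failure is to compare the header of the downloaded CSV with the columns the parser expects, and to check that the date format in the warning can be parsed explicitly. A minimal base R sketch follows. The CSV text and the expected_cols vector are invented for illustration and are not the package's actual code or the real file header.

```r
# Invented two-line CSV standing in for the downloaded file
csv_text <- "Meldedatum,AnzahlFall\n2020/03/14 00:00:00,3\n"
# Columns the parser is assumed to expect (hypothetical list)
expected_cols <- c("ObjectId", "Meldedatum", "AnzahlFall")

# Read only the first row to get the column names actually present
header <- names(read.csv(text = csv_text, nrows = 1, check.names = FALSE))
missing_cols <- setdiff(expected_cols, header)
if (length(missing_cols) > 0) {
  message("Columns missing from the download: ", paste(missing_cols, collapse = ", "))
}

# Parse the date-time format shown in the warning with an explicit format string
parsed <- as.POSIXct("2020/03/14 00:00:00", format = "%Y/%m/%d %H:%M:%S", tz = "UTC")
print(parsed)
```

If missing_cols is non-empty, the upstream file has changed and the column specification needs updating before anything else is worth debugging.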