Open slawomirmatuszak opened 3 years ago
Oh - maybe the data structure provided by the RKI changed yet again? Please show me your code. Maybe I can figure out where the problem is coming from.
Code:
library(tidyverse)
library(covid19germany)
df <- get_RKI_timeseries()
max.date <- group_RKI_timeseries(df) %>%
  tail()
max.date
As you can see, the number of new cases for 3rd October is only 961. It is not a problem with your function: when I summarise the data using dplyr, I get the same results. Yesterday it was similar: there were 1300 new cases for 2nd October. But today it appears that this figure has been updated (2353). Cumulative figures appear to be correct.
There are multiple ways to calculate the number of new cases and the RKI updates their dataset for past days as well. It's a pretty confusing dataset, honestly.
If you go here and click on More, you get a description of the raw dataset and its columns (in German). get_RKI_timeseries() yields a slightly simplified version of this. I'm not sure, though, why "our" numbers for the current day lag behind the RKI data. It probably has something to do with the reported dates, where the dataset distinguishes between "Meldedatum", "Referenzdatum" and "Erkrankungsdatum".
You can download the raw version of the dataset with get_RKI_timeseries(raw_only = T). Maybe you can figure out what causes this difference. I will take a look as well.
Ha - I think I understand it now. This sentence from the German version of the daily RKI report is crucial:
"The difference from the previous day refers to cases that are transmitted to the RKI each day. This includes cases that were reported to the local health authority on the same day or already on earlier days."
That means the RKI reports what it learns from the local health authorities. This data might be from previous days, in which case it is counted towards those previous days in the "Meldedatum" column. To calculate the "Change from previous day" number that the RKI shows in the dashboard and the reports, we would need an additional column "Date when reported from the local health authorities to the RKI".
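To make the mechanism concrete, here is a minimal sketch in base R with two toy snapshots of the per-Meldedatum counts, as they might look when downloaded on two consecutive days. The values for 2nd and 3rd October are the ones mentioned in this thread; the values for 1st October and the object names are invented purely so that the example adds up to the 2279 shown by the RKI. It is an illustration of the idea, not a reconstruction of how the RKI computes its number.

```r
# Toy snapshot downloaded yesterday (1st October value is invented)
snapshot_yesterday <- data.frame(
  Meldedatum = c("Oct 01", "Oct 02"),
  AnzahlFall = c(2500, 1300)
)

# Toy snapshot downloaded today: earlier days were revised upwards
snapshot_today <- data.frame(
  Meldedatum = c("Oct 01", "Oct 02", "Oct 03"),
  AnzahlFall = c(2765, 2353, 961)
)

# Cases counted towards the latest Meldedatum in today's snapshot
latest_day <- snapshot_today$AnzahlFall[nrow(snapshot_today)]

# "Change from previous day": difference between the two cumulative totals
change_from_previous_day <-
  sum(snapshot_today$AnzahlFall) - sum(snapshot_yesterday$AnzahlFall)

# Breakdown of the newly transmitted cases by Meldedatum
comparison <- merge(
  snapshot_yesterday, snapshot_today,
  by = "Meldedatum", all = TRUE,
  suffixes = c("_yesterday", "_today")
)
# Days that did not exist in yesterday's snapshot start from zero
comparison$AnzahlFall_yesterday[is.na(comparison$AnzahlFall_yesterday)] <- 0
comparison$newly_transmitted <-
  comparison$AnzahlFall_today - comparison$AnzahlFall_yesterday

print(comparison)
print(c(latest_day = latest_day, change = change_from_previous_day))
```

In this toy setup the latest Meldedatum only accounts for 961 cases, while the difference between the two totals is 2279, because most of the newly transmitted cases are booked onto earlier days.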
I think your explanation is most likely correct. I've tried to compare your dataset, the raw data from your package and the data from arcgis. Each was aggregated by date.
`library(tidyverse)
library(lubridate)
library(covid19germany)

df <- covid19germany::get_RKI_timeseries()
grouped <- group_RKI_timeseries(df) %>%
  arrange(desc(Date)) %>%
  head()

df2 <- read_csv("https://opendata.arcgis.com/datasets/dd4580c810204019a7b8eb3e0b329dd6_0.csv")
arcgis <- df2 %>%
  mutate(Meldedatum = ymd_hms(Meldedatum)) %>%
  group_by(Meldedatum) %>%
  summarise(AnzahlFall = sum(AnzahlFall)) %>%
  mutate(AnzahlFall.cum = cumsum(AnzahlFall)) %>%
  arrange(desc(Meldedatum)) %>%
  head()

raw.data <- get_RKI_timeseries(raw_only = T)

Refdatum <- raw.data %>%
  group_by(Refdatum) %>%
  summarise(AnzahlFall = sum(AnzahlFall)) %>%
  mutate(AnzahlFall.cum = cumsum(AnzahlFall)) %>%
  arrange(desc(Refdatum)) %>%
  head()

Meldedatum <- raw.data %>%
  group_by(Meldedatum) %>%
  summarise(AnzahlFall = sum(AnzahlFall)) %>%
  mutate(AnzahlFall.cum = cumsum(AnzahlFall)) %>%
  arrange(desc(Meldedatum)) %>%
  head()`
The new daily cases for the latest day are always wrong. Interestingly, the cumulative number of cases from the raw data and from the arcgis data is incorrect, while the figure from your dataset is the same as on the RKI dashboard.
Alright - good that you did this test. For the cumulative number you have to consider the encoding in the NeuerFall column. Maybe this causes the difference between your code and what the package and the dashboard report.
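A small sketch of what that could look like, using dplyr on a toy table. It assumes the encoding as I understand it from the RKI data description: NeuerFall is 0 for cases contained in both today's and yesterday's publication, 1 for cases only in today's publication, and -1 for cases only in yesterday's publication (with a negative AnzahlFall). The toy values and object names are made up for illustration, so please check the assumption against the official column description before relying on it.

```r
library(dplyr)

# Toy rows in the shape of the raw data (all values are invented)
raw_toy <- tibble(
  Meldedatum = c("Oct 01", "Oct 02", "Oct 02", "Oct 03", "Oct 03"),
  NeuerFall  = c(0, 0, 1, 1, -1),
  AnzahlFall = c(100, 50, 20, 30, -5)
)

summary_toy <- raw_toy %>%
  summarise(
    # Assumed rule: current total counts only rows with NeuerFall 0 or 1
    cumulative_cases = sum(AnzahlFall[NeuerFall %in% c(0, 1)]),
    # Assumed rule: change from previous day counts rows with NeuerFall -1 or 1
    change_from_previous_day = sum(AnzahlFall[NeuerFall %in% c(-1, 1)]),
    # Summing everything mixes the -1 correction rows into the total
    naive_sum = sum(AnzahlFall)
  )

print(summary_toy)
```

Under this assumption, a plain sum(AnzahlFall) over all rows (as in the aggregation by Meldedatum) would be slightly off from the dashboard total whenever correction rows with NeuerFall equal to -1 are present.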
Yesterday I noticed that, after summarising your dataset, it showed only 1300 new cases, while the RKI dashboard showed over 2000. Today it is the same: 961 new cases (RKI shows 2279) and no new deaths (RKI shows 2). However, the cumulative numbers appear to be correct (same as on the dashboard). Interestingly, when I filter the data to the previous day, it looks as if it has been updated: instead of 1300 cases, yesterday now shows 2353. Is there any reason why it is like that?