nevrome / covid19germany

R package - Load, visualise and analyse daily updated data on the COVID-19 outbreak in Germany

Implausible data (from RKI?) #38

Closed wulms closed 3 years ago

wulms commented 3 years ago

Hey guys,

When I compare the COVID-19 deaths and cases of the last few days, the numbers I get are very low.

Example: Yesterday (17.12.2020) there were around 813 deaths and 33777 cases according to the Dashboard. In my analysis there are 20 deaths and 18849 cases.

Now comes the crazy thing: if I calculate the total sum, I get the same result as the Dashboard! As of yesterday (17.12.2020), a total of 24.938 deaths and 1.439.938 cases.

Example with your code, limited to dates after 07.12.2020, with data for the whole of Germany.

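library(dplyr)

# download the RKI timeseries, then aggregate it for the whole of Germany
rki <- covid19germany::get_RKI_timeseries()
covid19germany::group_RKI_timeseries(rki) %>% filter(Date > as.Date("2020-12-07"))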

This is my code:

covid_ger <- covid19germany::get_RKI_timeseries()  %>%
  mutate(Gender = as.character(Gender),
         Gender = ifelse(is.na(Gender), yes = "missing", no = Gender),
         Age = as.character(Age),
         Age = ifelse(is.na(Age), yes = "missing", no = Age)
  ) %>%
  rename(Covid_Case = NumberNewTestedIll,
         Covid_Dead = NumberNewDead,
         Covid_Recovered = NumberNewRecovered) %>%
  mutate(Date_Week = lubridate::week(Date),
         Date_Weekday = lubridate::wday(Date, label = TRUE)
  ) %>%
  relocate(contains("Date"), 
           Bundesland, Landkreis, contains("AreaKm"), 
           contains("_bundesland"), contains("_landkreis"))

# summary per day
covid_ger %>% 
  group_by(Date) %>%
  summarise(Covid_Deaths = sum(Covid_Dead),
            Covid_Cases = sum(Covid_Case)) %>% 
  ungroup() 

# test2
covid_ger %>% 
  group_by(Date) %>%
  summarise(Covid_Deaths = sum(Covid_Dead),
            Covid_Cases = sum(Covid_Case)) %>% 
  ungroup() %>%
  summarise(Covid_Deaths = sum(Covid_Deaths),
            Covid_Cases = sum(Covid_Cases))

# total (all cases and deaths over all data)
covid_ger %>% 
  summarise(Covid_Deaths = sum(Covid_Dead),
            Covid_Cases = sum(Covid_Case))

# yesterday
covid_ger %>% filter(Date == as.Date("2020-12-17")) %>%
  summarise(Covid_Deaths = sum(Covid_Dead))

Best regards, Niklas Wulms

nevrome commented 3 years ago

Did you see this pinned issue here: https://github.com/nevrome/covid19germany/issues/31? Is this the same issue?

wulms commented 3 years ago

Not directly, but I have now compared it with the datahub and the raw data...

The same numbers persist throughout the different analyses, and the absolute totals stay the same.

Here is a comparison of the data from 17.12.2020 and 18.12.2020. Values for many past dates are still being updated!

Screenshot from 2020-12-19 12-52-23

Screenshot from 2020-12-18 13-44-20

Screenshot from 2020-12-18 13-44-35

Code for calculating from the datahub: RKI_COVID19.csv

covid_ger_csv <- readr::read_csv("tasks/RKI_COVID19.csv") %>%
  select(Meldedatum, contains("Anzahl"), contains("Neuer")) %>%
  rename(Date = Meldedatum,
         Covid_Case = AnzahlFall,
         Covid_Death = AnzahlTodesfall,
         Covid_Case2 = NeuerFall,
         Covid_Death2 = NeuerTodesfall)

# summary per day
covid_ger_csv %>% 
  group_by(Date) %>%
  summarise(Covid_Deaths = sum(Covid_Death),
            Covid_Cases = sum(Covid_Case),
            Covid_Deaths2 = sum(Covid_Death2),
            Covid_Cases2 = sum(Covid_Case2)) %>% 
  ungroup() 

# test2
covid_ger_csv %>% 
  group_by(Date) %>%
  summarise(Covid_Deaths = sum(Covid_Death),
            Covid_Cases = sum(Covid_Case),
            Covid_Deaths2 = sum(Covid_Death2),
            Covid_Cases2 = sum(Covid_Case2)) %>% 
  ungroup() %>%
  summarise(Covid_Deaths = sum(Covid_Deaths),
            Covid_Cases = sum(Covid_Cases),
            Covid_Deaths2 = sum(Covid_Deaths2),
            Covid_Cases2 = sum(Covid_Cases2))

# total (all cases and deaths over all data)
covid_ger_csv %>% 
  summarise(Covid_Deaths = sum(Covid_Death),
            Covid_Cases = sum(Covid_Case),
            Covid_Deaths2 = sum(Covid_Death2),
            Covid_Cases2 = sum(Covid_Case2))

# yesterday
covid_ger_csv %>% filter(Date == as.Date("2020-12-17")) %>%
  summarise(Covid_Deaths = sum(Covid_Death),
            Covid_Deaths2 = sum(Covid_Death2))

Code for calculating from issue #31:

covid_ger_raw <- covid19germany::get_RKI_timeseries(raw_only = TRUE) %>%
  select(Meldedatum, contains("Anzahl"), contains("Neuer")) %>%
  rename(Date = Meldedatum,
         Covid_Case = AnzahlFall,
         Covid_Death = AnzahlTodesfall,
         Covid_Case2 = NeuerFall,
         Covid_Death2 = NeuerTodesfall)

# summary per day
covid_ger_raw %>% 
  group_by(Date) %>%
  summarise(Covid_Deaths = sum(Covid_Death),
            Covid_Cases = sum(Covid_Case),
            Covid_Deaths2 = sum(Covid_Death2),
            Covid_Cases2 = sum(Covid_Case2)) %>% 
  ungroup() 

# test2
covid_ger_raw %>% 
  group_by(Date) %>%
  summarise(Covid_Deaths = sum(Covid_Death),
            Covid_Cases = sum(Covid_Case),
            Covid_Deaths2 = sum(Covid_Death2),
            Covid_Cases2 = sum(Covid_Case2)) %>% 
  ungroup() %>%
  summarise(Covid_Deaths = sum(Covid_Deaths),
            Covid_Cases = sum(Covid_Cases),
            Covid_Deaths2 = sum(Covid_Deaths2),
            Covid_Cases2 = sum(Covid_Cases2))

# total (all cases and deaths over all data)
covid_ger_raw %>% 
  summarise(Covid_Deaths = sum(Covid_Death),
            Covid_Cases = sum(Covid_Case),
            Covid_Deaths2 = sum(Covid_Death2),
            Covid_Cases2 = sum(Covid_Case2))

# yesterday
covid_ger_raw %>% filter(Date == as.Date("2020-12-17")) %>%
  summarise(Covid_Deaths = sum(Covid_Death),
            Covid_Deaths2 = sum(Covid_Death2))

wulms commented 3 years ago

Maybe it has something to do with this.

RKI Data hub

So - I do not understand logically how to calculate it :D I thought that simply summing the number of cases (AnzahlFall) would work.

AnzahlFall: number of cases in the corresponding group
AnzahlTodesfall: number of deaths in the corresponding group
Meldedatum: date on which the case became known to the local health authority
Datenstand: date on which the dataset was last updated
NeuerFall:
  0: case is included in the publication for the current day and in the one for the previous day
  1: case is included only in the current publication
  -1: case is included only in the previous day's publication
  This gives: number of cases in the current publication as sum(AnzahlFall) where NeuerFall in (0,1); delta to the previous day as sum(AnzahlFall) where NeuerFall in (-1,1)
NeuerTodesfall:
  0: case is a death both in the publication for the current day and in the one for the previous day
  1: case is a death in the current publication, but not in the previous day's publication
  -1: case is not a death in the current publication, but was a death in the previous day's publication
  -9: case is a death neither in the current publication nor in the previous day's
  This gives: number of deaths in the current publication as sum(AnzahlTodesfall) where NeuerTodesfall in (0,1); delta to the previous day as sum(AnzahlTodesfall) where NeuerTodesfall in (-1,1)
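A minimal sketch of how the rules quoted above could be applied, written in base R so it runs without extra packages. The data frame `rki_toy` and the helper `sum_where` are hypothetical and the numbers are made up; the negative count on the row flagged -1 is also an assumption about how such rows are encoded, not something the documentation above states.

```r
# Toy rows in the shape of RKI_COVID19.csv (all values invented for illustration).
rki_toy <- data.frame(
  AnzahlFall      = c(5, 3, -2, 4),   # assumption: rows flagged -1 carry negative counts
  NeuerFall       = c(0, 1, -1, 1),
  AnzahlTodesfall = c(1, 2, 0, 0),
  NeuerTodesfall  = c(0, 1, -9, -9)
)

# Hypothetical helper: sum a count column over the rows whose flag is in `keep`.
sum_where <- function(count, flag, keep) sum(count[flag %in% keep])

# Current publication: flag in (0, 1). Delta to the previous day: flag in (-1, 1).
cases_total  <- sum_where(rki_toy$AnzahlFall, rki_toy$NeuerFall, c(0, 1))
cases_delta  <- sum_where(rki_toy$AnzahlFall, rki_toy$NeuerFall, c(-1, 1))
deaths_total <- sum_where(rki_toy$AnzahlTodesfall, rki_toy$NeuerTodesfall, c(0, 1))
deaths_delta <- sum_where(rki_toy$AnzahlTodesfall, rki_toy$NeuerTodesfall, c(-1, 1))

print(c(cases_total = cases_total, cases_delta = cases_delta,
        deaths_total = deaths_total, deaths_delta = deaths_delta))
```

Note that neither sum groups by Meldedatum: the delta is defined relative to the previous day's publication, not to the date a case was reported.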

nevrome commented 3 years ago

Could you explain one more time why this is different from #31? My conclusion back then was that we simply can't calculate the number in the dashboard:

To calculate the "Change from previous day" number the RKI shows in the dashboard and the reports we would need an additional column "Date when reported from the local health authorities to the RKI".

Maybe this conclusion is wrong.

wulms commented 3 years ago

You are right - these issues originate from the same problem in the data.

I hope the RKI will bring more clarity to their documentation about the numbers and how they calculate them from the dataset. Or simply provide a column that contains the absolute number per day (stratified by gender, age, and Landkreis).

The misleading part is that the total does match the sum of "AnzahlFall" in the underlying data, while the per-day sums by Meldedatum do not correspond to the new cases per day shown in the dashboard...

And these numbers can quite easily be misinterpreted.
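A small toy example of why the totals can agree while the daily number does not, again in base R. The data frame `toy` and all its values are invented, and it assumes a single publication in which newly added cases (NeuerFall == 1) are spread over several Meldedatum values.

```r
# Invented rows from one hypothetical publication.
toy <- data.frame(
  Meldedatum = as.Date(c("2020-12-15", "2020-12-15", "2020-12-16",
                         "2020-12-16", "2020-12-17")),
  AnzahlFall = c(10, 4, 8, 6, 3),
  NeuerFall  = c(0, 1, 0, 1, 1)
)

# Per-day sums by Meldedatum, as in the "summary per day" code above.
per_day <- aggregate(AnzahlFall ~ Meldedatum, data = toy, FUN = sum)

# The sum of the per-day sums equals the overall total of the publication.
total_from_days <- sum(per_day$AnzahlFall)
total_direct    <- sum(toy$AnzahlFall[toy$NeuerFall %in% c(0, 1)])

# The number for the latest Meldedatum is not the change between publications,
# because newly published cases also land on earlier dates.
latest_day <- per_day$AnzahlFall[per_day$Meldedatum == as.Date("2020-12-17")]
delta      <- sum(toy$AnzahlFall[toy$NeuerFall %in% c(-1, 1)])

print(per_day)
print(c(total_from_days = total_from_days, total_direct = total_direct,
        latest_day = latest_day, delta = delta))
```

In this toy data the totals match, but the sum for the latest Meldedatum is much smaller than the delta, which is the same pattern as in the numbers reported at the top of this issue.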

library(tidyr)
library(ggplot2)

# daily sums by Meldedatum, one panel per measure
covid_ger_raw %>% 
  group_by(Date) %>%
  summarise(Covid_Deaths = sum(Covid_Death),
            Covid_Cases = sum(Covid_Case)) %>% 
  ungroup() %>%
  pivot_longer(Covid_Deaths:Covid_Cases, names_to = "Type", values_to = "Number") %>%
  ggplot(aes(x = Date, y = Number)) +
  geom_line() +
  facet_wrap(. ~ Type, nrow = 4, scales = "free")

Screenshot from 2020-12-18 15-14-13

I know - these are from JHU, but they should show how the ones above can be misinterpreted.

Screenshot from 2020-12-18 15-26-10

Screenshot from 2020-12-18 15-29-19

But again - it has nothing to do with your package.

Anyway, thanks for answering and stay healthy! Niklas

nevrome commented 3 years ago

Yes - you're right. Maybe that's something for a blog post.