nevrome / covid19germany

R package - Load, visualise and analyse daily updated data on the COVID-19 outbreak in Germany

Data reading fix #47

Closed nevrome closed 3 years ago

nevrome commented 3 years ago

I got an email from Holger Schluckebier who suggested a simpler and still more complete implementation of the data processing in get_RKI_timeseries(). Please see the changes in this function - all other changes in this PR are only minor downstream consequences of this top-level change.

As discussed in #31, the "Change from previous day" number reported by the RKI has so far not matched the last entry in our column NumberNewTestedIll. That's because until now we only counted the cases where NeuerFall/NeuerTodesfall/NeuGenesen are either 0 or 1 (remember the confusing setup of the raw dataset).

Holger now suggested counting the "correction cases", where NeuerFall/NeuerTodesfall/NeuGenesen are either -1 or 1, in separate columns. I called them MovingCorrection.... If one then forms the cumulative sum over these columns, one does indeed arrive at the "correct" number of new cases as displayed in the dashboard.

```r
library(covid19germany)
library(magrittr)  # provides %>%

rki <- get_RKI_timeseries()
rki %>% group_RKI_timeseries() %>% tail()
```

|Date       | NumberNewTestedIll| NumberNewDead| NumberNewRecovered| MovingCorrectionTestedIll| MovingCorrectionDead| MovingCorrectionRecovered| CumNumberTestedIll| CumNumberDead| CumNumberRecovered| CumMovingCorrectionTestedIll| CumMovingCorrectionDead| CumMovingCorrectionRecovered|
|:----------|------------------:|-------------:|------------------:|-------------------------:|--------------------:|-------------------------:|------------------:|-------------:|------------------:|----------------------------:|-----------------------:|----------------------------:|
|2021-03-06 |               8309|            24|                217|                         3|                    4|                        79|            2503536|         72990|            2345157|                           50|                     224|                         8438|
|2021-03-07 |               3374|            20|                 67|                         9|                    7|                        22|            2506910|         73010|            2345224|                           59|                     231|                         8460|
|2021-03-08 |               4790|            23|                 68|                        39|                    8|                        16|            2511700|         73033|            2345292|                           98|                     239|                         8476|
|2021-03-09 |              11388|            16|                144|                       251|                    6|                        42|            2523088|         73049|            2345436|                          349|                     245|                         8518|
|2021-03-10 |              13556|             8|                118|                      3384|                    2|                        51|            2536644|         73057|            2345554|                         3733|                     247|                         8569|
|2021-03-11 |               9137|             5|                 45|                      9101|                    5|                        42|            2545781|         73062|            2345599|                        12834|                     252|                         8611|

CumMovingCorrectionTestedIll for 2021-03-11: 12834 CumMovingCorrectionDead for 2021-03-11: 252
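The column logic described above can be sketched on a toy table. This is a minimal base R illustration, not the code in `get_RKI_timeseries()`: the data values are invented, the helper `sum_by_date()` is hypothetical, and it assumes (as the discussion further down confirms) that rows with `NeuerFall == -1` carry negative `AnzahlFall` values.

```r
# Toy stand-in for the raw RKI table (all values invented for illustration).
raw <- data.frame(
  Meldedatum = as.Date(c("2021-03-09", "2021-03-09", "2021-03-10",
                         "2021-03-10", "2021-03-11", "2021-03-11")),
  NeuerFall  = c(0, -1, 0, 1, 0, 1),
  AnzahlFall = c(10, -1, 12, 2, 3, 7)
)

# Hypothetical helper: sum AnzahlFall per date over rows whose flag is in `flags`.
sum_by_date <- function(df, flags) {
  dates <- sort(unique(df$Meldedatum))
  keep <- df$NeuerFall %in% flags
  vapply(
    seq_along(dates),
    function(i) sum(df$AnzahlFall[keep & df$Meldedatum == dates[i]]),
    numeric(1)
  )
}

daily <- data.frame(Date = sort(unique(raw$Meldedatum)))

# Flags 0 and 1: cases that count towards the current total.
daily$NumberNewTestedIll <- sum_by_date(raw, c(0, 1))
# Flags -1 and 1: changes relative to the previous publication.
daily$MovingCorrectionTestedIll <- sum_by_date(raw, c(-1, 1))

# Cumulative sums over both kinds of columns.
daily$CumNumberTestedIll <- cumsum(daily$NumberNewTestedIll)
daily$CumMovingCorrectionTestedIll <- cumsum(daily$MovingCorrectionTestedIll)

print(daily)
```

In this toy example the last value of `CumMovingCorrectionTestedIll` is the net change against the previous publication, which is the role the dashboard's "Change from previous day" figure plays in the text above.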


Unfortunately, I have to admit that I still do not understand exactly how this magic works.

Maybe some of you who previously reported issues in this context (@slawomirmatuszak, @stschiff) could take a quick look at this PR. I would also like to break these columns down a little bit for the README, and I'm not sure yet how to tackle this. MovingCorrection... might not be a good name and maybe it's not even necessary to drop these columns in the main output dataset.

nevrome commented 3 years ago

Also pinging @wulms #38

codecov-io commented 3 years ago

Codecov Report

Merging #47 (117a085) into master (ef027a6) will increase coverage by 0.30%. The diff coverage is 100.00%.


```diff
@@            Coverage Diff             @@
##           master      #47      +/-   ##
==========================================
+ Coverage   81.51%   81.81%   +0.30%     
==========================================
  Files           7        7              
  Lines         330      308      -22     
==========================================
- Hits          269      252      -17     
+ Misses         61       56       -5     
```

| Impacted Files | Coverage Δ |
|:---------------|:-----------|
| R/estimatepast_RKI_timeseries.R | 100.00% <100.00%> (ø) |
| R/get_RKI_timeseries.R | 97.29% <100.00%> (-0.67%) :arrow_down: |
| R/group_RKI_timeseries.R | 93.18% <100.00%> (+17.42%) :arrow_up: |
| R/plot_RKI_timeseries.R | 81.96% <100.00%> (+0.93%) :arrow_up: |


stschiff commented 3 years ago

Uff, who would have thought this is so complicated. All right, so the critical changes are in lines 134 - 163 in R/get_RKI_timeseries.R, correct?

It might take me some time to fully understand this on my own, but I think we should. If someone else can enlighten us, that'd be welcome!

stschiff commented 3 years ago

OK, I think I'm getting it to some extent. Looking at the raw data and then summing up the different cases of "NeuerFall" I find that any values other than 0 quickly fade into the past:

```r
library(covid19germany)
library(magrittr)  # provides %>%

rki <- get_RKI_timeseries(raw_only = TRUE)

# Count the rows per reporting date, split by the value of the NeuerFall flag
rki %>%
  dplyr::group_by(Meldedatum) %>%
  dplyr::summarise(
    NeuerFallPlus = sum(NeuerFall == 1),
    NeuerFallMinus = sum(NeuerFall == -1),
    NeuerFallZero = sum(NeuerFall == 0)
  ) %>%
  tail(10)
```

which yields

```
# A tibble: 10 x 4
   Meldedatum          NeuerFallPlus NeuerFallMinus NeuerFallZero
   <dttm>                      <int>          <int>         <int>
 1 2021-03-03 00:00:00             8              6          6676
 2 2021-03-04 00:00:00             4              2          6171
 3 2021-03-05 00:00:00            10              4          5728
 4 2021-03-06 00:00:00             7              2          5066
 5 2021-03-07 00:00:00             2              4          2332
 6 2021-03-08 00:00:00            32              6          3078
 7 2021-03-09 00:00:00            70             23          6191
 8 2021-03-10 00:00:00           311             49          6995
 9 2021-03-11 00:00:00          1994             51          4786
10 2021-03-12 00:00:00          3901              0            16
```

And I confirm that going much further back in time, non-zero cases almost vanish. This indeed suggests that values other than zero are used by the data curators to indicate changes. So: if a new case occurs today, it is marked with 1, but then changed to 0 tomorrow. What I have not grasped is what on earth -1 means. The explanation on the RKI's webpage reads "-1: Fall ist nur in der Publikation des Vortags enthalten" ("-1: case is only included in the previous day's publication"). But that's bizarre, of course, because if it's only in the publication from the day before, it should simply never be there...

Anyway, I think the code change you suggest makes sense and we should adopt it. However, I suggest a different naming instead of "MovingCorrectionX". How about ChangesTestedIll, ChangesDead and ChangesRecovered?

nevrome commented 3 years ago

OK - Holger wrote me another useful email to explain his reasoning and I started from your code to investigate this, @stschiff.

```r
rki <- get_RKI_timeseries(raw_only = TRUE)

# Sum AnzahlFall per reporting date, split by the value of the NeuerFall flag
rki_daily <- rki %>%
  dplyr::group_by(Meldedatum) %>%
  dplyr::summarise(
    plus = sum(ifelse(NeuerFall == 1, AnzahlFall, NA), na.rm = TRUE),
    minus = sum(ifelse(NeuerFall == -1, AnzahlFall, NA), na.rm = TRUE),
    zero = sum(ifelse(NeuerFall == 0, AnzahlFall, NA), na.rm = TRUE)
  )
```

When you run this, you will see that all entries in minus are negative. I think they are negative transmission errors detected for past releases. plus only holds positive values; I think they are positive transmission errors detected for past releases. If you sum all entries in plus and all entries in minus and then add the two totals,

```r
sum(rki_daily$minus) + sum(rki_daily$plus)
```

you get the number of new cases the RKI can report today (not the new cases measured today). These cases are not only from today, but also from previous days. Of course the corrections are stronger for the more recent past, which is why the absolute values in minus and plus are highest in the last couple of days. But some individual corrections also go back many months. I have NO idea how that could happen. :open_mouth:
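To see how such changes spread over time, one could group the flagged rows by their age relative to the publication date. The following base R sketch uses invented values; `publication_date`, the bucket boundaries and the bucket labels are assumptions made for illustration only.

```r
# Hypothetical publication date and toy rows (all values invented).
publication_date <- as.Date("2021-03-12")
raw <- data.frame(
  Meldedatum = as.Date(c("2020-11-02", "2021-03-09", "2021-03-10",
                         "2021-03-11", "2021-03-11", "2021-03-12")),
  NeuerFall  = c(-1, 1, -1, 1, 0, 1),
  AnzahlFall = c(-1, 4, -2, 40, 500, 95)
)

# Keep only rows flagged as changed since the previous publication.
changed <- raw[raw$NeuerFall != 0, ]

# Age of each change in days, relative to the publication date.
changed$lag_days <- as.numeric(publication_date - changed$Meldedatum)

# Net change reported today: additions plus the (negative) removals.
net_change <- sum(changed$AnzahlFall)

# Absolute size of the changes, grouped into coarse age buckets.
changed$bucket <- cut(
  changed$lag_days,
  breaks = c(-Inf, 1, 7, Inf),
  labels = c("0-1 days", "2-7 days", "older")
)
by_age <- tapply(abs(changed$AnzahlFall), changed$bucket, sum)

print(net_change)
print(by_age)
```

With real data, `publication_date` would be the date of the downloaded release, and the buckets would show whether most changes do in fact concern the last few days.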

Now about zero. I now think this is the most confusing column and I have no clue what the value for the current day means. For past days it seems to be constantly corrected by new minus and plus reports. That means it always lags behind the new information coming in, most notably so for the current day. The total sum of reported cases is therefore calculated as

```r
sum(rki_daily$zero + rki_daily$plus)
```

stschiff commented 3 years ago

OK, wow. Indeed, those rows with NeuerFall=-1 have negative values in AnzahlFall. That's indeed a very important part of the puzzle. And it now explains what's going on, I think. So indeed, they somehow record a change from yesterday using the NeuerFall entry. I don't think the zero is too confusing, though. Zero simply means that the entry in that row is stable, it hasn't changed from yesterday, i.e. it is neither reported as deleted (NeuerFall=-1) nor as added (NeuerFall=1).

So this really now confirms to me that this is an elegant trick from the data curators to transparently communicate changes of the table... it's like the "Track Changes" feature in Microsoft Word, where the NeuerFall column is used to denote additions and deletions, and NeuerFall=0 simply indicates a "normal" row. So it's now also clear why the RKI suggests that all rows with NeuerFall %in% c(0,1) indicate actual cases to be counted towards the total, and that rows with NeuerFall %in% c(-1,1) denote the changes from yesterday, even if those changes are recorded for previous dates. It all makes sense now!
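The "Track Changes" reading can be sketched by deriving the flags from two consecutive publications. This is only an illustration of the interpretation above, not the RKI's actual procedure: the per-case `id` key and the function `flag_changes()` are hypothetical, and the real raw data is aggregated rather than listed case by case.

```r
# Two hypothetical publications, keyed by an invented case id.
yesterday <- data.frame(id = c("a", "b", "c"), AnzahlFall = c(1, 1, 1))
today     <- data.frame(id = c("a", "c", "d", "e"), AnzahlFall = c(1, 1, 1, 1))

# Flag each case as the thread interprets NeuerFall:
#   0 = in both publications, 1 = only in today's, -1 = only in yesterday's.
flag_changes <- function(old, new) {
  kept    <- new[new$id %in% old$id, ]
  added   <- new[!(new$id %in% old$id), ]
  removed <- old[!(old$id %in% new$id), ]
  kept$NeuerFall    <- rep(0, nrow(kept))
  added$NeuerFall   <- rep(1, nrow(added))
  removed$NeuerFall <- rep(-1, nrow(removed))
  removed$AnzahlFall <- -removed$AnzahlFall  # removals carry negative counts
  rbind(kept, added, removed)
}

tracked <- flag_changes(yesterday, today)

# Flags 0 and 1 give today's total; flags -1 and 1 give the change from yesterday.
total_today  <- sum(tracked$AnzahlFall[tracked$NeuerFall %in% c(0, 1)])
net_change   <- sum(tracked$AnzahlFall[tracked$NeuerFall %in% c(-1, 1)])
# Flags 0 and -1 (in absolute value) recover yesterday's total.
total_before <- sum(abs(tracked$AnzahlFall[tracked$NeuerFall %in% c(0, -1)]))

print(c(total_today = total_today, net_change = net_change,
        total_before = total_before))
```

Under these assumptions, today's total minus the net change equals yesterday's total, which is why both subsets of flags are needed to reproduce the dashboard numbers.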

So, in light of these insights, I still think that the term "correction" is a bit misleading here. Yes, in some cases it's a correction, but a very important case here is new cases recorded today, which are always marked using NeuerFall=1. I wouldn't call this a "correction", but simply a "change from yesterday". So I would prefer the term "change" over "correction"... although I don't feel strongly about it, and you're the primary author here, @nevrome, so I leave the decision to you.

nevrome commented 3 years ago

I'm not completely through with this, but I want to merge now to prevent a merge conflict with other necessary changes.