Closed nevrome closed 3 years ago
Also pinging @wulms #38
Merging #47 (117a085) into master (ef027a6) will increase coverage by 0.30%. The diff coverage is 100.00%.
@@ Coverage Diff @@
## master #47 +/- ##
==========================================
+ Coverage 81.51% 81.81% +0.30%
==========================================
Files 7 7
Lines 330 308 -22
==========================================
- Hits 269 252 -17
+ Misses 61 56 -5
Impacted Files | Coverage Δ | |
---|---|---|
R/estimatepast_RKI_timeseries.R | 100.00% <100.00%> (ø) | |
R/get_RKI_timeseries.R | 97.29% <100.00%> (-0.67%) | :arrow_down: |
R/group_RKI_timeseries.R | 93.18% <100.00%> (+17.42%) | :arrow_up: |
R/plot_RKI_timeseries.R | 81.96% <100.00%> (+0.93%) | :arrow_up: |
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Uff, who would have thought this would be so complicated. All right, so the critical changes are in lines 134-163 of `R/get_RKI_timeseries.R`, correct?
It might take me some time to fully understand this on my own, but I think we should. If someone else can enlighten us, that'd be welcome!
OK, I think I'm getting it to some extent. Looking at the raw data and then summing up the different cases of "NeuerFall" I find that any values other than 0 quickly fade into the past:
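library(covid19germany)
library(magrittr)  # provides the %>% pipe used below
# Download the raw RKI table without further processing
rki <- get_RKI_timeseries(raw_only = TRUE)
# Count the rows per reporting date for each value of NeuerFall
rki %>%
  dplyr::group_by(Meldedatum) %>%
  dplyr::summarise(
    NeuerFallPlus = sum(NeuerFall == 1),
    NeuerFallMinus = sum(NeuerFall == -1),
    NeuerFallZero = sum(NeuerFall == 0)
  ) %>%
  tail(10)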
which yields
# A tibble: 10 x 4
Meldedatum NeuerFallPlus NeuerFallMinus NeuerFallZero
<dttm> <int> <int> <int>
1 2021-03-03 00:00:00 8 6 6676
2 2021-03-04 00:00:00 4 2 6171
3 2021-03-05 00:00:00 10 4 5728
4 2021-03-06 00:00:00 7 2 5066
5 2021-03-07 00:00:00 2 4 2332
6 2021-03-08 00:00:00 32 6 3078
7 2021-03-09 00:00:00 70 23 6191
8 2021-03-10 00:00:00 311 49 6995
9 2021-03-11 00:00:00 1994 51 4786
10 2021-03-12 00:00:00 3901 0 16
And I confirm that going much further back in time, non-zero cases almost vanish. This indeed suggests that the data curators use values other than zero to indicate changes. So: if a new case occurs today, it is marked with `1`, but then changed to `0` tomorrow. What I have not grasped is what on earth `-1` means. The explanation on the RKI's webpage reads "-1: Fall ist nur in der Publikation des Vortags enthalten" ("-1: case is only contained in the previous day's publication"). But that's bizarre, of course, because if it's only in the publication from the day before, it should simply never be there...
Anyway, I think the code change you suggest makes sense and we should adopt it. However, I suggest a different naming instead of "MovingCorrectionX". How about `ChangesTestedIll`, `ChangesDead` and `ChangesRecovered`?
OK - Holger wrote me another useful email to explain his reasoning and I started from your code to investigate this, @stschiff.
rki <- get_RKI_timeseries(raw_only = TRUE)
# Sum AnzahlFall per reporting date, separately for each NeuerFall flag
rki_daily <- rki %>%
  dplyr::group_by(Meldedatum) %>%
  dplyr::summarise(
    plus = sum(ifelse(NeuerFall == 1, AnzahlFall, NA), na.rm = TRUE),
    minus = sum(ifelse(NeuerFall == -1, AnzahlFall, NA), na.rm = TRUE),
    zero = sum(ifelse(NeuerFall == 0, AnzahlFall, NA), na.rm = TRUE)
  )
When you run this, you will see that all entries in `minus` are negative. I think they are negative transmission errors detected for past releases. `plus` only holds positive values. I think they are positive transmission errors detected for past releases. If you sum all entries in `plus` and all entries in `minus` and then add the two sums, so `sum(rki_daily$minus) + sum(rki_daily$plus)`, you get the number of new cases the RKI can report today (not the new cases measured today). These cases are not only from today, but also from previous days. Of course the corrections are stronger for the more recent past, which is why the absolute values in `minus` and `plus` are highest in the last couple of days. But some individual corrections also go back many months. I have NO idea how that could happen. :open_mouth:
Now about `zero`. I now think this is the most confusing column and I have no clue what the value for the current day means. But for past days it seems to get constantly corrected by new `minus` and `plus` reports. That means it always lags behind the new information coming in, most notably so for the current day. The total sum of reported cases is therefore calculated as
sum(rki_daily$zero + rki_daily$plus)
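The two sums above can be checked by hand on a small made-up table. This is only a sketch: `rki_toy`, `sum_flag` and all numbers are invented for illustration and use base R instead of the package, so nothing here reflects real RKI data.

```r
# Toy stand-in for the raw RKI table; every value is invented.
rki_toy <- data.frame(
  Meldedatum = as.Date(c("2021-03-10", "2021-03-10", "2021-03-11",
                         "2021-03-11", "2021-03-12")),
  NeuerFall  = c(0, -1, 0, 1, 1),
  AnzahlFall = c(100, -2, 80, 5, 40)
)

# Hypothetical helper: sum AnzahlFall over the rows carrying one flag value.
sum_flag <- function(data, flag) {
  sum(data$AnzahlFall[data$NeuerFall == flag])
}

plus  <- sum_flag(rki_toy, 1)   # 5 + 40 = 45
minus <- sum_flag(rki_toy, -1)  # -2
zero  <- sum_flag(rki_toy, 0)   # 100 + 80 = 180

change_from_yesterday <- plus + minus  # 45 - 2 = 43
total_cases           <- zero + plus   # 180 + 45 = 225
```

In this toy table the rows flagged `-1` reduce the day-to-day change but never enter the total, which matches the two formulas given in the thread.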
OK, wow. Indeed, those rows with `NeuerFall=-1` have negative values in `AnzahlFall`. That's indeed a very important part of the puzzle. And it now explains what's going on, I think. So indeed, they somehow record a change from yesterday using the `NeuerFall` entry. I don't think the zero is too confusing, though. Zero simply means that the entry in that row is stable: it hasn't changed from yesterday, i.e. it is neither reported as deleted (`NeuerFall=-1`) nor as added (`NeuerFall=1`).
So this really now confirms to me that this is an elegant trick from the data curators to transparently communicate changes to the table... it's like the "Track Changes" feature in Microsoft Word, where the `NeuerFall` column is used to denote additions and deletions, and `NeuerFall=0` simply indicates a "normal" row. So it's now also clear why the RKI suggests that all rows with `NeuerFall %in% c(0,1)` indicate actual cases to be counted towards the total, and rows with `NeuerFall %in% c(-1,1)` denote the changes from yesterday, even if those changes are recorded for previous dates. It all makes sense now!
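To make the "Track Changes" reading concrete, here is a minimal sketch that derives the flags from two consecutive publications. It assumes the interpretation reached in this thread is right; the data frames `yesterday` and `today`, the `id` column and the one-case-per-row layout are all invented for the example and are not how the RKI file is actually keyed.

```r
# Two hypothetical consecutive publications, one case per row.
yesterday <- data.frame(id = c("a", "b", "c"))
today     <- data.frame(id = c("a", "c", "d", "e"))

ids          <- union(yesterday$id, today$id)
in_yesterday <- ids %in% yesterday$id
in_today     <- ids %in% today$id

# Assumed flag semantics: 0 = in both publications,
# 1 = only in today's, -1 = only in yesterday's.
NeuerFall <- ifelse(in_yesterday & in_today, 0, ifelse(in_today, 1, -1))

tracked <- data.frame(
  id         = ids,
  NeuerFall  = NeuerFall,
  # Removed rows carry a negative count, as observed in the raw data.
  AnzahlFall = ifelse(NeuerFall == -1, -1, 1)
)

# Rows flagged 0 or 1 reproduce today's total.
total <- sum(tracked$AnzahlFall[tracked$NeuerFall %in% c(0, 1)])
# Rows flagged -1 or 1 give the net change since yesterday.
change <- sum(tracked$AnzahlFall[tracked$NeuerFall %in% c(-1, 1)])
```

Here `total` equals the number of rows in `today`, and `change` equals the difference in row counts between the two publications.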
So, in light of these insights, I still think that the term "correction" is a bit misleading here. Yes, in some cases it's a correction, but a very important case here are new cases recorded today, which are always marked using `NeuerFall=1`. I wouldn't call this a "correction", but simply a "change from yesterday". So I would prefer the term "change" over "correction"... although I don't feel strongly about it, and you're the primary author here, @nevrome, so I leave the decision to you.
I'm not completely through with this, but I want to merge now to prevent a merge conflict with other necessary changes.
I got an email from Holger Schluckebier who suggested a simpler and still more complete implementation of the data processing in `get_RKI_timeseries()`. Please see the changes in this function - all other changes in this PR are only minor downstream consequences of this top-level change.
As discussed in #31, the "Change from previous day" number reported by the RKI was so far not equal to the last entry in our column `NumberNewTestedIll`. That's because until now we only counted the cases where `NeuerFall`/`NeuerTodesfall`/`NeuGenesen` are either `0` or `1` (remember the confusing setup of the raw dataset).
Holger now suggested to count the "correction cases" where `NeuerFall`/`NeuerTodesfall`/`NeuGenesen` are either `-1` or `1` in separate columns. I called them `MovingCorrection...`. If one then forms the cumulative sum over these columns, one does indeed arrive at the "correct" number of new cases as displayed in the dashboard.
CumMovingCorrectionTestedIll for 2021-03-11: 12834
CumMovingCorrectionDead for 2021-03-11: 252
Unfortunately I have to admit that I still do not understand how exactly this magic works.
Maybe some of you who previously reported issues in this context (@slawomirmatuszak, @stschiff) could take a quick look at this PR. I would also like to break these columns down a little bit for the README and I'm not sure yet how to tackle this. `MovingCorrection...` might not be a good name and maybe it's not even necessary to drop these columns in the main output dataset.