Closed zkamvar closed 6 years ago
For the ISO
part, I am wondering whether it is necessary for extending iso
to "month", "quarter", and "year". I think those intervals of "month", "quarter", and "year" have nothing with iso
.
What's the difference between the following two monthly incidence? Could you describe a usage situation?
incidence(onset, "month")
incidence(onset, "month", iso = FALSE)
Could you print out the summary of the following examples ?
incidence(onset, interval = 7L)
incidence(onset, interval = "ISOweek")
For this ISO part, I am wondering whether it is necessary for extending iso to "month", "quarter", and "year". I think those intervals of "month", "quarter", and "year" have nothing with iso.
The iso = TRUE
parameter is a way to force the bins to conform to a recognized standard. If the user sets iso = TRUE
, they are indicating that the bins should start at an internationally recognized beginning of that given interval. It's relatively simple for months, quarters, and years, but a bit more difficult for weeks, which is why you added the option in the first place.
Are you suggesting to keep ISO specific to weeks or are you suggesting to get rid of it all together and have the user specify "ISO{week,month,quarter,year}"
in the interval? I would be okay with getting rid of the iso
argument in favor of having the user giving more specific intervals.
What's the difference between the following two monthly incidence? Could you describe a usage situation?
Yes, if the user switches iso = FALSE
, then the month starts from the first date observed and recycled from there. You can see that the bins are shifted by six days in the iso vs non-iso version:
# prepare the data
library(incidence)
library(outbreaks)
onset <- ebola_sim$linelist$date_of_onset
print(mo <- incidence(onset, "month"))
#> <incidence object>
#> [5888 cases from days 2014-04-01 to 2015-04-01]
#>
#> $counts: matrix with 13 rows and 1 columns
#> $n: 5888 cases in total
#> $dates: 13 dates marking the left-side of bins
#> $interval: 1 month
#> $timespan: 366 days
#> $cumulative: FALSE
print(mo.iso <- incidence(onset, "month", iso = FALSE))
#> <incidence object>
#> [5888 cases from days 2014-04-07 to 2015-04-07]
#>
#> $counts: matrix with 13 rows and 1 columns
#> $n: 5888 cases in total
#> $dates: 13 dates marking the left-side of bins
#> $interval: 1 month
#> $timespan: 366 days
#> $cumulative: FALSE
get_dates(mo) - get_dates(mo.iso)
#> Time differences in days
#> [1] -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6
Created on 2018-07-17 by the reprex package (v0.2.0).
Also, I haven't yet implemented "ISOweek"
More importantly, the differences between setting iso = TRUE
and iso = FALSE
affect the shape of the epi curve:
# prepare the data
library(incidence)
library(outbreaks)
onset <- ebola_sim$linelist$date_of_onset
mo <- incidence(onset, "month")
mo.iso <- incidence(onset, "month", iso = FALSE)
plot(mo) + ggplot2::ylim(c(0, 1275))
plot(mo.iso) + ggplot2::ylim(c(0, 1275))
Created on 2018-07-17 by the reprex package (v0.2.0).
Now I see the difference. In essential, this is the same problem as calculating weekly incidence using ISOweek or the week/7 day interval with specifying starting date.
In the simulated dataset, the first case occurred on 2014-04-07
. We have two ways to calculate monthly incidence:
2014-04-01
, according to the first day of the calendar month, which corresponds to the first example.mo <- incidence(onset, "month")
mo.iso <- incidence(onset, "month", iso = FALSE)
For the method 2, I recommend to adopt the behavior of seq.Date()
while generating sequence of dates by month
> seq.Date(as.Date("2018-01-15"), as.Date("2018-7-15"), by = "month")
[1] "2018-01-15" "2018-02-15" "2018-03-15" "2018-04-15" "2018-05-15" "2018-06-15" "2018-07-15"
First get the date range of the reported cases, then generate the dates marking the left side of the monthly bins. Finally count cases based on those bins. I just realize that you actually implemented in this way.
Since now you have added the parameter first_date
, we could distinguish these two types of month
based on the first_date
instead of the iso
parameter, which is consistent with the case of weekly incidence. We can retain two special values for the first_date
parameter, such as FDC (first day of the calendar)
and DFC (date that the first case occurred)
(you guys can think of better names), corresponding to methods 1 and 2, respectively. And FDC
can be the default value. Same behaviors apply to "quarter" and "year".
The advantage of doing so is easy to understand conceptually. At the first, I am confused by why the "month", "quarter", and "year" incidence have been relevant to the iso
parameter. Moreover, it is easier to write documentations.
Since now you have added the parameter first_date, we could distinguish these two types of month based on the first_date instead of the iso parameter, which is consistent with the case of weekly incidence. We can retain two special values for the first_date parameter, such as FDC (first day of the calendar) and DFC (date that the first case occurred) (you guys can think of better names), corresponding to methods 1 and 2, respectively. And FDC can be the default value. Same behaviors apply to "quarter" and "year".
The advantage of doing so is easy to understand conceptually. At the first, I am confused by why the "month", "quarter", and "year" incidence have been relevant to the iso parameter. Moreover, it is easier to write documentations.
You bring up a good point here and this highlights a kind of "too many switches" problem where we currently have two questions (How wide do you want the interval to be?, Where should the interval start?) and three options (interval
, iso
, and first_date
).
I think getting rid of the iso
argument is a good idea, but we should figure out a default configuration that makes sense for text-based intervals and an easy way to switch them off (which @caijun already alluded to with the first_date
argument).
On second thought, it becomes a problem for users to specify that they want the default start date to be the first date of the cases, so a single logical flag to switch start dates between first calendar date and first case makes sense in addition to the first_date
parameter for control of where the interval starts.
It makes sense because if we have three, simple, parameters, then it reduces the cognitive load for the user. With two parameters that can take various inputs, then the user needs to look up appropriate values for either first_date
(e.g. FDC, DFC, or a date) or interval
(week vs ISOweek), depending on the implementation we use.
Given that @caijun takes issue with the iso
argument being used for months, quarters, and years, perhaps it would be better to name the flag something like canonical = TRUE
and specify in the documentation that this will set the start date to be the first calendar date of the interval.
I have also ensured that subset.incidence()
works with character intervals.
This PR contains quite a few changes and I hope I can accurately summarize them here. Let me know what you think of the changes and suggest things that still need to be done.
edit 2018-07-18: Changed
iso
tostandard
Things that still need to be done:
scales
are not that great :/standard
and the usage of text-based intervals.New Functions
the new functions
get_dates()
andget_interval()
will close #54, close #48, and close #49.get_interval()
will return a vector of length one or equal to the number of bins depending on whether or not the interval is fixed.get_dates()
can return the dates on either the right, left, or center of the interval in dates or numerics.Both of these functions are essential for the functionality of text-based intervals.
Example
Created on 2018-07-17 by the reprex package (v0.2.0).
Big Changes
iso_weeks to standard
iso_weeks
has been replaced bystandard
. It does exactly the same thing, but now is extended to "month", "quarter", and "year". If the user wants to customize the start date, they can turn this off. This will close #55 and close #53Created on 2018-07-17 by the reprex package (v0.2.0).
Custom first date
This was something that Thibaut originally wanted to implement, but forgot to push the branch :no_good_man: This allows the user to specify a custom
first_date
that can be before the cases start or a date from within the data in which the dates will be trimmed and a warning will be generated for the user.This closes #56
Created on 2018-07-17 by the reprex package (v0.2.0).
Markdown docs
One more thing I did was convert the docs to markdown using https://github.com/r-lib/roxygen2md.