reconhub / incidence

☣:chart_with_upwards_trend::chart_with_downwards_trend:☣ Compute and visualise incidence
https://reconhub.github.io/incidence
Other
58 stars 13 forks source link

Full support for text-based intervals #57

Closed zkamvar closed 6 years ago

zkamvar commented 6 years ago

This PR contains quite a few changes and I hope I can accurately summarize them here. Let me know what you think of the changes and suggest things that still need to be done.

edit 2018-07-18: Changed iso to standard

Things that still need to be done:

New Functions

the new functions get_dates() and get_interval() will close #54, close #48, and close #49. get_interval() will return a vector of length one or equal to the number of bins depending on whether or not the interval is fixed. get_dates() can return the dates on either the right, left, or center of the interval in dates or numerics.

Both of these functions are essential for the functionality of text-based intervals.

Example

library(incidence)
library(outbreaks)
onset <- ebola_sim$linelist$date_of_onset

inc <- incidence(onset, "month")
get_dates(inc)
#>  [1] "2014-04-01" "2014-05-01" "2014-06-01" "2014-07-01" "2014-08-01"
#>  [6] "2014-09-01" "2014-10-01" "2014-11-01" "2014-12-01" "2015-01-01"
#> [11] "2015-02-01" "2015-03-01" "2015-04-01"
get_dates(inc, "center")
#>  [1] "2014-04-16" "2014-05-16" "2014-06-16" "2014-07-16" "2014-08-16"
#>  [6] "2014-09-16" "2014-10-16" "2014-11-16" "2014-12-16" "2015-01-16"
#> [11] "2015-02-15" "2015-03-16" "2015-04-16"
get_dates(inc, "center", count_days = TRUE)
#>  [1]  15.0  45.5  76.0 106.5 137.5 168.0 198.5 229.0 259.5 290.5 320.0
#> [12] 349.5 380.0
get_interval(inc)
#>  [1] 30 31 30 31 31 30 31 30 31 31 28 31 30
get_interval(inc, integer = FALSE)
#> [1] "month"

Created on 2018-07-17 by the reprex package (v0.2.0).

Big Changes

iso_weeks to standard

iso_weeks has been replaced by standard. It does exactly the same thing, but now is extended to "month", "quarter", and "year". If the user wants to customize the start date, they can turn this off. This will close #55 and close #53

# prepare the data
library(incidence)
library(outbreaks)
onset <- ebola_sim$linelist$date_of_onset

incidence(onset, "month")
#> <incidence object>
#> [5888 cases from days 2014-04-01 to 2015-04-01]
#> 
#> $counts: matrix with 13 rows and 1 columns
#> $n: 5888 cases in total
#> $dates: 13 dates marking the left-side of bins
#> $interval: 1 month
#> $timespan: 366 days
#> $cumulative: FALSE
incidence(onset, "month", standard = FALSE)
#> <incidence object>
#> [5888 cases from days 2014-04-07 to 2015-04-07]
#> 
#> $counts: matrix with 13 rows and 1 columns
#> $n: 5888 cases in total
#> $dates: 13 dates marking the left-side of bins
#> $interval: 1 month
#> $timespan: 366 days
#> $cumulative: FALSE

# demonstrating quarter
incidence(onset - 30, "quarter")
#> <incidence object>
#> [5888 cases from days 2014-01-01 to 2015-01-01]
#> 
#> $counts: matrix with 5 rows and 1 columns
#> $n: 5888 cases in total
#> $dates: 5 dates marking the left-side of bins
#> $interval: 1 quarter
#> $timespan: 366 days
#> $cumulative: FALSE
incidence(onset - 30, "quarter", standard = FALSE)
#> <incidence object>
#> [5888 cases from days 2014-03-08 to 2015-03-08]
#> 
#> $counts: matrix with 5 rows and 1 columns
#> $n: 5888 cases in total
#> $dates: 5 dates marking the left-side of bins
#> $interval: 1 quarter
#> $timespan: 366 days
#> $cumulative: FALSE

Created on 2018-07-17 by the reprex package (v0.2.0).

Custom first date

This was something that Thibaut originally wanted to implement, but forgot to push the branch :no_good_man: This allows the user to specify a custom first_date that can be before the cases start or a date from within the data in which the dates will be trimmed and a warning will be generated for the user.

This closes #56

# prepare the data
library(incidence)
library(outbreaks)
onset <- ebola_sim$linelist$date_of_onset

plot(incidence(onset, "month", first_date = "2014-03-15"), border = "white")

plot(incidence(onset, "month", first_date = "2014-04-15"), border = "white")
#> Warning: I removed 1 observations outside of [2014-04-15, 2015-04-30].

Created on 2018-07-17 by the reprex package (v0.2.0).

Markdown docs

One more thing I did was convert the docs to markdown using https://github.com/r-lib/roxygen2md.

caijun commented 6 years ago

For the ISO part, I am wondering whether it is necessary for extending iso to "month", "quarter", and "year". I think those intervals of "month", "quarter", and "year" have nothing with iso.

What's the difference between the following two monthly incidence? Could you describe a usage situation?

incidence(onset, "month")
incidence(onset, "month", iso = FALSE)

Could you print out the summary of the following examples ?

incidence(onset, interval = 7L)
incidence(onset, interval = "ISOweek")
zkamvar commented 6 years ago

For this ISO part, I am wondering whether it is necessary for extending iso to "month", "quarter", and "year". I think those intervals of "month", "quarter", and "year" have nothing with iso.

The iso = TRUE parameter is a way to force the bins to conform to a recognized standard. If the user sets iso = TRUE, they are indicating that the bins should start at an internationally recognized beginning of that given interval. It's relatively simple for months, quarters, and years, but a bit more difficult for weeks, which is why you added the option in the first place.

Are you suggesting to keep ISO specific to weeks or are you suggesting to get rid of it all together and have the user specify "ISO{week,month,quarter,year}" in the interval? I would be okay with getting rid of the iso argument in favor of having the user giving more specific intervals.

What's the difference between the following two monthly incidence? Could you describe a usage situation?

Yes, if the user switches iso = FALSE, then the month starts from the first date observed and recycled from there. You can see that the bins are shifted by six days in the iso vs non-iso version:

# prepare the data
library(incidence)
library(outbreaks)
onset <- ebola_sim$linelist$date_of_onset

print(mo <- incidence(onset, "month"))
#> <incidence object>
#> [5888 cases from days 2014-04-01 to 2015-04-01]
#> 
#> $counts: matrix with 13 rows and 1 columns
#> $n: 5888 cases in total
#> $dates: 13 dates marking the left-side of bins
#> $interval: 1 month
#> $timespan: 366 days
#> $cumulative: FALSE
print(mo.iso <- incidence(onset, "month", iso = FALSE))
#> <incidence object>
#> [5888 cases from days 2014-04-07 to 2015-04-07]
#> 
#> $counts: matrix with 13 rows and 1 columns
#> $n: 5888 cases in total
#> $dates: 13 dates marking the left-side of bins
#> $interval: 1 month
#> $timespan: 366 days
#> $cumulative: FALSE
get_dates(mo) - get_dates(mo.iso)
#> Time differences in days
#>  [1] -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6

Created on 2018-07-17 by the reprex package (v0.2.0).

zkamvar commented 6 years ago

Also, I haven't yet implemented "ISOweek"

zkamvar commented 6 years ago

More importantly, the differences between setting iso = TRUE and iso = FALSE affect the shape of the epi curve:

# prepare the data
library(incidence)
library(outbreaks)
onset <- ebola_sim$linelist$date_of_onset

mo <- incidence(onset, "month")
mo.iso <- incidence(onset, "month", iso = FALSE)

plot(mo) + ggplot2::ylim(c(0, 1275))

plot(mo.iso) + ggplot2::ylim(c(0, 1275))

Created on 2018-07-17 by the reprex package (v0.2.0).

caijun commented 6 years ago

Now I see the difference. In essential, this is the same problem as calculating weekly incidence using ISOweek or the week/7 day interval with specifying starting date.

In the simulated dataset, the first case occurred on 2014-04-07. We have two ways to calculate monthly incidence:

  1. count cases starting from 2014-04-01, according to the first day of the calendar month, which corresponds to the first example.
mo <- incidence(onset, "month")
  1. count cases starting the date that the first case occurred, usually not the first day of the month, as in the second example. In this example, what the length of days for each month? 30 or 31 days?
mo.iso <- incidence(onset, "month", iso = FALSE)

For the method 2, I recommend to adopt the behavior of seq.Date() while generating sequence of dates by month

> seq.Date(as.Date("2018-01-15"), as.Date("2018-7-15"), by = "month")
[1] "2018-01-15" "2018-02-15" "2018-03-15" "2018-04-15" "2018-05-15" "2018-06-15" "2018-07-15"

First get the date range of the reported cases, then generate the dates marking the left side of the monthly bins. Finally count cases based on those bins. I just realize that you actually implemented in this way.

Since now you have added the parameter first_date, we could distinguish these two types of month based on the first_date instead of the iso parameter, which is consistent with the case of weekly incidence. We can retain two special values for the first_date parameter, such as FDC (first day of the calendar) and DFC (date that the first case occurred) (you guys can think of better names), corresponding to methods 1 and 2, respectively. And FDC can be the default value. Same behaviors apply to "quarter" and "year".

The advantage of doing so is easy to understand conceptually. At the first, I am confused by why the "month", "quarter", and "year" incidence have been relevant to the iso parameter. Moreover, it is easier to write documentations.

zkamvar commented 6 years ago

Since now you have added the parameter first_date, we could distinguish these two types of month based on the first_date instead of the iso parameter, which is consistent with the case of weekly incidence. We can retain two special values for the first_date parameter, such as FDC (first day of the calendar) and DFC (date that the first case occurred) (you guys can think of better names), corresponding to methods 1 and 2, respectively. And FDC can be the default value. Same behaviors apply to "quarter" and "year".

The advantage of doing so is easy to understand conceptually. At the first, I am confused by why the "month", "quarter", and "year" incidence have been relevant to the iso parameter. Moreover, it is easier to write documentations.

You bring up a good point here and this highlights a kind of "too many switches" problem where we currently have two questions (How wide do you want the interval to be?, Where should the interval start?) and three options (interval, iso, and first_date).

I think getting rid of the iso argument is a good idea, but we should figure out a default configuration that makes sense for text-based intervals and an easy way to switch them off (which @caijun already alluded to with the first_date argument).

zkamvar commented 6 years ago

On second thought, it becomes a problem for users to specify that they want the default start date to be the first date of the cases, so a single logical flag to switch start dates between first calendar date and first case makes sense in addition to the first_date parameter for control of where the interval starts.

It makes sense because if we have three, simple, parameters, then it reduces the cognitive load for the user. With two parameters that can take various inputs, then the user needs to look up appropriate values for either first_date (e.g. FDC, DFC, or a date) or interval (week vs ISOweek), depending on the implementation we use.

Given that @caijun takes issue with the iso argument being used for months, quarters, and years, perhaps it would be better to name the flag something like canonical = TRUE and specify in the documentation that this will set the start date to be the first calendar date of the interval.

zkamvar commented 6 years ago

I have also ensured that subset.incidence() works with character intervals.