Separate splitting dates for each group when fit_optim_split() takes groups into account?

caijun commented 6 years ago

I am preparing the worked example for the incidence paper. The following code reproduces the problem.

# prepare the data
library(outbreaks)

girardot <- data.frame(date = rep(zika_girardot_2015$date, zika_girardot_2015$cases))
girardot$setting <- "Girardot"
sanandres <- data.frame(date = rep(zika_sanandres_2015$date, zika_sanandres_2015$cases))
sanandres$setting <- "San Andres"
zika_colombia_2015 <- rbind(girardot, sanandres)

# weekly incidence by setting
library(incidence)
i.7.setting <- incidence(zika_colombia_2015$date, interval = 7, 
                         groups = zika_colombia_2015$setting)
plot(i.7.setting, stack = FALSE, border = "white")

# fit the log-linear regression models by setting
best.fit2 <- fit_optim_split(i.7.setting)$fit
best.fit2
plot(i.7.setting, fit=best.fit2, border="white")

The resulting plot is as follows

I think the problem occurred because the epidemic in San Andres was earlier than in Girardot. In this case, fit_optim_split() should separately return splitting dates for each group.

zkamvar commented 6 years ago

In this case, fit_optim_split() should separately return splitting dates for each group.

Agreed. The only catch here is that we will have to construct the models separately for each group instead of considering the interactions between group and date, but that shouldn't be too much of a problem :)

thibautjombart commented 6 years ago

I agree about the separate models, but I think we may want to keep the separate split points optional (possibly defaulting to TRUE); the advantage of have a model with interactions is that it provides a formal test for differences amongst groups.

thibautjombart commented 6 years ago

Another thought: I wonder if we want the same behavior in these two distinct cases:

model(s) for 2 groups, but no split; in this case I would think the single model with interactions makes most sense by default
model(s) for 2 groups with a split; in this case different splitting dates for the groups would make sense by default, so effectively separate models

But that would result in different default behaviors; happy to hear your thoughts about it :)

caijun commented 6 years ago

@thibautjombart could you explain a bit more on the single model with interactions for 1)?

thibautjombart commented 6 years ago

Sure. Approach 1 is currently implemented: incidence ~ time * group + intercept*group so that groups can have different slopes and intercepts (* denote additive effects and interactions)

caijun commented 6 years ago

I see. This model requires the observations from different groups are collected at the same time and of the same length? Does it apply to the example in which the epidemic in San Andres occurred earlier than in Girardot?

zkamvar commented 6 years ago

I agree about the separate models, but I think we may want to keep the separate split points optional (possibly defaulting to TRUE); the advantage of have a model with interactions is that it provides a formal test for differences amongst groups.

Agreed.

Another thought: I wonder if we want the same behavior in these two distinct cases

When you say 'behavior', do you mean behavior in the object returned, the plotting, or something else?

zkamvar commented 6 years ago

@thibautjombart, regarding the option, which one do you think is a better name?

separate_split
independent_split
separate_groups
separate_models

thibautjombart commented 6 years ago

independent_split ? Or separate_split

caijun commented 6 years ago

separate_split👍

zkamvar commented 6 years ago

In ce79a054d478702c449ebf110e4cdc9ac7c4caad, I've added a new incidence_fit_list class, which should help alleviate some of the headaches we are running into with incidence_fit objects in nested lists. Also, they print pretty now :)

zkamvar commented 6 years ago

Okay, here's a rough working example:

library(outbreaks)

girardot <- data.frame(date = rep(zika_girardot_2015$date, zika_girardot_2015$cases))
girardot$setting <- "Girardot"
sanandres <- data.frame(date = rep(zika_sanandres_2015$date, zika_sanandres_2015$cases))
sanandres$setting <- "San Andres"
zika_colombia_2015 <- rbind(girardot, sanandres)

# weekly incidence by setting
library(incidence)
i.7.setting <- incidence(zika_colombia_2015$date, interval = 7, 
                         groups = zika_colombia_2015$setting)
# fit the log-linear regression models by setting
best.fit2 <- fit_optim_split(i.7.setting)
best.fit2
#> <list of incidence_fit objects>
#> 
#> attr(x, 'locations'): list of vectors with the locations of each incidence_fit object
#> 
#> 'Girardot', 'fit', 'before'
#> 'San Andres', 'fit', 'before'
#> 'Girardot', 'fit', 'after'
#> 'San Andres', 'fit', 'after'
#> 
#> $lm: regression of log-incidence over time
#> 
#> $info: list containing the following items:
#>   $r (daily growth rate):
#>   Girardot_fit_before San Andres_fit_before    Girardot_fit_after 
#>            0.16462990            0.08074181           -0.04155200 
#>  San Andres_fit_after 
#>           -0.06384633 
#> 
#>   $r.conf (confidence interval):
#>                             2.5 %      97.5 %
#> Girardot_fit_before    0.04550440  0.28375540
#> San Andres_fit_before  0.05653757  0.10494606
#> Girardot_fit_after    -0.05768023 -0.02542377
#> San Andres_fit_after  -0.08352297 -0.04416970
#> 
#>   $doubling (doubling time in days):
#>   Girardot_fit_before San Andres_fit_before    Girardot_fit_after 
#>              4.210336              8.584736                    NA 
#>  San Andres_fit_after 
#>                    NA 
#> 
#>   $doubling.conf (confidence interval):
#>                          2.5 %   97.5 %
#> Girardot_fit_before   2.442763 15.23253
#> San Andres_fit_before 6.604795 12.25994
#> Girardot_fit_after          NA       NA
#> San Andres_fit_after        NA       NA
#> 
#>   $halving (halving time in days):
#>   Girardot_fit_before San Andres_fit_before    Girardot_fit_after 
#>                    NA                    NA              16.68144 
#>  San Andres_fit_after 
#>              10.85649 
#> 
#>   $halving.conf (confidence interval):
#>                           2.5 %   97.5 %
#> Girardot_fit_before          NA       NA
#> San Andres_fit_before        NA       NA
#> Girardot_fit_after    12.017066 27.26375
#> San Andres_fit_after   8.298881 15.69282
#> 
#>   $pred: data.frame of incidence predictions (36 rows, 7 columns)
library(magrittr)
p <- plot(i.7.setting, border="white", stack = FALSE)
p %>% add_incidence_fit(fit_optim_split(i.7.setting))
#> Scale for 'colour' is already present. Adding another scale for
#> 'colour', which will replace the existing scale.
#> Scale for 'colour' is already present. Adding another scale for
#> 'colour', which will replace the existing scale.
#> Scale for 'colour' is already present. Adding another scale for
#> 'colour', which will replace the existing scale.

Created on 2018-07-19 by the reprex package (v0.2.0).

caijun commented 6 years ago

Cool, this is what I expected. A little bit to discuss,

> Girardot_fit_before San Andres_fit_before Girardot_fit_after

> 0.16462990 0.08074181 -0.04155200

> San Andres_fit_after

> -0.06384633

Is it better to show the results by groups rather than by before and after?

> $doubling (doubling time in days):

> Girardot_fit_before San Andres_fit_before Girardot_fit_after

> 4.210336 8.584736 NA

> San Andres_fit_after

> NA

Do you think is it necessary to show the NA values for the after while estimating doubling time? Same question for the NA values for the before while estimating halving time.

In my opinion, the printing information should be succinct and avoid redundancy.

zkamvar commented 6 years ago

I think I agree with both of your points. At the moment, the printing is controlled by the print method for the incidence_fit_list, which aggregates the data from all the incidence_fit objects contained within. The upside is that you can put as many incidence_fit objects anywhere within a list, and they will print like that. The downside is that it's agnostic to how the objects are related.

At the moment, I'm running fit_optim_split() on each group separately and then adding the attributes and class for the incidence_fit_list:

https://github.com/reconhub/incidence/blob/518c6084b1601e597ff9ee9df754433eb070712d/R/fit_optim_split.R#L18-L33

However, this hides the other information present in the output of fit_optim_split() and makes switching between contexts difficult (plot(res) vs plot(res$fit) for split and non-split contexts).

Currently, the structure looks like this kafkaesque nightmare:

$res <incidence_fit_list>
    $group1
        $df
        $plot
        $split
        $fit <incidence_fit_list>
            $before <incidence_fit>
            $after  <incidence_fit>
    $group2
        $df
        $plot
        $split
        $fit  <incidence_fit_list>
            $before <incidence_fit>
            $after  <incidence_fit>

Instead, I'm thinking about rearranging the result like so:

$res
    $df
    $plot
        $group1
        $group2
    $split
    $fit <incidence_fit_list>
        $group1 <incidence_fit_list>
            $before <incidence_fit>
            $after  <incidence_fit>
        $group2 <incidence_fit_list>
            $before <incidence_fit>
            $after  <incidence_fit>

and enabling plot.incidence_fit_list() to detect if there are other incidence_fit_list objects nested and use that to inform how to plot the results.

caijun commented 6 years ago

The second result arrangement is more compact. 👍

zkamvar commented 6 years ago

Hi @caijun,

So I've made a PR (#59) that you can review. Here's the rendered version of your example:

library(outbreaks)

girardot <- data.frame(date = rep(zika_girardot_2015$date, zika_girardot_2015$cases))
girardot$setting <- "Girardot"
sanandres <- data.frame(date = rep(zika_sanandres_2015$date, zika_sanandres_2015$cases))
sanandres$setting <- "San Andres"
zika_colombia_2015 <- rbind(girardot, sanandres)

# weekly incidence by setting
library(incidence)
i.7.setting <- incidence(zika_colombia_2015$date, interval = 7, 
                         groups = zika_colombia_2015$setting)
# fit the log-linear regression models by setting
best.fit2 <- fit_optim_split(i.7.setting)
best.fit2
#> $df
#>         dates   mean.R2     groups
#> 1  2015-11-02 0.8358939   Girardot
#> 2  2015-11-09 0.8434787   Girardot
#> 3  2015-11-16 0.8219235   Girardot
#> 4  2015-11-23 0.7513510   Girardot
#> 5  2015-11-30 0.6704403   Girardot
#> 6  2015-10-26 0.8370080 San Andres
#> 7  2015-11-02 0.8534918 San Andres
#> 8  2015-11-09 0.8382683 San Andres
#> 9  2015-11-16 0.7607815 San Andres
#> 10 2015-11-23 0.6914411 San Andres
#> 
#> $plot

#> 
#> $split
#>     Girardot   San Andres 
#> "2015-11-09" "2015-11-02" 
#> 
#> $fit
#> <list of incidence_fit objects>
#> 
#> attr(x, 'locations'): list of vectors with the locations of each incidence_fit object
#> 
#> 'Girardot', 'before'
#> 'San Andres', 'before'
#> 'Girardot', 'after'
#> 'San Andres', 'after'
#> 
#> $lm: regression of log-incidence over time
#> 
#> $info: list containing the following items:
#>   $r (daily growth rate):
#>   Girardot_before San Andres_before    Girardot_after  San Andres_after 
#>        0.16462990        0.08074181       -0.04155200       -0.06384633 
#> 
#>   $r.conf (confidence interval):
#>                         2.5 %      97.5 %
#> Girardot_before    0.04550440  0.28375540
#> San Andres_before  0.05653757  0.10494606
#> Girardot_after    -0.05768023 -0.02542377
#> San Andres_after  -0.08352297 -0.04416970
#> 
#>   $doubling (doubling time in days):
#>   Girardot_before San Andres_before 
#>          4.210336          8.584736 
#> 
#>   $doubling.conf (confidence interval):
#>                      2.5 %   97.5 %
#> Girardot_before   2.442763 15.23253
#> San Andres_before 6.604795 12.25994
#> 
#>   $halving (halving time in days):
#>   Girardot_after San Andres_after 
#>         16.68144         10.85649 
#> 
#>   $halving.conf (confidence interval):
#>                      2.5 %   97.5 %
#> Girardot_after   12.017066 27.26375
#> San Andres_after  8.298881 15.69282
#> 
#>   $pred: data.frame of incidence predictions (36 rows, 7 columns)
library(magrittr)
p <- plot(i.7.setting, border="white", stack = FALSE)
p %>% add_incidence_fit(fit_optim_split(i.7.setting)$fit)
#> Scale for 'colour' is already present. Adding another scale for
#> 'colour', which will replace the existing scale.
#> Scale for 'colour' is already present. Adding another scale for
#> 'colour', which will replace the existing scale.
#> Scale for 'colour' is already present. Adding another scale for
#> 'colour', which will replace the existing scale.

Created on 2018-08-09 by the reprex package (v0.2.0).

caijun commented 6 years ago

Cool, the NAs from doubling and halving estimates have been removed.

reconhub / incidence

Separate splitting dates for each group when fit_optim_split() takes groups into account? #39

> Girardot_fit_before San Andres_fit_before Girardot_fit_after

> 0.16462990 0.08074181 -0.04155200

> San Andres_fit_after

> -0.06384633

> $doubling (doubling time in days):

> Girardot_fit_before San Andres_fit_before Girardot_fit_after

> 4.210336 8.584736 NA

> San Andres_fit_after

> NA