Closed caijun closed 6 years ago
In this case,
fit_optim_split()
should separately return splitting dates for each group.
Agreed. The only catch here is that we will have to construct the models separately for each group instead of considering the interactions between group and date, but that shouldn't be too much of a problem :)
I agree about the separate models, but I think we may want to keep the separate split points optional (possibly defaulting to TRUE
); the advantage of have a model with interactions is that it provides a formal test for differences amongst groups.
Another thought: I wonder if we want the same behavior in these two distinct cases:
model(s) for 2 groups, but no split; in this case I would think the single model with interactions makes most sense by default
model(s) for 2 groups with a split; in this case different splitting dates for the groups would make sense by default, so effectively separate models
But that would result in different default behaviors; happy to hear your thoughts about it :)
@thibautjombart could you explain a bit more on the single model with interactions for 1)?
Sure. Approach 1 is currently implemented:
incidence ~ time * group + intercept*group
so that groups can have different slopes and intercepts (* denote additive effects and interactions)
I see. This model requires the observations from different groups are collected at the same time and of the same length? Does it apply to the example in which the epidemic in San Andres occurred earlier than in Girardot?
I agree about the separate models, but I think we may want to keep the separate split points optional (possibly defaulting to TRUE); the advantage of have a model with interactions is that it provides a formal test for differences amongst groups.
Agreed.
Another thought: I wonder if we want the same behavior in these two distinct cases
When you say 'behavior', do you mean behavior in the object returned, the plotting, or something else?
@thibautjombart, regarding the option, which one do you think is a better name?
separate_split
independent_split
separate_groups
separate_models
independent_split
? Or separate_split
separate_split
👍
In ce79a054d478702c449ebf110e4cdc9ac7c4caad, I've added a new incidence_fit_list
class, which should help alleviate some of the headaches we are running into with incidence_fit
objects in nested lists. Also, they print pretty now :)
Okay, here's a rough working example:
library(outbreaks)
girardot <- data.frame(date = rep(zika_girardot_2015$date, zika_girardot_2015$cases))
girardot$setting <- "Girardot"
sanandres <- data.frame(date = rep(zika_sanandres_2015$date, zika_sanandres_2015$cases))
sanandres$setting <- "San Andres"
zika_colombia_2015 <- rbind(girardot, sanandres)
# weekly incidence by setting
library(incidence)
i.7.setting <- incidence(zika_colombia_2015$date, interval = 7,
groups = zika_colombia_2015$setting)
# fit the log-linear regression models by setting
best.fit2 <- fit_optim_split(i.7.setting)
best.fit2
#> <list of incidence_fit objects>
#>
#> attr(x, 'locations'): list of vectors with the locations of each incidence_fit object
#>
#> 'Girardot', 'fit', 'before'
#> 'San Andres', 'fit', 'before'
#> 'Girardot', 'fit', 'after'
#> 'San Andres', 'fit', 'after'
#>
#> $lm: regression of log-incidence over time
#>
#> $info: list containing the following items:
#> $r (daily growth rate):
#> Girardot_fit_before San Andres_fit_before Girardot_fit_after
#> 0.16462990 0.08074181 -0.04155200
#> San Andres_fit_after
#> -0.06384633
#>
#> $r.conf (confidence interval):
#> 2.5 % 97.5 %
#> Girardot_fit_before 0.04550440 0.28375540
#> San Andres_fit_before 0.05653757 0.10494606
#> Girardot_fit_after -0.05768023 -0.02542377
#> San Andres_fit_after -0.08352297 -0.04416970
#>
#> $doubling (doubling time in days):
#> Girardot_fit_before San Andres_fit_before Girardot_fit_after
#> 4.210336 8.584736 NA
#> San Andres_fit_after
#> NA
#>
#> $doubling.conf (confidence interval):
#> 2.5 % 97.5 %
#> Girardot_fit_before 2.442763 15.23253
#> San Andres_fit_before 6.604795 12.25994
#> Girardot_fit_after NA NA
#> San Andres_fit_after NA NA
#>
#> $halving (halving time in days):
#> Girardot_fit_before San Andres_fit_before Girardot_fit_after
#> NA NA 16.68144
#> San Andres_fit_after
#> 10.85649
#>
#> $halving.conf (confidence interval):
#> 2.5 % 97.5 %
#> Girardot_fit_before NA NA
#> San Andres_fit_before NA NA
#> Girardot_fit_after 12.017066 27.26375
#> San Andres_fit_after 8.298881 15.69282
#>
#> $pred: data.frame of incidence predictions (36 rows, 7 columns)
library(magrittr)
p <- plot(i.7.setting, border="white", stack = FALSE)
p %>% add_incidence_fit(fit_optim_split(i.7.setting))
#> Scale for 'colour' is already present. Adding another scale for
#> 'colour', which will replace the existing scale.
#> Scale for 'colour' is already present. Adding another scale for
#> 'colour', which will replace the existing scale.
#> Scale for 'colour' is already present. Adding another scale for
#> 'colour', which will replace the existing scale.
Created on 2018-07-19 by the reprex package (v0.2.0).
Cool, this is what I expected. A little bit to discuss,
> Girardot_fit_before San Andres_fit_before Girardot_fit_after
> 0.16462990 0.08074181 -0.04155200
> San Andres_fit_after
> -0.06384633
Is it better to show the results by groups rather than by before and after?
> $doubling (doubling time in days):
> Girardot_fit_before San Andres_fit_before Girardot_fit_after
> 4.210336 8.584736 NA
> San Andres_fit_after
> NA
Do you think is it necessary to show the NA values for the after while estimating doubling time? Same question for the NA values for the before while estimating halving time.
In my opinion, the printing information should be succinct and avoid redundancy.
I think I agree with both of your points. At the moment, the printing is controlled by the print method for the incidence_fit_list
, which aggregates the data from all the incidence_fit
objects contained within. The upside is that you can put as many incidence_fit
objects anywhere within a list, and they will print like that. The downside is that it's agnostic to how the objects are related.
At the moment, I'm running fit_optim_split()
on each group separately and then adding the attributes and class for the incidence_fit_list
:
However, this hides the other information present in the output of fit_optim_split()
and makes switching between contexts difficult (plot(res)
vs plot(res$fit)
for split and non-split contexts).
Currently, the structure looks like this kafkaesque nightmare:
$res <incidence_fit_list>
$group1
$df
$plot
$split
$fit <incidence_fit_list>
$before <incidence_fit>
$after <incidence_fit>
$group2
$df
$plot
$split
$fit <incidence_fit_list>
$before <incidence_fit>
$after <incidence_fit>
Instead, I'm thinking about rearranging the result like so:
$res
$df
$plot
$group1
$group2
$split
$fit <incidence_fit_list>
$group1 <incidence_fit_list>
$before <incidence_fit>
$after <incidence_fit>
$group2 <incidence_fit_list>
$before <incidence_fit>
$after <incidence_fit>
and enabling plot.incidence_fit_list()
to detect if there are other incidence_fit_list
objects nested and use that to inform how to plot the results.
The second result arrangement is more compact. 👍
Hi @caijun,
So I've made a PR (#59) that you can review. Here's the rendered version of your example:
library(outbreaks)
girardot <- data.frame(date = rep(zika_girardot_2015$date, zika_girardot_2015$cases))
girardot$setting <- "Girardot"
sanandres <- data.frame(date = rep(zika_sanandres_2015$date, zika_sanandres_2015$cases))
sanandres$setting <- "San Andres"
zika_colombia_2015 <- rbind(girardot, sanandres)
# weekly incidence by setting
library(incidence)
i.7.setting <- incidence(zika_colombia_2015$date, interval = 7,
groups = zika_colombia_2015$setting)
# fit the log-linear regression models by setting
best.fit2 <- fit_optim_split(i.7.setting)
best.fit2
#> $df
#> dates mean.R2 groups
#> 1 2015-11-02 0.8358939 Girardot
#> 2 2015-11-09 0.8434787 Girardot
#> 3 2015-11-16 0.8219235 Girardot
#> 4 2015-11-23 0.7513510 Girardot
#> 5 2015-11-30 0.6704403 Girardot
#> 6 2015-10-26 0.8370080 San Andres
#> 7 2015-11-02 0.8534918 San Andres
#> 8 2015-11-09 0.8382683 San Andres
#> 9 2015-11-16 0.7607815 San Andres
#> 10 2015-11-23 0.6914411 San Andres
#>
#> $plot
#>
#> $split
#> Girardot San Andres
#> "2015-11-09" "2015-11-02"
#>
#> $fit
#> <list of incidence_fit objects>
#>
#> attr(x, 'locations'): list of vectors with the locations of each incidence_fit object
#>
#> 'Girardot', 'before'
#> 'San Andres', 'before'
#> 'Girardot', 'after'
#> 'San Andres', 'after'
#>
#> $lm: regression of log-incidence over time
#>
#> $info: list containing the following items:
#> $r (daily growth rate):
#> Girardot_before San Andres_before Girardot_after San Andres_after
#> 0.16462990 0.08074181 -0.04155200 -0.06384633
#>
#> $r.conf (confidence interval):
#> 2.5 % 97.5 %
#> Girardot_before 0.04550440 0.28375540
#> San Andres_before 0.05653757 0.10494606
#> Girardot_after -0.05768023 -0.02542377
#> San Andres_after -0.08352297 -0.04416970
#>
#> $doubling (doubling time in days):
#> Girardot_before San Andres_before
#> 4.210336 8.584736
#>
#> $doubling.conf (confidence interval):
#> 2.5 % 97.5 %
#> Girardot_before 2.442763 15.23253
#> San Andres_before 6.604795 12.25994
#>
#> $halving (halving time in days):
#> Girardot_after San Andres_after
#> 16.68144 10.85649
#>
#> $halving.conf (confidence interval):
#> 2.5 % 97.5 %
#> Girardot_after 12.017066 27.26375
#> San Andres_after 8.298881 15.69282
#>
#> $pred: data.frame of incidence predictions (36 rows, 7 columns)
library(magrittr)
p <- plot(i.7.setting, border="white", stack = FALSE)
p %>% add_incidence_fit(fit_optim_split(i.7.setting)$fit)
#> Scale for 'colour' is already present. Adding another scale for
#> 'colour', which will replace the existing scale.
#> Scale for 'colour' is already present. Adding another scale for
#> 'colour', which will replace the existing scale.
#> Scale for 'colour' is already present. Adding another scale for
#> 'colour', which will replace the existing scale.
Created on 2018-08-09 by the reprex package (v0.2.0).
Cool, the NAs from doubling and halving estimates have been removed.
I am preparing the worked example for the incidence paper. The following code reproduces the problem.
The resulting plot is as follows
I think the problem occurred because the epidemic in San Andres was earlier than in Girardot. In this case,
fit_optim_split()
should separately return splitting dates for each group.