Hey @chulmelowe - any other thoughts to add on this issue? Should we be doing this the way we're doing it now? Any better approach?
The `trending` package seems to have at least four methods of calculating PIs, with the differences being how it computes the interval itself (either using a parametric bootstrap or a transformation of the CI) and whether or not uncertainty in the parameter estimate is accounted for. The first dichotomy is controlled by the `simulate_pi` argument Pete references above. If `simulate_pi = FALSE` (as is currently the case) then the PI is calculated by transforming the CI. If `simulate_pi = TRUE` then the PI is estimated using a parametric bootstrap. Both approaches are acceptable statistically as far as I know, with the former being a more Bayesian approach where the transformation of the CI bounds is based on knowledge of the posterior distribution, and the latter being more of a frequentist approach. One thing I don't know and haven't looked into yet is what posterior distributions `trending`, or more specifically `ciTools`, assumes for the models we're using. My guess is that they aren't anything outrageous. To some extent I think the choice between approaches is a philosophical point. Mathematically, the only thing to be aware of is that the transformation approach results in a more conservative (wider) prediction interval.
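To make the transformation idea concrete, here is a minimal sketch for a Poisson GLM. This is my own illustration of the general recipe (compute a CI for the mean on the link scale, back-transform it, then take response-distribution quantiles at the CI bounds), not necessarily the exact `ciTools`/`trending` internals:

```r
## illustrative sketch of the CI-transformation idea for a Poisson GLM
## (a generic illustration, not the exact ciTools/trending code)
set.seed(123)
x <- rnorm(100)
dat <- data.frame(x = x, y = rpois(100, lambda = exp(1.5 + 0.5 * x)))

fit <- glm(y ~ x, family = poisson, data = dat)
new_dat <- data.frame(x = seq(-2, 2, by = 0.1))

## 95% CI for the mean: computed on the link scale, then back-transformed
pred <- predict(fit, newdata = new_dat, type = "link", se.fit = TRUE)
z <- qnorm(0.975)
lambda_lo <- exp(pred$fit - z * pred$se.fit)
lambda_hi <- exp(pred$fit + z * pred$se.fit)

## transform the CI bounds into a PI by taking Poisson quantiles at those bounds
pi_lo <- qpois(0.025, lambda = lambda_lo)
pi_hi <- qpois(0.975, lambda = lambda_hi)
```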
The other issue is whether or not to account for uncertainty in the parameter estimate when computing the PI. That's controlled by the `uncertain` argument. If `uncertain = TRUE` (the default) then the values passed to the transformation or used to define the parameters in the simulation are treated as estimates subject to estimation error, whereas if `uncertain = FALSE` they're treated as known parameters. I think in our use case there's no justification for treating them as known parameters, so leaving that argument as it is now is the best way forward.
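And a rough sketch of how the two settings differ under the simulation approach (again my own generic illustration, not `trending`'s actual code): with uncertainty, coefficients are drawn from their estimated sampling distribution before simulating responses; without it, the point estimates are treated as known.

```r
## illustrative parametric bootstrap for the same Poisson GLM as in the sketch above
## (a generic illustration, not trending's actual implementation)
set.seed(123)
x <- rnorm(100)
dat <- data.frame(x = x, y = rpois(100, lambda = exp(1.5 + 0.5 * x)))
fit <- glm(y ~ x, family = poisson, data = dat)
X <- model.matrix(~ x, data = data.frame(x = seq(-2, 2, by = 0.1)))
nsim <- 2000

## analogue of uncertain = TRUE: propagate estimation error by drawing coefficients
## from their approximate sampling distribution before simulating responses
beta_draws <- MASS::mvrnorm(nsim, mu = coef(fit), Sigma = vcov(fit))
sims_uncertain <- apply(beta_draws, 1, function(b) rpois(nrow(X), lambda = exp(X %*% b)))

## analogue of uncertain = FALSE: treat the fitted coefficients as known
lambda_hat <- as.vector(exp(X %*% coef(fit)))
sims_known <- replicate(nsim, rpois(nrow(X), lambda = lambda_hat))

## PIs are quantiles of the simulated responses; the first set should be wider
pi_uncertain <- apply(sims_uncertain, 1, quantile, probs = c(0.025, 0.975))
pi_known     <- apply(sims_known,     1, quantile, probs = c(0.025, 0.975))
```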
TL;DR: I think our current approach is statistically acceptable.
thanks @chulmelowe! all makes sense. hadn't thought about the ambiguity in the posterior distributions assumed by `ciTools` ...
one thing i've been mulling over is what difference the choice of approach will make on the width of the prediction intervals. the `trending` documentation notes that (as expected) allowing for uncertainty will generate a wider interval:

https://cran.r-project.org/web/packages/trending/vignettes/prediction_intervals.html

more or less recapitulating the example in that doc:
```r
library(trending)
library(tidyverse)

## simulate data from a Poisson data-generating process
x <- rnorm(100, mean = 0)
y <- rpois(n = 100, lambda = exp(1.5 + 0.5*x))
dat <- data.frame(x = x, y = y)

## specify the Poisson model (as in the trending vignette) and fit it
poisson_model <- glm_model(y ~ x, family = "poisson")
fitted_model <- fit(poisson_model, dat)

## PI via parametric bootstrap
pred1 <-
  predict(fitted_model, simulate_pi = TRUE, alpha = 0.05) %>%
  mutate(method = "simulate_pi")
## PI via transformation of the CI, accounting for parameter uncertainty
pred2 <-
  predict(fitted_model, simulate_pi = FALSE, uncertain = TRUE, alpha = 0.05) %>%
  mutate(method = "use_ci_uncertain_true")
## PI via transformation of the CI, treating parameters as known
pred3 <-
  predict(fitted_model, simulate_pi = FALSE, uncertain = FALSE, alpha = 0.05) %>%
  mutate(method = "use_ci_uncertain_false")

## plot the three sets of prediction intervals for comparison
bind_rows(pred1, pred2, pred3) %>%
  select(x, estimate, lower_pi, upper_pi, method) %>%
  ggplot() +
  geom_ribbon(aes(x = x, ymin = lower_pi, ymax = upper_pi, fill = method), alpha = 0.25)
```
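and for a quick numeric check alongside the plot (just a small follow-up using the pred1/pred2/pred3 objects created above):

```r
## average 95% PI width by method
bind_rows(pred1, pred2, pred3) %>%
  group_by(method) %>%
  summarise(mean_pi_width = mean(upper_pi - lower_pi))
```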
anyways, i don't think it changes anything as far as how we move forward. but worth keeping in mind, especially when we get to formally writing about our methods.

and for future us ... here's where we define the prediction interval method:

https://github.com/signaturescience/fiphde/blob/main/R/glm.R#L87

going to close this issue for now. we can re-open it for more discussion on the topic if need be.
we should do some digging into how the GLM prediction intervals are being calculated in the `trending` ecosystem. we should be able to answer the following:
- how does `trending` create prediction intervals (either with `simulate_pi=FALSE` or `simulate_pi=TRUE`)?
- which of these options should we be using when we call the `trending` predict method?

the developers have put together a vignette on the topic:
https://cran.r-project.org/web/packages/trending/vignettes/prediction_intervals.html
that would be a good place to start. they also point to a series of blog posts:
https://fromthebottomoftheheap.net/2017/05/01/glm-prediction-intervals-i/
https://fromthebottomoftheheap.net/2017/05/01/glm-prediction-intervals-ii/
for reference ... as of writing this issue we are using the `simulate_pi=FALSE` option here:

https://github.com/signaturescience/fiphde/blob/main/R/glm.R#L141
and here:
https://github.com/signaturescience/fiphde/blob/main/R/glm.R#L87
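for context, a simplified sketch of what that kind of call looks like with `trending` (a generic illustration, not the actual fiphde code at the links above):

```r
## generic sketch of the trending workflow; not the actual fiphde code
library(trending)

## toy data
set.seed(123)
dat <- data.frame(x = rnorm(100))
dat$y <- rpois(100, lambda = exp(1.5 + 0.5 * dat$x))

## specify and fit a Poisson GLM via trending
mod <- glm_model(y ~ x, family = "poisson")
fitted_model <- fit(mod, dat)

## prediction intervals via transformation of the CI (simulate_pi = FALSE)
predict(fitted_model, simulate_pi = FALSE, alpha = 0.05)
```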