signaturescience / fiphde

Forecasting Influenza in Support of Public Health Decision Making
https://signaturescience.github.io/fiphde/
GNU General Public License v3.0

GLM prediction intervals #80

Closed: vpnagraj closed this issue 2 years ago

vpnagraj commented 2 years ago

we should do some digging into how the GLM prediction intervals are being calculated in the trending ecosystem, so we can explain (and justify) the options we're using.

the developers have put together a vignette on the topic:

https://cran.r-project.org/web/packages/trending/vignettes/prediction_intervals.html

that would be a good place to start. they also point to a series of blog posts:

https://fromthebottomoftheheap.net/2017/05/01/glm-prediction-intervals-i/

https://fromthebottomoftheheap.net/2017/05/01/glm-prediction-intervals-ii/

for reference ... as of writing this issue we are using the simulate_pi=FALSE option here:

https://github.com/signaturescience/fiphde/blob/main/R/glm.R#L141

and here:

https://github.com/signaturescience/fiphde/blob/main/R/glm.R#L87

stephenturner commented 2 years ago

Hey @chulmelowe - any other thoughts to add on this issue? Should we be doing this the way we're doing it now? Any better approach?

chulmelowe commented 2 years ago

The trending package seems to have at least four ways of calculating PIs, with the differences being how it computes the interval itself (either a parametric bootstrap or a transformation of the CI) and whether or not uncertainty in the parameter estimates is accounted for. The first dichotomy is controlled by the simulate_pi argument Pete references above. If simulate_pi=FALSE (as is currently the case), the PI is calculated by transforming the CI; if simulate_pi=TRUE, the PI is estimated with a parametric bootstrap. Both approaches are statistically acceptable as far as I know, with the former being a more Bayesian approach (the transformation of the CI bounds is based on knowledge of the posterior distribution) and the latter being more of a frequentist approach. One thing I don't know and haven't looked into yet is what posterior distributions trending, or more specifically ciTools, assumes for the models we're using. My guess is that they aren't anything outrageous. To some extent, which approach you choose is a philosophical point. Mathematically, the only thing to be aware of is that the transformation approach results in a more conservative (wider) prediction interval.
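To make the simulate_pi=FALSE branch concrete, here's a minimal base-R sketch of the general idea: compute a Wald CI on the link scale, back-transform it, then widen it to a PI using Poisson quantiles at the CI bounds. This mimics the ciTools-style transformation in spirit only; it is an illustration, not the actual ciTools implementation.

```r
# Sketch: PI for a Poisson GLM by transforming a CI (the simulate_pi = FALSE idea).
set.seed(1)
x <- rnorm(100)
y <- rpois(100, lambda = exp(1.5 + 0.5 * x))
mod <- glm(y ~ x, family = poisson)

# point prediction and standard error on the link (log) scale
pred <- predict(mod, newdata = data.frame(x = 0), se.fit = TRUE)

# 95% Wald CI on the link scale, back-transformed to the response scale
z <- qnorm(0.975)
lower_ci <- exp(pred$fit - z * pred$se.fit)
upper_ci <- exp(pred$fit + z * pred$se.fit)

# widen the CI into a prediction interval via Poisson quantiles at the bounds;
# this is why the transformation approach is the more conservative one
lower_pi <- qpois(0.025, lambda = lower_ci)
upper_pi <- qpois(0.975, lambda = upper_ci)
c(lower_pi = lower_pi, upper_pi = upper_pi)
```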

The other issue is whether or not to account for uncertainty in the parameter estimates when computing the PI. That's controlled by the uncertain argument. If uncertain=TRUE (the default), the values passed to the transformation or used to define the parameters in the simulation are treated as estimates subject to estimation error; if uncertain=FALSE, they're treated as known parameters. I think in our use case there's no justification for treating them as known, so leaving that argument at its default is the best way forward.
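For contrast, here's a base-R sketch of the simulate_pi=TRUE idea: a parametric bootstrap that also propagates parameter uncertainty (the uncertain=TRUE case). Drawing coefficients from their asymptotic normal distribution is an assumption made for illustration; it is not necessarily what trending/ciTools does internally.

```r
# Sketch: PI via parametric bootstrap (the simulate_pi = TRUE, uncertain = TRUE idea).
set.seed(1)
x <- rnorm(100)
y <- rpois(100, lambda = exp(1.5 + 0.5 * x))
mod <- glm(y ~ x, family = poisson)

# design matrix for the point at which we want a prediction
X <- model.matrix(~ x, data = data.frame(x = 0))

# draw coefficient vectors from their asymptotic normal distribution;
# this is the step that propagates parameter uncertainty into the PI
B <- MASS::mvrnorm(5000, mu = coef(mod), Sigma = vcov(mod))

# simulate new responses at each drawn parameter value
lambda_sim <- exp(B %*% t(X))
y_sim <- rpois(n = length(lambda_sim), lambda = lambda_sim)

# PI from the quantiles of the simulated responses
pi_boot <- quantile(y_sim, probs = c(0.025, 0.975))
pi_boot
```

Setting uncertain=FALSE would amount to skipping the mvrnorm step and simulating with the fitted coefficients directly, which yields a narrower interval.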

TL;DR: I think our current approach is statistically acceptable.

vpnagraj commented 2 years ago

thanks @chulmelowe

all makes sense.

hadn't thought about the ambiguity in the posterior distributions assumed by ciTools ...

one thing i've been mulling over is what difference the choice of approach makes on the width of the prediction intervals. the trending documentation notes that (as expected) allowing for parameter uncertainty generates a wider interval:

https://cran.r-project.org/web/packages/trending/vignettes/prediction_intervals.html

more or less recapitulating a demonstration from that doc:

library(trending)
library(tidyverse)

# simulate a covariate and a Poisson response
set.seed(123)
x <- rnorm(100, mean = 0)
y <- rpois(n = 100, lambda = exp(1.5 + 0.5 * x))
dat <- data.frame(x = x, y = y)

# define and fit the Poisson GLM
poisson_model <- glm_model(y ~ x, family = "poisson")
fitted_model <- fit(poisson_model, dat)

# prediction intervals under each of the three approaches
pred1 <-
  predict(fitted_model, simulate_pi = TRUE, alpha = 0.05) %>%
  mutate(method = "simulate_pi")
pred2 <-
  predict(fitted_model, simulate_pi = FALSE, uncertain = TRUE, alpha = 0.05) %>%
  mutate(method = "use_ci_uncertain_true")
pred3 <-
  predict(fitted_model, simulate_pi = FALSE, uncertain = FALSE, alpha = 0.05) %>%
  mutate(method = "use_ci_uncertain_false")

# compare interval widths across methods
bind_rows(pred1, pred2, pred3) %>%
  select(x, estimate, lower_pi, upper_pi, method) %>%
  ggplot() +
  geom_ribbon(aes(x = x, ymin = lower_pi, ymax = upper_pi, fill = method), alpha = 0.25)

[figure: trending-pi-example, prediction interval ribbons for the three methods]

anyway, i don't think it changes anything as far as how we move forward. but worth keeping in mind, especially when we get to formally writing about our methods.

and for future us ... here is where we define the prediction interval method:

https://github.com/signaturescience/fiphde/blob/main/R/glm.R#L87

going to close this issue for now. we can re-open for more discussion on the topic if need be.