stan-dev / rstanarm

rstanarm R package for Bayesian applied regression modeling
https://mc-stan.org/rstanarm
GNU General Public License v3.0
385 stars 132 forks source link

`posterior_survfit` should return covariate data alongside the predictions #533

Open sambrilleman opened 3 years ago

sambrilleman commented 3 years ago

Summary:

For stan_surv models the id column in the data frame returned by posterior_survfit() represents the row of the covariate data that the predictions "belong" to. But the id column can be confusing, especially when the name id was used for one of the variables (e.g. the grouping factor) in the model itself.

The more explicit thing might be to return all of the covariate data used in the predictions as part of the data returned by posterior_survfit; that way the user knows exactly the covariate values that the row of predictions belongs to. Rather than having to infer it based on some "id" (i.e. row identifier) column.

This is essentially what tripped up one user here.

Reproducible Steps:

Here we fit a model with a grouping factor called id with values 5:10, then we predict survival probabilities for the groups that had id values 5 and 6

library(rstanarm)

data <- data.frame(
    trt = rep(c("A", "B"), each = 15),
    id = rep(5:10, each = 5),
    status = sample(0:1, 30, replace = TRUE),
    days = sample(10, 30, replace = TRUE))

mod = stan_surv(
    formula = Surv(days, status) ~ trt + (1 | id),
    data = data,
    basehaz = "weibull",
    chains = 1,
    iter = 20,
    cores = 1)

posterior_survfit(
    mod, 
    type = "surv",
    newdata = data[1:10,],
    times = 1,
    extrapolate = FALSE)

which returns:

stan_surv predictions
 num. individuals: 10 
 prediction type:  event free probability 
 standardised?:    no 
 conditional?:     no 

   id cond_time   time median  ci_lb  ci_ub
1   1        NA 1.0000 0.9876 0.9716 0.9950
2   2        NA 1.0000 0.9876 0.9716 0.9950
3   3        NA 1.0000 0.9876 0.9716 0.9950
4   4        NA 1.0000 0.9876 0.9716 0.9950
5   5        NA 1.0000 0.9876 0.9716 0.9950
6   6        NA 1.0000 0.9852 0.9684 0.9936
7   7        NA 1.0000 0.9852 0.9684 0.9936
8   8        NA 1.0000 0.9852 0.9684 0.9936
9   9        NA 1.0000 0.9852 0.9684 0.9936
10 10        NA 1.0000 0.9852 0.9684 0.9936

but we see that the id column in the predictions has values 1 through 10. This is because it is referring to rows 1 through 10 of the prediction covariate data, not the values for the id variable in the original model! It would be clearer / safer if we just returned all the covariates, including id, and then we wouldn't even need a row identifier.

jgabry commented 3 years ago

Yeah I see what you mean. I'm not opposed to the solution that you propose but another option could be to switch to using the column name .id instead of id to indicate that it's a metadata column (this is similar to what we're doing in the posterior package with column names like .chain instead of chain). Or maybe the name .row_id would be even less likely to be confused with a variable in the data.