stan-dev / rstanarm

rstanarm R package for Bayesian applied regression modeling
https://mc-stan.org/rstanarm
GNU General Public License v3.0
388 stars 133 forks source link

Problem with using `kfold` with stansurv objects? #473

Open jscamac opened 4 years ago

jscamac commented 4 years ago

Summary:

There appears to be a problem with using kfold on stan_surv models. Error returned is: Error: passing unknown arguments: subset.

Description:

I'm fitting multi-level survival models with time-varying using rstanarm::stan_surv models and want to use kfold cross validation where groups of individuals are heldout (individuals are repeatedly surveyed overtime). I tried aggregating the log likelihoods as done in the time-varying coefficient example in the Arxiv paper (https://arxiv.org/pdf/2002.09633.pdf), however I get pretty poor Pareto k diagnostic values. So I thought using kfold might be a better alternative, where I group individuals into folds. However, when I do this I get the following error Error: passing unknown arguments: subset. NOTE I also get this error if I don't stratify/group the data into folds, and if I exclude the random effect for individual. Any advice welcomed!

Reproducible Steps:

Taking the reproducible example presented in the Arxiv paper:

library(rstanarm)
library(survival)

dat <- survival::pbc
dat <- dat[dat$id <= 312, ]

dat <- tmerge(dat, dat, id = id,death = event(time, as.numeric(status == 2)))

dat <- tmerge(dat, survival::pbcseq, id = id,ascites  = tdc(day, ascites),
              bili = tdc(day, bili),
              albumin = tdc(day, albumin),
              protime = tdc(day, protime),
              alk.phos = tdc(day, alk.phos))

dat <- dat[, c("id", "tstart", "tstop", "death", "bili", "protime")]

mod_tvc <- stan_surv(formula = Surv(tstart, tstop, death) ~ log(bili) + log(protime) + (1|id), data = dat)

inds <- loo::kfold_split_grouped(K= 10, x =mod_tvc$data$id)
kfold(mod_tvc, folds = inds)
Fitting model 1 out of 10
Error: passing unknown arguments: subset.

RStanARM Version:

packageVersion("rstanarm")
[1] ‘2.21.2’

packageVersion("loo")
[1] ‘2.3.1’

R Version:

getRversion()
[1] ‘4.0.1’

Operating System:

OSX 10.14.6

sambrilleman commented 4 years ago

Woops, yeah this is almost certainly because I didn't include the subset argument as an option in the stan_surv() signature.

It could probably be added. It might not be too hard, but I wouldn't know without taking a closer look. If anyone wants to have a go at adding that functionality and submitting a PR, feel free (I don't have much spare time to work on this atm).

It probably wasn't caught earlier since I probably didn't write tests for testing stan_surv() with that post-estimation functionality.

Sorry this feature isn't available! And I'm not sure there is an easy quick fix -- unless there is a hack available where you can subset the data manually outside of the modelling function (e.g. using a loop). @jgabry or @avehtari might be able to advise on that, perhaps.

jgabry commented 4 years ago

We might be able to change the kfold code to not require the subset argument. It's not essential, but we didn't anticipate a modeling function without the subset argument so the code currently expects it. In the meantime,

unless there is a hack available where you can subset the data manually outside of the modelling function (e.g. using a loop).

this should be possible (by changing the data argument each time through the loop). The kfold method in rstanarm is just there for convenience. It doesn't do anything special that couldn't be done manually. It looks like there's a lot of code required

https://github.com/stan-dev/rstanarm/blob/master/R/loo-kfold.R

but most of it is just to get it working in as many cases as possible. If using that file as a reference, the only crucial parts are changing the data argument each time, evalauting log_lik(), and then calculating elpd.

johnbaums commented 4 years ago

Here's a quick workaround

f <- eval(parse(
  text=sub(
    'fit_k_call$subset <- eval(fit_k_call$subset)', 
    'fit_k_call$subset <- if(is.stansurv(x)) NULL else eval(fit_k_call$subset)', 
    deparse(rstanarm:::kfold.stanreg), fixed=TRUE)
), envir=environment(rstanarm:::kfold.stanreg))

assignInNamespace('kfold.stanreg', value=f, ns='rstanarm')

The relevant line is: https://github.com/stan-dev/rstanarm/blob/99b2ae0840b7e15bbd340718da9b41b4dee3ee64/R/loo-kfold.R#L171

jgabry commented 4 years ago

Thanks @johnbaums!