sambrilleman / rstanarm

rstanarm R package for Bayesian applied regression modeling
http://mc-stan.org/interfaces/rstanarm.html
GNU General Public License v3.0
0 stars 1 forks source link

Error in posterior_survfit: appears to be observation times in long data that are later than etimes #43

Closed jburos closed 7 years ago

jburos commented 7 years ago

Summary:

Seeing the following when trying to run posterior_survfit with newdataEvent & newdataLong.

However, on inspection, this should not be the case. Below is an example where newdata are a subset of the original data passed in.

Description:

The error appears to occur within jm_data function code, but is based on the last_times argument inferred by posterior_predict. [note: this may be limited to the case where the ids given are numeric].

Reproducible Steps:

  1. Install fork with minor bugfixes, which builds off develop2 branch of sambrilleman/rstanarm

    devtools::install_github('jburos/rstanarm', ref = 'fix-posterior-predict-newdata', args = '--preclean', local = TRUE)
  2. Fit model using simulated data

    data(pbcSurv)
    data(pbcLong)
    f1 <- stan_jm(formulaLong = logBili ~ year + (1 | id), 
                              dataLong = pbcLong,
                              formulaEvent = Surv(futimeYears, death) ~ sex + trt, 
                              dataEvent = pbcSurv,
                              time_var = "year")
  3. run posterior_survfit on newdata

    library(dplyr)
    subset <- pbcSurv %>% dplyr::distinct(id) %>% dplyr::sample_n(10)
    newdataEvent <- pbcSurv %>% 
    dplyr::semi_join(subset) %>%
    dplyr::select(id, futimeYears, death, sex, trt)
    newdataLong <- pbcLong %>% 
    dplyr::semi_join(subset) %>%
    dplyr::select(id, logBili, year)
    pp_survfit <- posterior_survfit(f1, newdataLong = newdataLong, newdataEvent = newdataEvent)

This yields the following:

 Error in FUN(X[[i]], ...) : 
  There appears to be observation times in the longitudinal data that are later than the event time specified in the 'etimes' argument. 
5.
stop("There appears to be observation times in the longitudinal data that ", 
    "are later than the event time specified in the 'etimes' argument.") at misc.R#1173
4.
FUN(X[[i]], ...) 
3.
lapply(ndL, function(x) {
    if (!time_var %in% colnames(x)) 
        STOP_no_var(time_var)
    mt <- tapply(x[[time_var]], factor(x[[id_var]]), max) ... at misc.R#1169
2.
jm_data(object, ndL, ndE, etimes = last_time[[i]], ids = id_list[[i]]) at posterior_survfit.R#385
1.
posterior_survfit(f1, newdataLong = newdataLong, newdataEvent = newdataEvent) 

Which is odd since these are a subset of the data used to fit the original model.

Running in debug mode on line 385 of posterior_survfit, I see the following:

Code:

   for (i in 1:length(id_list)) {
      # Design matrices for individual i only
      dat_i <- jm_data(object, ndL, ndE, etimes = last_time[[i]], ids = id_list[[i]])
      # Obtain mode and var-cov matrix of posterior distribution of new b pars
      # based on asymptotic assumptions, used as center and width of proposal
      # distribution in MH algorithm

State:

Browse[2]> last_time
         2         11         13         17         18         30         31         34         36         40 
 8.8323066 10.1136208 11.6167009  1.8425736  0.0000000  0.8186174  9.8288843 12.2026010 10.8309377 13.2648871 
Browse[2]> id_list
 [1] 34 40 30 36  2 18 17 31 13 11

So last_time[[i]] will be the value for id == 2, whereas the id given will be 34. So these aren't provided in the correct sort order.

RStanARM Version:

minor fork off of develop2 branch of this repo

jburos commented 7 years ago

For the record, if I run the same example converting the ids to characters:

pp_survfit <- posterior_survfit(f1,
            newdataLong = newdataLong %>% dplyr::mutate(id = as.character(id)),
            newdataEvent = newdataEvent %>% dplyr::mutate(id = as.character(id)))

I see the following state at line 385:

Browse[2]> id_list
 [1] "34" "40" "30" "36" "2"  "18" "17" "31" "13" "11"
Browse[2]> last_time
        11         13         17         18          2         30         31         34         36         40 
10.1136208 11.6167009  1.8425736  0.0000000  8.8323066  0.8186174  9.8288843 12.2026010 10.8309377 13.2648871 

Again the sort order of the last_times does not match that of the ids and the same error is given.

sambrilleman commented 7 years ago

I've just adopted your fix here, which is to re-order last_time according to id_list