Error in ps_check(jm_fit): The same subject ids should appear in each new data frame.

jburos commented 7 years ago

Summary:

I see the following error when running ps_check on a stan_jm fit object:

> rstanarm::ps_check(fit)
Error in validate_newdatas(object, newdataLong, newdataEvent) : 
  The same subject ids should appear in each new data frame.

This is strange because there are no newdata passed in, so presumably there should not be an issue with the subject ids.

Description:

The problem appears to be a result of this line:

Here is the full traceback of the above:

7: stop("The same subject ids should appear in each new data frame.") at misc.R#1045
6: validate_newdatas(object, newdataLong, newdataEvent) at misc.R#1145
5: jm_data(object, newdataLong = ndL, newdataEvent = ndE, ids = id_list, 
       etimes = t, long_parts = FALSE) at posterior_survfit.R#420
4: FUN(X[[i]], ...)
3: lapply(time_seq, function(t) {
       if (!identical(length(t), length(id_list))) 
           stop("Bug found: the vector of prediction times is not the same length ", 
               "as the number of individuals.")
       dat <- jm_data(object, newdataLong = ndL, newdataEvent = ndE, 
           ids = id_list, etimes = t, long_parts = FALSE)
       surv_t <- ll_event(object, data = dat, pars = pars, survprob = TRUE)
       if (is.vector(surv_t) == 1L) 
           surv_t <- t(surv_t)
       surv_t[, (t == 0)] <- 1
       if (standardise) {
           surv_t <- matrix(rowMeans(surv_t), ncol = 1)
           dimnames(surv_t) <- list(iterations = NULL, "standardised_survprob")
       }
       else {
           dimnames(surv_t) <- list(iterations = NULL, ids = id_list)
       }
       surv_t
   }) at posterior_survfit.R#416
2: posterior_survfit(object, standardise = TRUE, control = list(condition = FALSE), 
       times = 0, extrapolate = TRUE, draws = draws, seed = seed) at ps_check.R#80
1: rstanarm::ps_check(fit)

Reproducible Steps:

(similar to those in #43):

Install fork with minor bugfixes, which builds off develop2 branch of sambrilleman/rstanarm

devtools::install_github('jburos/rstanarm', ref = 'fix-posterior-predict-newdata', args = '--preclean', local = TRUE)

Fit model & run ps_check on example data (works fine)

data(pbcSurv)
data(pbcLong)
f1 <- stan_jm(formulaLong = logBili ~ year + (1 | id), 
                          dataLong = pbcLong,
                          formulaEvent = Surv(futimeYears, death) ~ sex + trt, 
                          dataEvent = pbcSurv,
                          time_var = "year")
ps_check(f1)

Fit model & run ps_check on same data, where ids are character values

f2 <- stan_jm(formulaLong = logBili ~ year + (1 | id), 
          dataLong = pbcLong %>% 
            dplyr::mutate(id = paste0(as.character(id, '-new'))) %>%
            dplyr::arrange(id, year),
          formulaEvent = Surv(futimeYears, death) ~ sex + trt, 
          dataEvent = pbcSurv %>% 
            dplyr::mutate(id = paste0(as.character(id, '-new'))) %>%
            dplyr::arrange(id),
          time_var = "year",
          id_var = 'id')
ps_check(f2)

Yields:

Error in validate_newdatas(object, newdataLong, newdataEvent) : 
  The same subject ids should appear in each new data frame.

RStanARM Version:

See above

jburos commented 7 years ago

Notes from debugging on example above, working back along stacktrace:

in `validate_newdatas`:

Browse[2]> sorted_ids
$Long1
 [1] 1  10 11 12 13 14 15 16 17 18 19 2  20 21 22 23 24 25 26 27 28 29 3  30 31 32 33 34 35 36 37 38 39 4  40 5  6  7  8  9 
Levels: 1 10 11 12 13 14 15 16 17 18 19 2 20 21 22 23 24 25 26 27 28 29 3 30 31 32 33 34 35 36 37 38 39 4 40 5 6 7 8 9

$Event
 [1] "1"  "10" "11" "12" "13" "14" "15" "16" "17" "18" "19" "2"  "20" "21" "22" "23" "24" "25" "26" "27" "28" "29" "3"  "30" "31" "32" "33" "34" "35" "36"
[31] "37" "38" "39" "4"  "40" "5"  "6"  "7"  "8"  "9" 

Browse[2]> ids
$Long1
 [1] 1  10 11 12 13 14 15 16 17 18 19 2  20 21 22 23 24 25 26 27 28 29 3  30 31 32 33 34 35 36 37 38 39 4  40 5  6  7  8  9 
Levels: 1 10 11 12 13 14 15 16 17 18 19 2 20 21 22 23 24 25 26 27 28 29 3 30 31 32 33 34 35 36 37 38 39 4 40 5 6 7 8 9

$Event
 [1] "1"  "10" "11" "12" "13" "14" "15" "16" "17" "18" "19" "2"  "20" "21" "22" "23" "24" "25" "26" "27" "28" "29" "3"  "30" "31" "32" "33" "34" "35" "36"
[31] "37" "38" "39" "4"  "40" "5"  "6"  "7"  "8"  "9"

in `jm_data`:

Browse[2]> str(newdataLong[[1]]$id)
 Factor w/ 40 levels "1","10","11",..: 1 1 2 3 3 3 3 3 3 3 ...
Browse[2]> str(newdataEvent$id)
 chr [1:40] "1" "10" "11" "12" "13" "14" "15" "16" "17" "18" "19" "2" "20" "21" "22" "23" "24" "25" "26" "27" ...

in `posterior_survfit`:

The above reflect the data as stored in the stanjm object.

Looking at lines 257-265 of posterior_survfit.R:

  # Construct prediction data
  # ndL: dataLong to be used in predictions
  # ndE: dataEvent to be used in predictions
  if (!identical(is.null(newdataLong), is.null(newdataEvent)))
    stop("Both newdataLong and newdataEvent must be supplied together.")
  if (is.null(newdataLong)) { # user did not specify newdata
    ndL <- model.frame(object)[1:M]          ## <- here 
    ndE <- model.frame(object)$Event       ## <- and here
  } else { # user specified newdata

These objects are returned with the following structure:

Browse[2]> str(ndL[[1]]$id)
 Factor w/ 40 levels "1","10","11",..: 1 1 2 3 3 3 3 3 3 3 ...
Browse[2]> str(ndE$id)
 chr [1:40] "1" "10" "11" "12" "13" "14" "15" "16" "17" "18" "19" "2" "20" "21" "22" "23" "24" "25" "26" "27" ...

What is needed is either to store the object more consistently, or transform the factors correctly back to their chr counterparts.

sambrilleman commented 7 years ago

Yeah, it seems that in the lme4 model frame, character grouping variables are promoted to factors, but integer or numeric grouping variables are not promoted to factors. I'm not sure why that is.

Instead of changing things at the ps_check stage, I went back to stan_jm and changed the class of the id variable stored in the model frame for the event submodel (such that it is promoted from a character to a factor).

Hopefully this works.

jburos commented 7 years ago

Agreed, this makes sense going forward. Thanks Sam. FWIW, a lot of my mangling was to avoid having to re-execute my models in the interim.

sambrilleman / rstanarm