sambrilleman / rstanarm

rstanarm R package for Bayesian applied regression modeling
http://mc-stan.org/interfaces/rstanarm.html
GNU General Public License v3.0
0 stars 1 forks source link

Error with multiple biomarkers, where one is binomial #68

Closed jburos closed 7 years ago

jburos commented 7 years ago

Summary:

Seeing the following error when I fit a model with more than one biomarker, where at least one of the biomarkers is binomial().

 Error in stan_jm(formulaLong = ...  : 
  The number of factor levels for each of the grouping factors must be the same in each of the longitudinal submodels. 

However, the number of grouping levels in the input data is the same for each biomarker.

Description:

In this case, I see the following when running stan_jm in debug mode:

Browse[2]> y_flist_padded[[1]][[cnms_nms[1]]]
#[ ... ]
92 Levels: 10_112 11_10 11_11 11_12 11_13 11_14 1_118 11_24 11_25 11_38 11_40 11_5 11_56 11_57 11_71 ... _NEW_patient_id
Browse[2]> y_flist_padded[[2]][[cnms_nms[1]]]
91 Levels: 10_112 11_10 11_11 11_12 11_13 11_14 1_118 11_24 11_25 11_38 11_40 11_5 11_56 11_57 11_71 ... 9_97

Here, the first biomarker is gaussian() where the second is binomial().

It appears that the _NEW_patient_id record is only being added for the first biomarker, and not the second. This may apply to binomial-only biomarker case, without yielding the same error message. Not yet confirmed.

Reproducible Steps:

library(rstanarm)
library(dplyr)
data(pbcSurv)
data(pbcLong)

# fit model with more than one biomarker 
# where one biomarker is of `binomial()` type
mv4 <- stan_jm(
  formulaLong = list(
    logBili ~ year + (year | id),
    albumin_high ~ sex + year + (year | id)),
  dataLong = pbcLong %>% dplyr::mutate(albumin_high = albumin > 3.5),
  formulaEvent = Surv(futimeYears, death) ~ sex + trt,
  dataEvent = pbcSurv,
  assoc = list("etavalue", "shared_b(1)"),
  family = list(gaussian(), binomial()),
  time_var = "year",
  chains = 1, cores = 1, seed = 12345, iter = 100)

RStanARM Version:

First identified using version installed from jburos/rstanarm.

Confirmed using latest master branch from sambrilleman/rstanarm:

devtools::install_github('sambrilleman/rstanarm', ref = 'master', args = '--preclean', local = TRUE)

R Version:

> getRversion()
[1] ‘3.3.3’

Operating System:

Linux / Ubuntu

jburos commented 7 years ago

Looking through the code, my guess is that the above-mentioned issue would apply to any binomial biomarker whether or not it is context of multiple biomarkers. The error message would of course only apply in the first case.

After the above commit, the following works, albeit not without a warning:

mv4 <- stan_jm(
  formulaLong = list(
    logBili ~ year + (1 | id),
    albumin_high ~ sex + year + (year | id)),
  dataLong = pbcLong %>% dplyr::mutate(albumin_high = albumin > 3.5),
  formulaEvent = Surv(futimeYears, death) ~ sex + trt,
  dataEvent = pbcSurv,
  assoc = list("etavalue", "shared_b(1)"),
  family = list(gaussian(), binomial()),
  time_var = "year",
  chains = 1, cores = 1, seed = 12345, iter = 100,
  init = 'random'
  )

Warning messages:

Warning messages:
1: In mapply(check_if_rescaled, user_priorLong, y_has_predictors, adjusted_priorLong_scale) :
  longer argument not a multiple of length of shorter
2: In mapply(check_if_rescaled, user_priorLong, y_has_predictors, adjusted_priorLong_scale) :
  longer argument not a multiple of length of shorter

However, using model-based inits yields a new error message:

mv4 <- stan_jm(
  formulaLong = list(
    logBili ~ year + (1 | id),
    albumin_high ~ sex + year + (year | id)),
  dataLong = pbcLong %>% dplyr::mutate(albumin_high = albumin > 3.5),
  formulaEvent = Surv(futimeYears, death) ~ sex + trt,
  dataEvent = pbcSurv,
  assoc = list("etavalue", "shared_b(1)"),
  family = list(gaussian(), binomial()),
  time_var = "year",
  chains = 1, cores = 1, seed = 12345, iter = 100
  )

Yields:

Multivariate joint model specified

Please note the warmup phase may be much slower than later iterations!

SAMPLING FOR MODEL 'jm' NOW (CHAIN 1).
[1] "Initialization from source failed."                                                                                                                             
[2] "mismatch in dimension declared and found in context; processing stage=initialization; variable name=aux_unscaled; position=0; dims declared=(1); dims found=(2)"
[3] "Error in sampler$call_sampler(args_list[[i]]) : "                                                                                                               
[4] "In addition: Warning messages:"                                                                                                                                 
[5] "1: In mapply(check_if_rescaled, user_priorLong, y_has_predictors, adjusted_priorLong_scale) :"                                                                  
[6] "  longer argument not a multiple of length of shorter"                                                                                                          
[7] "2: In mapply(check_if_rescaled, user_priorLong, y_has_predictors, adjusted_priorLong_scale) :"                                                                  
[8] "  longer argument not a multiple of length of shorter"                                                                                                          
error occurred during calling the sampler; sampling not done
Error in check_stanfit(stanfit) : 
  Invalid stanfit object produced please report bug
jburos commented 7 years ago

As an aside, I'm curious to know why you re-order the data for binomial observations within the handle_glmod function, here:

  # Reorder y, X, Z if bernoulli (zeros first)
  if (is.binomial(family$family) && all(y %in% 0:1)) {      
    ord    <- order(y)
    y      <- y     [ord]
    trials <- trials[ord]
    xtemp  <- xtemp [ord, , drop = FALSE]  
    Ztlist <- lapply(Ztlist, function(x) x[, ord, drop = FALSE]) 
    flist  <- lapply(flist,  function(x) x[ord]) 
    N01    <- sapply(0:1,    function(x) sum(y == x))
  } else {
    ord    <- NULL
    N01    <- rep(0L, 2)  # dud entry if not bernoulli
  }
sambrilleman commented 7 years ago

I think the commit referenced above should resolve most of these issues.

As well as the change made here.

The reason for reordering the bernoulli observations is so that I can follow Ben and Jonah's approach for incrementing the target in bernoulli models -- see their approach here. From memory, they create completely separate data blocks (i.e. design matrices, etc) for observations corresponding to 0's and 1's. I basically tried to copy that idea, but it's a little messier in the stan_jm case though because bernoulli outcomes might be combined with gaussian or other outcomes, so I can't just pass two separate data blocks (one for 0's and one for 1's) -- since jm.stan relies on one big block diagonal design matrix for all longitudinal submodels. So, instead of providing separate data blocks for the 0's and 1's I just sort the observations based on 0's and 1's and then use some indexing to identify which rows of eta to use when incrementing the target later on in Stan, as seen here and here. This may be kinda clunky? I dunno.