paul-buerkner / brms

brms R package for Bayesian generalized multivariate non-linear multilevel models using Stan
https://paul-buerkner.github.io/brms/
GNU General Public License v2.0
1.28k stars 181 forks source link

Allow estimating missing values in data #27

Closed paul-buerkner closed 6 years ago

paul-buerkner commented 8 years ago

Currently, rows in data that contain missing values (i.e., NAs) are removed and thus completely ignored, because Stan does not allow NAs in the data. However, it would certainly be nice to have a JAGS like behavior of estimating missing responses automatically.

One option is to add some Stan code for estimating missing responses as described in Section 8 of the Stan 2.9.0 Reference Manual. Another possibility is to use the predict method with newdata within the brm function to estimate missing responses after fitting the model. The former would have the advantage that users could easily take the Stan code dealing with missing data and generalize it beyond what brms offers. The latter would be somewhat cleaner overall and also much easier to implement.

paul-buerkner commented 8 years ago

In fact, it's not just about missing values in the response variable. With Stan, we are also able to impute missing values in the predictors, although the Stan code might get a bit tricky when using design matrices instead of coding every predictor separately.

brockf commented 8 years ago

In the mean time, would it make sense to have a warning that rows with NA values are (a) present and, (b) being ignored?

paul-buerkner commented 8 years ago

Absolutely, I will add such a warning.

hoangtt1989 commented 8 years ago

Does this mean that the package is incapable of modeling longitudinal data where not every subject is measured at every time point?

jgabry commented 8 years ago

Paul can give you a definitive answer about brms in this regard, but you should definitely be able to at least fit some kinds of models even if not every subject is measured at every time point. But if time points aren't measured and equally spaced for each individual it would be tricky to get the covariance matrices for an AR process or something like that, without coding it directly in Stan.

But, for example, with rstanarm (and so I imagine brms too) you could still do something like y ~ x + (1 + time | person), you just wouldn't get all of the parameters you would get if you had everyone measured at every time. And you should also be able to make predictions for the missing times.

paul-buerkner commented 8 years ago

Jonah is absolutely right. This issue is just about estimating missing values (NAs in your data frame) based on the non-missing data in addition to estimating all the model parameters. The current behavoir of brm is the exact same as for most other model fitting functions, such as lm, glm and (stan_)glmer that is rows in your data containing NAs in relevant variables are removed from the model.

hoangtt1989 commented 8 years ago

Actually, I should clarify about my previous post. We don't really have "missing" data per the definition in our data set (no records with NAs) but we have an unbalanced design meaning not the same amount of measurements per subject, nor at equally spaced time points.

Each observation in our data set is a sample. The samples belong to subjects, where we have a variable named studyid that represents the id for each subject. We also have a viral_colldt variable that is a time stamp for when the sample was collected. The number of samples per subject varies between 3 and 12, with a median of 7 samples per subject. The model formula is something like

y ~ x + (1|studyid)

Where x is a categorical variable and y is a variable of discrete counts. When I run brm, I get the following message:

Warning messages: 1: contrasts dropped from factor studyid due to missing levels 2: contrasts dropped from factor studyid due to missing levels 3: contrasts dropped from factor studyid due to missing levels

paul-buerkner commented 8 years ago

This model can of course be fitted with brms. With regard to the warnings, apparently, there are factor levels of studyid, which are not actually present in the data and thus contrasts are dropped. Contrasts doesn't matter anyway for group-level effects, so it shouldn't be a problem You can easily verify which levels are not used by comparing

unique(your_data$studyid)
levels(your_data$studyid)
hoangtt1989 commented 8 years ago

Thank you for the clarification.

ASKurz commented 7 years ago

Hey Paul. Are there any plans on this topic for the foreseeable future?

paul-buerkner commented 7 years ago

It is not high priority at the moment to be honest. In one or two months I will have more time implementing some more features in brms. Maybe this will be implemented then. In the meantime, I happy for ideas on the (formula) syntax to specify how missing values should be imputed.

ASKurz commented 7 years ago

That's understandable. Full disclosure: The depth of my programming experience is working with Mplus and R packages. But I’ll give the syntax challenge a shot. Let:

d = our data frame Y = our main criterion, ~ N(0, 1) X1 = our sole predictor of Y, ~ N(0, 1) A1 = auxiliary variable predicting missingness on X1, ~ N(0, 1)

Let’s presume missingness on X1, and further that we’d like to model X1 with A1. We might extend your bf syntax sensibilities with something like this:

brm(data = d, family = "gaussian", bf(Y ~ X1, X1 ~ A1), prior = c(set_prior("normal(0, 1)", class = "Intercept"), set_prior("normal(0, 1)", class = "b", coef = “X1”), set_prior("normal(0, 1)", class = "b", coef = “A1”), set_prior("cauchy(0, 1)", class = "sigma"))

It’d be less clear with missingness on Y. You’d need an option telling brm to impute. Perhaps something like

brm(data = d, family = "gaussian", NA = “impute”, Y ~ X1)

But then there are the cases in which you’d want to model the criterion with a different distributional family than the predictor with missingness. Let X1 now be dummy variable with missingness. Perhaps you’d augment the family argument to something like:

brm(data = d, family = c(Y ~ "gaussian", X1 ~ “binomial”), bf(Y ~ X1, X1 ~ A1), prior = c(set_prior("normal(0, 1)", class = "Intercept"), set_prior("normal(0, 1)", class = "b", coef = “X1”), set_prior("normal(0, 1)", class = "b", coef = “A1”), set_prior("cauchy(0, 1)", class = "sigma"))

Another can of worms: When your predictors have missingness you model with auxiliary variables you end up with additional intercepts and slopes for which you might want to specify particular priors. How about adding another argument to set_prior? Perhaps “crit”, with which you could specify the desired criterion. For example:

brm(data = d, family = “gaussian”, bf(Y ~ X1, X1 ~ A1), prior = c(set_prior("normal(0, 1)", class = "Intercept", crit = “Y”), set_prior("normal(0, 2)", class = "Intercept", crit = “X1), set_prior("normal(0, 1)", class = "b", coef = “X1”), set_prior("normal(0, 1)", class = "b", coef = “A1”), set_prior("cauchy(0, 1)", class = "sigma"))

paul-buerkner commented 7 years ago

Thanks for the good suggestions! I am not sure how I will end up implementing things but your ideas definitely help me a lot!

Missing value imputation for the response variable Y is already possible. Simply run

predict(., newdata = <data where Y is missing>)

after fitting the model.

m-clark commented 7 years ago

If you haven't yet, you might also check the Statistical Rethinking text, where McElreath has a chapter on it (c-14) with examples in Stan (for both Y and X missingness), and his associated rethinking R package can be used to generate example code. http://xcelab.net/rm/statistical-rethinking/

rubenarslan commented 7 years ago

Here's a small helper function I wrote for using multiply imputed datasets with brms (until this feature is implemented 😄 ). Hope it helps someone. Should work with imputed datasets from mice, Amelia and mitml.

n_imputations.mitml = function(imps) {
  imps$iter$m
}
n_imputations.mids = function(imps) {
  imps$m
}
n_imputations.amelia = function(imps) {
  length(imps$imputations)
}
n_imputations = function(imps) { UseMethod("n_imputations") }

complete.mitml = mitml::mitmlComplete
complete.amelia = function(imps, action = "long") {
  if (action == "long") {
    dplyr::bind_rows(imps$imputations, .id = "m")
  } else if (is.numeric(action)) {
    imps$imputations[[action]]
  }
}
complete.mids = mice::complete
complete = function(...) { UseMethod("complete") }

brm_i = function(..., data) {
  models = list()
  message("Imputation ", 1)
  models[[1]] = brm(..., data = complete(data, 1))
  rhats = data.frame(as.list(rhat(models[[1]])))
  if (any(rhats > 1.1)) {
    warning("Model did not converge on first imputation. Returning model.")
    models[[1]]
  } else {
    m = n_imputations(data)
    if (m > 1) {
      for (i in 2:m) {
        message("Imputation ", i)
        models[[i]] = update(models[[1]], newdata = complete(data, i))
        rhats = dplyr::bind_rows(rhats, data.frame(as.list(rhat(models[[i]]))))
      }
      # merge stanfit objects and assign them to the fit slot of the first brm
      models[[1]]$fit = rstan::sflist2stanfit(lapply(models, function(model) model$fit))
      cat("Rhats for each imputation\n")
      print(rhats)
    }
    cat("Average Rhats\n")
    print(colMeans(rhats))
    models[[1]]
  }
}
paul-buerkner commented 7 years ago

Amazing, thanks Ruben!

paul-buerkner commented 6 years ago

Inspired by Ruben's helper function, I just added the brm_multiple function (name temporary), which works on multiple imputed data sets and combines the results into one fitted model object.

claudiofronterre commented 6 years ago

Hi Paul,

I was thinking if it would also be possible to implement a way of treating missing data for multivariate outcome like the one suggested in Section 11.5 of the Stan manual (ver. 2.17.0) .

In this case you don't impute missing data but you fit the joint model when you observe both the outcome and you fit the marginal models when you observe just one of the outcome (since this chunk of data will still give you information about the marginal parameters, but not about the joint ones of course).

I wrote some working Stan code for the logistic case if could be useful.

paul-buerkner commented 6 years ago

Thanks Claudio. This is certainly something to consider. Previously, I haven't worrid about imputing missing responses too much, because we can equivelently do it after fitting the model using post-processing R code (predict in this case).

However, for multivariate models, with partial missing responses, it really makes sense to handle it inside Stan. I am happy to take a look at you Stan code!

claudiofronterre commented 6 years ago

You can find it here. I posted it on the Stan forum. There is a full reproducible example with R. It handles correlated random effects as well (this feature is already implemented in your package).

It's not efficient Stan code but it works!

paul-buerkner commented 6 years ago

Here is my first proposal for the syntax of missing value imputation within the model. Suppose y is the main response variable, x is a predictor containing missing values, and z is a predictor without missing values, which is used to predict both y and x. We can build on brms' multivariate syntax as follows:

formula = bf(x | mi() ~ z) + bf(y ~ z + mi(x))

Essentially x is another response variable with the addition argument mi telling that missings in x should be modeled. We can add arguments to resp_mi() (which is the actual function that is called internally), if there is something that needs to be specified about the missing value process that cannot be specified elsewhere.

If x is used on the right-hand side of a formula and thus as a predictor in another model (of y in this case), we wrap it around mi() to tell brms that this variable contains modeled missing values.

Suggestions are welcome.

ASKurz commented 6 years ago

This seems sensible. How would you set up the priors associated with x | mi() ~ z?

paul-buerkner commented 6 years ago

With priors you mean the distribution of x? Since x | mi() is just an ordinary response variable (with missings allowed), you can specify the distribution via the family argument, either in brm or bf. See vignette("brms_multivariate") for some examples.

rubenarslan commented 6 years ago

Would it be possible to model measurement error and missingness using mi()? Or would that be possible for a later step?

paul-buerkner commented 6 years ago

I am currently in the process of implementing mi() and I can tell that the first implementation will likely not support measurement error. However, I will think of a good way to add it in the future. Any suggestions are welcome.

paul-buerkner commented 6 years ago

The feature is now mostly working on the branch "feature-issue-27". Most of the documentation and unit tests are still missing, but you can see an example on the ?mi help page.

paul-buerkner commented 6 years ago

Closed via PR #337.

capplestein commented 5 years ago

It doesn't seem like the mi() call works when family is specified as "poisson." I get the error: "Error: Argument ‘mi’ is not supported for family ‘poisson(log)’." Are there any plans to extend mi() to count data, possibly extended through the "21.2stage.pois" approach in micemd package?

paul-buerkner commented 5 years ago

Stan does not support discrete parameters, which is required for estimation of poisson missing values. At some point this may change but until then brms won't be able to support this.