stan-dev / stan


missing data I/O and imputation / partial data structures #646

Open bob-carpenter opened 10 years ago

bob-carpenter commented 10 years ago

From Marcus:

Roughly speaking, I was thinking of being able to write models like this:

data {
  matrix[N, M] Data;
}
transformed data {
  int missing_inds[];
  missing_inds <- find_missing_values(Data);
}
parameters {
  real missing_Data[length(missing_inds)];
}
transformed parameters {
  matrix[N, M] full_Data;

  full_Data <- fill_missing_values(Data, missing_inds, missing_Data);
}
model {
...
}

Here find_missing_values looks for NaN or something like that. If we didn't want to tie it to such a specific value, we could call it something like find_nans so that users are clear about what's going on here. Similar functions for the case where observed values are passed simply as index/value pairs would also be straightforward.

[We also need to] fix the dump file parsing to handle nan/inf.

My followup:

One attractive feature of this proposal is that we don't have to change the model for different patterns of missingness by column (as long as the constraints don't vary).

The one thing I was waffling back and forth on was whether to take a full vector/matrix of predictors with NA-like placeholders, or to pass just the observed data in "database" form:

  int n_rows;
  int n_cols;
  int n_observed;
  int observed_row[n_observed];
  int observed_column[n_observed];
  real observed_value[n_observed];

There are tighter data structures for anything but the sparsest matrices, but this representation seems easiest --- it's what you need for a "partial" matrix data structure.
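For concreteness, here is a minimal sketch of a model that consumes this long-form representation as-is; the per-column normal likelihood and the names mu and sigma are purely illustrative, not part of the proposal:

```stan
data {
  int n_rows;
  int n_cols;
  int n_observed;
  int observed_row[n_observed];
  int observed_column[n_observed];
  real observed_value[n_observed];
}
parameters {
  real mu[n_cols];              // illustrative per-column location
  real<lower=0> sigma[n_cols];  // illustrative per-column scale
}
model {
  // the likelihood ranges over observed entries only; missing cells
  // simply never appear, which is all that's needed under MAR
  for (i in 1:n_observed)
    observed_value[i] ~ normal(mu[observed_column[i]],
                               sigma[observed_column[i]]);
}
```

The point is that the long form needs no sentinel values at all: missingness is represented by absence.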

kkmann commented 8 years ago

Any comments/progress on this one? Personally, an easy way of using data with arbitrary missingness patterns is the single most important thing holding me back from using Stan as my first-choice analysis tool. After all, Bayesian analysis is most useful in situations with a lot of missing data, is it not?

ariddell commented 8 years ago

I'm about to start a model which has a rather large amount of missingness. I was going to take the approach Bob suggested (without knowing, until now, that he had suggested it):

  int n_rows;
  int n_cols;
  int n_observed;
  int observed_row[n_observed];
  int observed_column[n_observed];
  real observed_value[n_observed];
robertgrant commented 8 years ago

That's the approach I've been taking. Honestly, I prefer Stan's explicit specification of what to do with the missing/coarse data to the BUGS black box. Even with a fill_missing_data function, I'd probably carry on writing it out, and I don't know how many people are discouraged by it. If anything, I'd say it's because Stan devs tend to be quite hard on Stan when giving presentations, saying it can't do missing data. Here's an excerpt from a real-life coarsened example where student marks are either a percentage (Nmark), or just recorded as failed (N0) or capped at 40% on resubmission (N40):

data {
  int Nmark;
  int N0;
  int N40;
  int clinical1[Nmark];
  int clinical0[N0];
  int clinical40[N40];
  real mark[Nmark];
}
parameters {
  real beta[2];
  real<lower=0> sigma;
  real<lower=0, upper=39.5> latentmark0[N0];
  real<lower=39.5, upper=100> latentmark40[N40];
}
model {
  real mu[Nmark];
  real mu0[N0];
  real mu40[N40];
  sigma ~ normal(0, 30);
  beta ~ normal(0, 20);
  for (i in 1:Nmark) {
    mu[i] <- beta[1] + beta[2] * clinical1[i];
    mark[i] ~ normal(mu[i], sigma);
  }
  for (i in 1:N0) {
    mu0[i] <- beta[1] + beta[2] * clinical0[i];
    latentmark0[i] ~ normal(mu0[i], sigma);
  }
  for (i in 1:N40) {
    mu40[i] <- beta[1] + beta[2] * clinical40[i];
    latentmark40[i] ~ normal(mu40[i], sigma);
  }
}

So I like Bob's database form because it would still make the user slow down a little and think about what's going on.

betanalpha commented 8 years ago

Bayesian inference gives you a fantastic framework for modeling missingness. A priori we have no idea how the missingness is patterned in your model, and the only way you can define that pattern in a self-consistent fashion is to split up your data into the different patterns and model an explicit likelihood. Anything else is an approximation that may not even be self-consistent.


kkmann commented 8 years ago

Yeah, fully agreed. But there is also a canonical way of treating missing-at-random variables in a Bayesian model (one of the great advantages, if you ask me). So why not combine the best of both worlds by treating missing data as missing at random by default when not specified explicitly? Think of a large clinical database with few patients and lots of variables. Often missing at random is not too far from being a realistic assumption, and hand-coding this in Stan right now is really a downer.

betanalpha commented 8 years ago

No, there is not a canonical way of treating missingness! Missing at random isn't even enough to fully define a model without specifying the distribution of the observations in the first place. Ultimately there are all kinds of subtle problems that arise as soon as you get into even the simplest details, and there's no way for us to robustly automate this process. There may be components we can offer that simplify the modeling process, but we haven't figured those out yet.


syclik commented 8 years ago

+1. I was in the middle of writing the same thing.

kkmann commented 8 years ago

So why is MAR + a prior for each independent variable insufficient? It specifies a full and consistent joint distribution (given that the rest of the model is not inconsistent). Anyway, any kind of tool to handle missing data more conveniently inside Stan would be much appreciated on my end!

betanalpha commented 8 years ago

So why is MAR + a prior for each independent variable insufficient? It specifies a full and consistent joint distribution (given that the rest of the model is not inconsistent).

Yes, MAR + independence can be treated somewhat easily. Under MAR the missing observations are not informative at all, so you might as well throw them away; but you can readily code this up using a loop and a vector of observed indicators, which is exactly what Allen recommended! Imputing the missing observations is another story, because the model and generated quantities blocks are completely independent and do not affect each other, so there's no way to automatically generate quantities from any specification in the model block.
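As a concrete sketch of the loop-plus-indicators pattern (the scalar normal model here is illustrative only; obs is an assumed name for the indicator array):

```stan
data {
  int N;
  int<lower=0, upper=1> obs[N];  // 1 if y[n] was observed
  real y[N];                     // any placeholder value where obs[n] == 0
}
parameters {
  real mu;
  real<lower=0> sigma;
}
model {
  // only observed entries contribute to the likelihood; under MAR
  // the unobserved entries carry no information about mu and sigma
  for (n in 1:N)
    if (obs[n] == 1)
      y[n] ~ normal(mu, sigma);
}
```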

andrewgelman commented 8 years ago

We should have some missing-data imputation in Stan. By which I mean, not an "mi" function, at least not right away, but some examples in the manual that people can then imitate for their own work.


bob-carpenter commented 8 years ago

Chapter 8. Missing Data & Partially Known Parameters.

It's not that Stan can't do it, it's that it's syntactically ugly. What users want (I presume) is something like what BUGS/JAGS do.

Suppose you have the following BUGS model (with dot_product replaced by whatever you have to do in BUGS to multiply vectors and get them into the location parameter of a normal):

sigma ~ inv_gamma(1, 1)
for (k in 1:K)
  beta[k] ~ normal(0, 0.01)
for (n in 1:N)
  y[n] ~ normal(dot_product(x[n], beta), sigma)

If some of the predictors x[n, k] are missing, the model won't compile because you haven't defined the missing x[n, k] as either deterministic or stochastic nodes.

But, if you add this:

for (n in 1:N)
  for (k in 1:K)
    x[n, k] ~ normal(0, 0.01);

then you're good to go in BUGS and the values for missing x[n, k] will be imputed like any other parameter. You could even model those x[n, k] by predictor with something like:

 x[n, k] ~ normal(mu[k], tau[k]);

and of course give mu and tau priors.

You can get exactly the same behavior in Stan (as explained in manual Chapter 8), but the code's rather a mess. We also don't have any kind of NA input, so you have to indicate the known and unknown values in some other way (a parallel array of boolean indicators, or just inputting the known ones in long form and then putting everything back together).
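To make "rather a mess" concrete, here is a hedged sketch of the BUGS behavior written out by hand in Stan, mirroring the example above; the names miss_row, miss_col, and x_miss are made up for illustration, and the missing cells of x are passed in as placeholder values that get overwritten:

```stan
data {
  int N;
  int K;
  int n_miss;
  int miss_row[n_miss];  // row index of each missing predictor
  int miss_col[n_miss];  // column index of each missing predictor
  matrix[N, K] x;        // placeholder values in the missing cells
  vector[N] y;
}
parameters {
  vector[K] beta;
  real<lower=0> sigma;
  real mu[K];            // per-predictor location, as in the BUGS version
  real<lower=0> tau[K];
  real x_miss[n_miss];   // the imputed values
}
transformed parameters {
  matrix[N, K] x_full;
  x_full <- x;
  // overwrite the placeholders with parameters: this is the
  // "putting everything back together" step done manually
  for (i in 1:n_miss)
    x_full[miss_row[i], miss_col[i]] <- x_miss[i];
}
model {
  for (k in 1:K)
    col(x_full, k) ~ normal(mu[k], tau[k]);
  y ~ normal(x_full * beta, sigma);
}
```

The missing x entries are sampled like any other parameter, exactly as in BUGS, but the bookkeeping lives in the user's code rather than the language.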


kkmann commented 8 years ago

+1. The core of the problem seems to be that Stan distinguishes between parameters and data. From a theoretical perspective there is no need to do so under the Bayesian paradigm, and it would be nice to have a sleeker interface that gives this JAGS-like behaviour to those who want it and know what they are doing.

bob-carpenter commented 8 years ago

Yes, it'd be nice to have such an interface. We'd

Then we need to

And of course, we need

We're always open to volunteers. It's not on any of our existing developers' agendas as far as I know.


syclik commented 7 years ago

With compound declare-and-define available, @mbrubake's original suggestion should be doable. We'd need two functions, and we could do this:

data {
  matrix[N, M] Data;
}
transformed data {
  int missing_inds[length_missing_values(Data)] = find_missing_values(Data);
}
parameters {
  real missing_Data[length(missing_inds)];
}
transformed parameters {
  matrix[N, M] full_Data = fill_missing_values(Data, missing_inds, missing_Data);
}
model {
...
}

I guess we could have done that before too.

Questions: is find_missing_values() an appropriate name? This will depend on #637.