stan-dev / rstan

RStan, the R interface to Stan
https://mc-stan.org
1.03k stars 264 forks source link

support I/O of NaN, +inf, -inf values #67

Open bob-carpenter opened 10 years ago

bob-carpenter commented 10 years ago

This is on the border between bug and feature. We should be reading in NaN and +inf and -inf values through all our interfaces. I think the output's OK now, but we can't input NaN or inf now. Or if we can, it's not clear to me or our users how to do it.

The corresponding issue for CmdStan: https://github.com/stan-dev/stan/issues/637

I'm not sure what the status is of PyStan on this.

ariddell commented 10 years ago

I'm happy to copy/translate RStan's tests -- do they exist for this?

bob-carpenter commented 10 years ago

I'm afraid not --- RStan still doesn't support I/O for NaN or inf.

So I'm not sure what you'd need to do to get it into Python.

It's not a huge issue, because rarely is data or inits going to be NaN or +/- inf. The request came from someone wanting to code R's NA ("undefined") value, which isn't really possible on the Stan C++ side, because we're just using IEEE double-precision floating point for arithmetic.

It looks like Python lets you check if a variable is defined, but doesn't let you define it to have an NA-like value, which is what R does. But I'm not a Python expert (I'm barely a Python novice).

On May 5, 2014, at 7:29 PM, Allen Riddell notifications@github.com wrote:

I'm happy to copy/translate RStan's tests -- do they exist for this?

— Reply to this email directly or view it on GitHub.

bgoodri commented 10 years ago

I think we need a more general thinking about special values in Stan. I can't see what sense it makes to have Inf as input, but maybe as output. For missing values, it is often useful to distinguish between a value that could exist but happens to be missing vs. a value that is observed to be undefined or inapplicable. With mi, we use NA for the former and NaN for the latter. Another option would be to do what Stata does (which may be wise if there is ever a Stata interface) and reserve the last 27 values that are supported for that numeric type for various missingness codes, which no one should be using anyway because they will likely overflow. See http://www.stata.com/help.cgi?dta_115#numbers

bob-carpenter commented 10 years ago

On May 6, 2014, at 1:05 AM, bgoodri notifications@github.com wrote:

I think we need a more general thinking about special values in Stan. I can't see what sense it makes to have Inf as input,

I don't see much use for it yet. It could be used for bounds if we allow variable bounds. Negative inf could be used as data for a value that's exponentiated to produce 0. In general, though, I just think they're valid IEEE numeric values we should handle.

but maybe as output. For missing values, it is often useful to distinguish between a value that could exist but happens to be missing vs. a value that is observed to be undefined or inapplicable.

Absolutely. I get what R does with NA.

With mi, we use NA for the former and NaN for the latter. Another option would be to do what Stata does (which may be wise if there is ever a Stata interface)

Yes, we're going to do a Stata interface for IES.

and reserve the last 27 values that are supported for that numeric type for various missingness codes, which no one should be using anyway because they will likely overflow. See http://www.stata.com/help.cgi?dta_115#numbers

That's an interesting approach. I really don't like messing with arithmetic, though. I'd really rather just avoid the NA thing altogether. But I could be convinced otherwise.

bgoodri commented 10 years ago

Technically, you could exp(-Inf) to get 0 in transformed data but why would someone do that when they could just pass the data on the antilog scale? Anyway, that is just -Inf. For +Inf, I can't think of any reason why that would be reasonable input. So one option would be for RStan and other interfaces to map the missing NAs to +Inf. Then, NaN could be used for observed to be undefined.

Another option would be to distinguish between quiet and signaling NaN.

The Stata approach (in addition to being compatible with Stata) doesn't necessarily require messing with arithmetic. We would need to change check_finite() to return an exception if any of the values are >= the 27th biggest representable value. That way, you get an exception with a nice message if it gets passed to a density function.

If we supported sparse Eigen::Matrix, we could do that thing I suggested at one point where missing values were treated as "zeros" and then shadow that with a sparse matrix of parameters where the originally observed values were treated as "zeros".

I don't think ignoring NA altogether is a viable option because missingness is so common and all of the other statistical software deals with them in one way or another.

bob-carpenter commented 10 years ago

On May 6, 2014, at 2:40 PM, bgoodri notifications@github.com wrote:

Technically, you could exp(-Inf) to get 0 in transformed data but why would someone do that when they could just pass the data on the antilog scale? Anyway, that is just -Inf. For +Inf, I can't think of any reason why that would be reasonable input. So one option would be for RStan and other interfaces to map the missing NAs to +Inf. Then, NaN could be used for observed to be undefined.

I'd rather go with some random high valued finite values a la Stata than to mess with unique special values. I think re-using any of these is asking for trouble, but then I'm not really in favor of adding a NaN --- like you'd prefer antilog rather than -inf, I'd rather they just code the missing data as parameters directly or use a database-like notation for existing values.

I know it's not the R or BUGS/JAGS way.

Another option would be to distinguish between quiet and signaling NaN.

We could perhaps use a signaling NaN if they're implemented on all of our platforms and compilers. I found this useful:

http://en.wikipedia.org/wiki/NaN#Signaling_NaN

I haven't tried playing with clang++ on the Mac to swee what happens with std::numeric_limits.

The Stata approach (in addition to being compatible with Stata) doesn't necessarily require messing with arithmetic. We would need to change check_finite() to return an exception if any of the values are >= the 27th biggest representable value. That way, you get an exception with a nice message if it gets passed to a density function.

If we supported sparse Eigen::Matrix, we could do that thing I suggested at one point where missing values were treated as "zeros" and then shadow that with a sparse matrix of parameters where the originally observed values were treated as "zeros".

I'd think we'd want to separate two data structures, sparse matrices, which have zeros, and partial matrices, where only some entries are observed.

I don't think ignoring NA altogether is a viable option because missingness is so common and all of the other statistical software deals with them in one way or another.

R and BUGS do, but obviously not "all" stat software does!

Let's discuss this with more bandwidth at the next Stan meeting we're both at.

bgoodri commented 10 years ago

Summarizing my view, here is basically all you can do with Stan currently

data {
  int<lower=1> K;              // number of variables
  int<lower=1> N;              // number of observations
  matrix[N,K]  X;              // data matrix
  int<lower=0,upper=1> R[N,K]; // two-dimensional array that indicates X element is observed
}
transformed data {
  int<lower=0> n_miss;
  n_miss <- 0;
  for(i in 1:N) for(j in 1:K) if(R[i,j] == 0) n_miss <- n_miss + 1;
}
parameters {
  vector[K] mu;                 # expectation of rows of X
  cholesky_factor_cov[K,K] L;   # Cholesky factor of covariance matrix
  vector[n_miss] X_miss;        # missing values to fill in
}
model {
  /* fill in missing data values as necessary and evaluate multinormal */
  count <- 1;
  for(i in 1:N) {
    vector[K] Y;
    for(j in 1:K) {
      if(R[i,j] == 1) Y[j] <- X[i,j];
      else {
        Y[j] <- X_miss[count];
        count <- count + 1;
      }
    }
    Y ~ multi_normal_cholesky(mu, L);
  }
  /* priors on mu and L */  
}

This behavior is bad in several respects. First, if X were accidentally used in the model block, Stan has no way to catch this mistake because the elements of X where R == 0 have some arbitrary but meaningless numeric value that Stan I/O accepts. Second, it is tedious and error prone to do all that fiddling with R.

I still think there was no consensus as to whether it would be a good idea for Stan to have special values for missing or undefined, but I think that if these existed, then it would then be possible to do stuff more easily with Stan. So, let's suppose missing is a quiet NaN and undefined is a signaling NaN. You could introduce some syntax like

data {
  int<lower=1> K;              // number of variables
  int<lower=1> N;              // number of observations
  matrix[N,K]  X;              // data matrix with missing values
}
parameters {
  vector[K] mu;                 # expectation of rows of X
  cholesky_factor_cov[K,K] L;   # Cholesky factor of covariance matrix
  /* an object with the same dimensions as X but with missing values replaced by unknowns */
  X_complete(X);
}

I suppose you only write the unknown elements of X_complete to the .csv file and permit something like

generated quantities {
  X_complete2(X)
}

if the user actually wants to store tons of matrices that mix observed and estimated elements.

That only works for continuous things. For discrete objects, counts are a fundamental problem. But for categoricals, I don't think it would be so hard to change the behavior of the PMFs to marginalize over the missing values. For example, bernoulli_logit would do one thing if y == 1, another if y == 0, and mix those if y == quiet_NaN.

That's why I think that some progress can be made if Stan would just do what every other statistical software does and have a way to represent missingness so that the appropriate functions could recognize it.

isaactpetersen commented 8 years ago

It's been a couple years since the last message in this thread--any updates on allowing missing data (e.g., via imputation)? Missing data are quite common in data analysis. Having the ability to handle missing data would make Stan much more flexible.

bgoodri commented 8 years ago

Little has changed. With R-style indexing, it takes a bit less typing to do what could be done before.

On Sun, May 15, 2016 at 12:01 AM, dadrivr notifications@github.com wrote:

It's been a couple years since the last message in this thread--any updates on allowing missing data (e.g., via imputation)? Missing data are quite common in data analysis. Having the ability to handle missing data would make Stan much more flexible.

— You are receiving this because you commented. Reply to this email directly or view it on GitHub https://github.com/stan-dev/rstan/issues/67#issuecomment-219265043

isaactpetersen commented 8 years ago

Thanks for the update! Hope that handling missing data is a consideration for the future. I love Stan and would love for it to be able to handle that data that I work with! Thanks for all of your work on the project.

bob-carpenter commented 8 years ago

It's not impossible to code up missing data models in Stan now, just a pain because of the index fiddling and declarations required. We don't really have any good support for multiple imputation on the outside, either. Maybe a case study would help with these.

On May 15, 2016, at 1:30 PM, dadrivr notifications@github.com wrote:

Thanks for the update! Hope that handling missing data is a consideration for the future. I love Stan and would love for it to be able to handle that data that I work with! Thanks for all of your work on the project.

— You are receiving this because you authored the thread. Reply to this email directly or view it on GitHub