stan-dev / math

The Stan Math Library is a C++ template library for automatic differentiation of any order using forward, reverse, and mixed modes. It includes a range of built-in functions for probabilistic modeling, linear algebra, and equation solving.
https://mc-stan.org
BSD 3-Clause "New" or "Revised" License
735 stars 185 forks source link

Errors when right-censoring: bug in normal_lccdf? #2847

Open simoncolumbus opened 1 year ago

simoncolumbus commented 1 year ago

I've been running into issues fitting models with a right-censored normal distribution, using brms. I've submitted an issue with brms here: https://github.com/paul-buerkner/brms/issues/1423 However, it appears that the error does not arise from brms. @paul-buerkner suggested it might instead arise from a bug in normal_lccdf, so I thought I'd post it here.

I appears that the issue occurs only

Below is my original post to the brms issue tracker, containing a MWE. Apologies for not testing this with other Stan interfaces.


# Censoring example
set.seed(25)
n <- 5000
t1 <- 106 # Right censoring cutoff
t2 <- 94 # Left censoring cutoff

d <- tibble(y = rnorm(n, mean = 100, sd = 15),
            x = rnorm(n, 0, 1)) %>% 
  # Right censored variables
  mutate(y1   = if_else(y > t1, t1, y),
         cen1 = if_else(y > t1, "right", "none")) %>%
  # Left censored variables
  mutate(y2 = if_else(y < t2, t2, y),
         cen2 = if_else(y < t2, "left", "none"))

# Right censoring
fit <- brm(bf(y1 | cens(cen1) ~ x),
    data = d)

Compiling Stan program...
Start sampling
SAMPLING FOR MODEL 'anon_model' NOW (CHAIN 1).
Chain 1: Rejecting initial value:
Chain 1:   Log probability evaluates to log(0), i.e. negative infinity.
Chain 1:   Stan can't start sampling from this initial value.
...
Chain 1: Initialization between (-2, 2) failed after 100 attempts. 
Chain 1:  Try specifying initial values, reducing ranges of constrained values, or reparameterizing the model.
[1] "Error in sampler$call_sampler(args_list[[i]]) : Initialization failed."
[1] "error occurred during calling the sampler; sampling not done"
 Family: gaussian 
  Links: mu = identity; sigma = identity 
Formula: y1 | cens(cen1) ~ x 
   Data: d (Number of observations: 5000) 

This occurs across multiple datasets and different kinds of models (linear mixed model with one random factor; mixed-effects location scale model), and when init = 0.

Stan code for this model (generated with brms 2.17.0):

functions {
}
data {
  int<lower=1> N;  // total number of observations
  vector[N] Y;  // response variable
  int<lower=-1,upper=2> cens[N];  // indicates censoring
  int<lower=1> K;  // number of population-level effects
  matrix[N, K] X;  // population-level design matrix
  int prior_only;  // should the likelihood be ignored?
}
transformed data {
  int Kc = K - 1;
  matrix[N, Kc] Xc;  // centered version of X without an intercept
  vector[Kc] means_X;  // column means of X before centering
  for (i in 2:K) {
    means_X[i - 1] = mean(X[, i]);
    Xc[, i - 1] = X[, i] - means_X[i - 1];
  }
}
parameters {
  vector[Kc] b;  // population-level effects
  real Intercept;  // temporary intercept for centered predictors
  real<lower=0> sigma;  // dispersion parameter
}
transformed parameters {
  real lprior = 0;  // prior contributions to the log posterior
  lprior += student_t_lpdf(Intercept | 3, 99.9, 9);
  lprior += student_t_lpdf(sigma | 3, 0, 9)
    - 1 * student_t_lccdf(0 | 3, 0, 9);
}
model {
  // likelihood including constants
  if (!prior_only) {
    // initialize linear predictor term
    vector[N] mu = Intercept + Xc * b;
    for (n in 1:N) {
    // special treatment of censored data
      if (cens[n] == 0) {
        target += normal_lpdf(Y[n] | mu[n], sigma);
      } else if (cens[n] == 1) {
        target += normal_lccdf(Y[n] | mu[n], sigma);
      } else if (cens[n] == -1) {
        target += normal_lcdf(Y[n] | mu[n], sigma);
      }
    }
  }
  // priors including constants
  target += lprior;
}
generated quantities {
  // actual population-level intercept
  real b_Intercept = Intercept - dot_product(means_X, b);
}

In contrast, left-censoring works:

# Left censoring
brm(bf(y2 | cens(cen2) ~ x),
    data = d)

I tested this on two machines.

# Machine 1
> sessionInfo()
R version 4.0.4 (2021-02-15)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19043)

Matrix products: default

locale:
[1] LC_COLLATE=English_United Kingdom.1252  LC_CTYPE=English_United Kingdom.1252    LC_MONETARY=English_United Kingdom.1252 LC_NUMERIC=C                           
[5] LC_TIME=English_United Kingdom.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] brms_2.17.0     Rcpp_1.0.9      forcats_0.5.1   stringr_1.4.0   dplyr_1.0.6     purrr_0.3.4     readr_1.4.0     tidyr_1.1.3     tibble_3.1.8   
[10] ggplot2_3.3.6   tidyverse_1.3.1

# Machine 2
> sessionInfo()
R version 4.2.1 (2022-06-23 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19044)

Matrix products: default

locale:
[1] LC_COLLATE=English_United Kingdom.utf8  LC_CTYPE=English_United Kingdom.utf8    LC_MONETARY=English_United Kingdom.utf8 LC_NUMERIC=C                           
[5] LC_TIME=English_United Kingdom.utf8    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] brms_2.17.0     Rcpp_1.0.9      forcats_0.5.2   stringr_1.4.1   dplyr_1.0.9     purrr_0.3.4     readr_2.1.2     tidyr_1.2.0     tibble_3.1.8   
[10] ggplot2_3.3.6   tidyverse_1.3.2
paul-buerkner commented 1 year ago

Not sure who would be the best person to tag here but I think this issue deserves some attention given that normal_lccdf is quite a relevant function. If it turns out to indeed be a bug in that function, we should aim to fix it sooner rather than later. @wds15 would you mind checking or do you have a suggestion for who to tag here?

nhuurre commented 1 year ago

This is probably a duplicate of #1985 .

paul-buerkner commented 1 year ago

Yes, indeed, thank you!

wds15 commented 1 year ago

@paul-buerkner I think @nhuurre is on the spot. The cdf & ccdf's are not great in their precision and looking through the history of things here it's unfortunate that this has not yet been resolved. For brms one could try to solve the matter by resorting to normal_lcdf and sign-flipping the mean possibly? This obviously needs to be solve in Stan-math eventually...but rstan is still so much behind such that most R users would love the workaround...

paul-buerkner commented 1 year ago

I agree. Not sure, how a workaround would look like in brms though, given brms strict naming conventions on functions and our inability to locally overwrite built-in with custom functions (I think?)

wds15 commented 1 year ago

Not sure either. You can't overload a built in function, no. You'd need to rename this. Maybe normalcens_lccdf will be the distribution and then you can create a user function handling this.