Open andrjohns opened 2 years ago
Thank you for working on such a feature! The license approach looks good to me.
Can you tell me a bit about your envisioned integration of this feature in brms? E.g., how well can it be combined with other features and how would the user interface look like?
Great! My current plan is to add a copula
argument to set_rescor
, like: set_rescor(rescor = TRUE, copula = "gaussian")
. This keeps the interface consistent across all correlated-outcome models and also keeps the option available for other copula types to be added in the future.
As for the actual implementation, the Stan code itself is pretty flexible and essentially just takes the same arguments that would be passed to a given non-glm _lpdf
function.
It can be a bit tricky because the parameters/ arguments for each type (e.g., gaussian, poisson) have to be packed into matrices before being passed to their respective marginal
functions. I'm not sure yet how this will interact with the more complex aspects of the brms
language, but it's still early days.
The current rescor
infrastructure appears to handle the multiple outcome types with minimal modifications. Here's a quick example of how the syntax and output currently looks:
setwd("~/brms")
devtools::load_all()
#> i Loading brms
#> Loading required package: Rcpp
#> Loading 'brms' package (version 2.16.11). Useful instructions
#> can be found by typing help('brms'). A more detailed introduction
#> to the package is available through vignette('brms_overview').
set.seed(1234)
n_out = mnormt::rmnorm(n = 500, mean = c(0, 0),
varcov = cbind(c(1,-0.6),c(-0.6,1)))
bdata = data.frame(
y1 = n_out[,1],
y2 = n_out[,2],
c = rpois(500, 10)
)
mn <- bf(y1 ~ 1, family = gaussian())
mn2 <- bf(y2 ~ 1, family = gaussian())
mp <- bf(c ~ 1, family = poisson())
mod_n <- mn + mp + mn2 + set_rescor(rescor = TRUE, copula = "gaussian")
t_n <- brm(mod_n, data = bdata, cores = 4)
#> Compiling Stan program...
#> Start sampling
summary(t_n)
#> Family: MV(gaussian, poisson, gaussian)
#> Links: mu = identity; sigma = identity
#> mu = log
#> mu = identity; sigma = identity
#> Formula: y1 ~ 1
#> c ~ 1
#> y2 ~ 1
#> Data: bdata (Number of observations: 500)
#> Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
#> total post-warmup draws = 4000
#>
#> Population-Level Effects:
#> Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
#> y1_Intercept -0.03 0.05 -0.12 0.06 1.00 4238 3116
#> c_Intercept 2.30 0.01 2.27 2.33 1.00 7766 3061
#> y2_Intercept 0.00 0.04 -0.08 0.09 1.00 4501 3254
#>
#> Family Specific Parameters:
#> Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
#> sigma_y1 1.03 0.03 0.97 1.09 1.00 4913 3228
#> sigma_y2 0.98 0.03 0.93 1.05 1.00 4811 3321
#>
#> Residual Correlations:
#> Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
#> rescor(y1,c) -0.05 0.05 -0.14 0.04 1.00 7102 2851
#> rescor(y1,y2) -0.61 0.03 -0.66 -0.55 1.00 4559 3094
#> rescor(c,y2) 0.07 0.04 -0.02 0.15 1.00 6448 2972
#>
#> Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
#> and Tail_ESS are effective sample size measures, and Rhat is the potential
#> scale reduction factor on split chains (at convergence, Rhat = 1).
Created on 2022-03-14 by the reprex package (v2.0.1)
Where the generated Stan code is:
//> make_stancode(mod_n, data = bdata, backend = "cmdstanr")
// generated with brms 2.16.11
functions {
/*
NOTE:
The Gaussian copula Stan code is distributed against the BSD 3-Clause license
BSD 3-Clause License
Copyright (c) 2021, Sean Pinkney
All rights reserved.
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:
1. Redistributions of source code must retain the above copyright notice, this
list of conditions and the following disclaimer.
2. Redistributions in binary form must reproduce the above copyright notice,
this list of conditions and the following disclaimer in the documentation
and/or other materials provided with the distribution.
3. Neither the name of the copyright holder nor the names of its
contributors may be used to endorse or promote products derived from
this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*/
/* Returns the inverse of the matrix whose Cholesky factor is L
*
* Reimplemented for compatibility with Stan < 2.26
*
* Args:
* L: Matrix that is a Cholesky factor
* Returns:
* The matrix inverse of L * L'
*/
matrix chol2inv2(matrix L) {
int K = rows(L);
matrix[K, K] K_identity = diag_matrix(rep_vector(1.0, K));
matrix[K, K] L_inv = mdivide_left_tri_low(L, K_identity);
matrix[K, K] rtn;
if (K == 0) {
return L;
}
if (K == 1) {
rtn[1, 1] = inv_square(L[1, 1]);
} else {
for (k in 1:K) {
rtn[k, k] = dot_self(tail(L_inv[ : , k], K + 1 - k));
for (j in (k + 1):K) {
int Kmj = K - j;
rtn[k, j] = dot_product(tail(L_inv[ : , k], Kmj + 1),
tail(L_inv[ : , j], Kmj + 1));
rtn[j, k] = rtn[k, j];
}
}
}
return rtn;
}
/**
* Multivariate Normal Copula log density (Cholesky parameterisation)
*
* See https://github.com/spinkney/helpful_stan_functions for
* further documentation
*
* @copyright Ethan Alt, Sean Pinkney 2021
*
* @param U Matrix of outcomes from marginal calculations
* @param L Cholesky factor of the correlation/covariance matrix
* @return log density of distribution
*/
real multi_normal_cholesky_copula_lpdf(matrix U, matrix L) {
int N = rows(U);
int J = cols(U);
matrix[J, J] Gammainv = chol2inv2(L);
return -N * sum(log(diagonal(L))) - 0.5 * sum(add_diag(Gammainv, -1.0) .* crossprod(U));
}
/**
* Mixed Copula Log-Probability Function
*
* @copyright Andrew Johnson 2022 n
*
* @param marginals Nested arrays of matrices from marginal calculations
* @param L Cholesky Factor Correlation
* @return Real log-probability
*/
real centered_gaussian_copula_cholesky_(matrix[,] marginals, matrix L, int[] order) {
// Extract dimensions of final outcome matrix
int N = rows(marginals[1][1]);
int J = rows(L);
matrix[N, J] U;
// Iterate through marginal arrays, concatenating the outcome matrices by column
// and aggregating the log-likelihoods (from continuous marginals) and jacobian
// adjustments (from discrete marginals)
real adj = 0;
int pos = 1;
for (m in 1 : size(marginals)) {
int curr_cols = cols(marginals[m][1]);
U[ : , pos : (pos + curr_cols - 1)] = marginals[m][1];
adj += sum(marginals[m][2]);
pos += curr_cols;
}
// Return the sum of the log-probability for copula outcomes and jacobian adjustments
return multi_normal_cholesky_copula_lpdf(U[ : , order] | L) + adj;
}
/**
* Normal marginal
*
* Standardized normal marginal for mixed continuous-discrete Gaussian
* copula.
*
* See https://github.com/spinkney/helpful_stan_functions for
* further documentation
*
* @copyright Ethan Alt, Sean Pinkney 2021 n
*
* @param y matrix of normal outcomes
* @param mu_glm row_vector of regression means
* @param matrix vector of outcome SD's
* @return 2D array of matrices containing the random variables
* and jacobian adjustments
*/
matrix[] normal_marginal(matrix y, matrix mu_glm, vector sigma) {
int N = rows(mu_glm);
int J = cols(mu_glm);
matrix[N, J] rtn[2];
// Initialise the jacobian adjustments to zero, as vectorised lpdf will be used
rtn[2] = rep_matrix(0, N, J);
for (j in 1 : J) {
rtn[1][ : , j] = (y[ : , j] - mu_glm[ : , j]) / sigma[j];
rtn[2][1, j] = normal_lpdf(y[ : , j] | mu_glm[ : , j], sigma[j]);
}
return rtn;
}
/**
* Poisson marginal
*
* Poisson marginal for mixed continuous-discrete Gaussian
* copula.
*
* See https://github.com/spinkney/helpful_stan_functions for
* further documentation
*
* @copyright Ethan Alt, Sean Pinkney 2021 n
*
* @param y int[,] 2D array of integer counts
* @param mu_glm matrix of regression means
* @param u_raw matrix of nuisance latent variables
* @return 2D array of matrices containing the random variables
* and jacobian adjustments
*/
matrix[] poisson_marginal(int[,] y, matrix mu_glm, matrix u_raw) {
int N = rows(mu_glm);
int J = cols(mu_glm);
matrix[N, J] mu_glm_exp = exp(mu_glm);
matrix[N, J] rtn[2];
for (j in 1 : J) {
for (n in 1 : N) {
real Ubound = poisson_cdf(y[n, j], mu_glm_exp[n, j]);
real Lbound = 0;
real UmL;
if (y[n, j] > 0) {
Lbound = poisson_cdf(y[n, j] - 1, mu_glm_exp[n, j]);
}
UmL = Ubound - Lbound;
rtn[1][n, j] = inv_Phi(Lbound + UmL * u_raw[n, j]);
rtn[2][n, j] = log(UmL);
}
}
return rtn;
}
}
data {
int<lower=1> N; // total number of observations
int<lower=1> N_y1; // number of observations
vector[N_y1] Y_y1; // response variable
int<lower=1> N_c; // number of observations
int Y_c[N_c]; // response variable
int<lower=1> N_y2; // number of observations
vector[N_y2] Y_y2; // response variable
int<lower=1> nresp; // number of responses
int nrescor; // number of residual correlations
int prior_only; // should the likelihood be ignored?
}
transformed data {
int Outcome_Order[3]; // response array
matrix[N, 2] Y_gaussian; // response array
int Y_poisson[N, 1]; // response array
Outcome_Order[1] = 1;
Y_gaussian[ : , 1] = Y_y1;
Outcome_Order[2] = 3;
Y_gaussian[ : , 2] = Y_y2;
Outcome_Order[3] = 2;
Y_poisson[ : , 1] = Y_c;
}
parameters {
real Intercept_y1; // temporary intercept for centered predictors
real<lower=0> sigma_y1; // dispersion parameter
real Intercept_c; // temporary intercept for centered predictors
real Intercept_y2; // temporary intercept for centered predictors
real<lower=0> sigma_y2; // dispersion parameter
matrix<lower=0, upper=1>[N, 1] URaw_poisson;
cholesky_factor_corr[nresp] Lrescor; // parameters for multivariate linear models
}
transformed parameters {
real lprior = 0; // prior contributions to the log posterior
lprior += student_t_lpdf(Intercept_y1 | 3, -0.1, 2.5);
lprior += student_t_lpdf(sigma_y1 | 3, 0, 2.5)
- 1 * student_t_lccdf(0 | 3, 0, 2.5);
lprior += student_t_lpdf(Intercept_c | 3, 2.3, 2.5);
lprior += student_t_lpdf(Intercept_y2 | 3, 0, 2.5);
lprior += student_t_lpdf(sigma_y2 | 3, 0, 2.5)
- 1 * student_t_lccdf(0 | 3, 0, 2.5);
lprior += lkj_corr_cholesky_lpdf(Lrescor | 1);
}
model {
// likelihood including constants
if (!prior_only) {
// initialize linear predictor term
vector[N_y1] mu_y1 = Intercept_y1 + rep_vector(0.0, N_y1);
// initialize linear predictor term
vector[N_c] mu_c = Intercept_c + rep_vector(0.0, N_c);
// initialize linear predictor term
vector[N_y2] mu_y2 = Intercept_y2 + rep_vector(0.0, N_y2);
// multivariate predictor array
matrix[N, 2] Mu_gaussian;
// multivariate predictor array
matrix[N, 1] Mu_poisson;
vector[2] sigma = transpose([sigma_y1, sigma_y2]);
Mu_gaussian[ : , 1] = mu_y1;
Mu_gaussian[ : , 2] = mu_y2;
Mu_poisson[ : , 1] = mu_c;
target += centered_gaussian_copula_cholesky_({normal_marginal(Y_gaussian, Mu_gaussian, sigma),poisson_marginal(Y_poisson, Mu_poisson, URaw_poisson)} , Lrescor, Outcome_Order);
}
// priors including constants
target += lprior;
}
generated quantities {
// actual population-level intercept
real b_y1_Intercept = Intercept_y1;
// actual population-level intercept
real b_c_Intercept = Intercept_c;
// actual population-level intercept
real b_y2_Intercept = Intercept_y2;
// residual correlations
corr_matrix[nresp] Rescor = multiply_lower_tri_self_transpose(Lrescor);
vector<lower=-1,upper=1>[nrescor] rescor;
// extract upper diagonal of correlation matrix
for (k in 1:nresp) {
for (j in 1:(k - 1)) {
rescor[choose(k - 1, 2) + j] = Rescor[j, k];
}
}
}
At the moment it's only hard-coded for gaussian, poisson, bernoulli, and binomial marginals, but is compatible with any distribution that has a CDF defined - so these marginal functions could just be generated in the future
But also, this is more of a proof-of-concept at the moment (I just needed the copulas for a work project), so I'm not committed to any particular implementation
As better reprex, here's an example of the implementation accurately recovering the correlations and generating parameters for a model with gaussian, poisson, and binomial outcomes:
setwd("~/brms")
devtools::load_all()
#> i Loading brms
#> Loading required package: Rcpp
#> Loading 'brms' package (version 2.16.11). Useful instructions
#> can be found by typing help('brms'). A more detailed introduction
#> to the package is available through vignette('brms_overview').
set.seed(2022)
cor <- round(randcorr::randcorr(4), 1) # Rounding for clarity
# Generate correlated standard-normal variates
n_out <- mnormt::rmnorm(n = 500, mean = c(0, 0, 0, 0),
varcov = cor)
# Put on uniform [0, 1] scale using standard normal CDF (Phi function)
p_out <- pnorm(n_out)
# Generate denominators for binomial
d <- sample(10:100, 500, replace = T)
# Transform uniform variates to desired distributions using
# quantile function with desired parameters
bdata <- data.frame(
y1 = qnorm(p_out[,1], 10, 4),
y2 = qnorm(p_out[,2], 3, 6),
c = qpois(p_out[,3], 15),
n = qbinom(p_out[,4], prob = 0.3, size = d),
d = d
)
colnames(cor) <- c("y1","y2","c","n")
rownames(cor) <- c("y1","y2","c","n")
mn <- bf(y1 ~ 1, family = gaussian())
mp <- bf(c ~ 1, family = poisson())
mn2 <- bf(y2 ~ 1, family = gaussian())
mb <- bf(n | trials(d) ~ 1, family = binomial())
mod_n <- mn + mp + mn2 + mb + set_rescor(rescor = TRUE, copula = "gaussian")
t_n <- brm(mod_n, data = bdata, cores = 4)
#> Compiling Stan program...
#> Start sampling
summary(t_n)
#> Family: MV(gaussian, poisson, gaussian, binomial)
#> Links: mu = identity; sigma = identity
#> mu = log
#> mu = identity; sigma = identity
#> mu = logit
#> Formula: y1 ~ 1
#> c ~ 1
#> y2 ~ 1
#> n | trials(d) ~ 1
#> Data: bdata (Number of observations: 500)
#> Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
#> total post-warmup draws = 4000
#>
#> Population-Level Effects:
#> Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
#> y1_Intercept 9.98 0.17 9.66 10.31 1.00 5207 3102
#> c_Intercept 2.70 0.01 2.67 2.72 1.00 4775 3203
#> y2_Intercept 3.35 0.25 2.86 3.83 1.00 5909 2793
#> n_Intercept -0.85 0.01 -0.87 -0.82 1.00 8587 3695
#>
#> Family Specific Parameters:
#> Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
#> sigma_y1 3.99 0.11 3.78 4.22 1.00 7320 3359
#> sigma_y2 5.65 0.16 5.35 5.97 1.00 5606 2976
#>
#> Residual Correlations:
#> Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
#> rescor(y1,c) 0.67 0.02 0.62 0.71 1.00 6889 2900
#> rescor(y1,y2) -0.45 0.03 -0.51 -0.40 1.00 5202 3053
#> rescor(c,y2) -0.70 0.01 -0.72 -0.67 1.00 4789 3717
#> rescor(y1,n) 0.52 0.03 0.46 0.57 1.00 6279 3204
#> rescor(c,n) 0.52 0.03 0.46 0.58 1.00 5366 3266
#> rescor(y2,n) 0.18 0.03 0.12 0.24 1.00 5276 3350
#>
#> Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
#> and Tail_ESS are effective sample size measures, and Rhat is the potential
#> scale reduction factor on split chains (at convergence, Rhat = 1).
cor
#> y1 y2 c n
#> y1 1.0 -0.5 0.7 0.5
#> y2 -0.5 1.0 -0.7 0.2
#> c 0.7 -0.7 1.0 0.5
#> n 0.5 0.2 0.5 1.0
Created on 2022-03-14 by the reprex package (v2.0.1)
This is a really nice and simple interface. I like it a lot! No objections :-)
Just as a heads-up for you. In brms 3.x, which I will work on this year, my plan is to enable more user control over residual correlations. That is, to allow some (blocks of) variables to be correlated while others (blocks of) variables remain uncorrelated. I still have to figure out the interface exactly but it would tentatively extend set_rescor
too. Suggestions for this interface are of course also welcome.
That's a great idea! I'll keep that in mind during the changes so I'm not doing anything too dramatic (or if I have any suggestions!)
At the moment it's only hard-coded for gaussian, poisson, bernoulli, and binomial marginals, but is compatible with any distribution that has a CDF defined - so these marginal functions could just be generated in the future
Andrew, could this structure be used to create a sort of approximation to correlated multinomial outcomes, along the lines of a multinomial probit, or perhaps something more functional with the same result?
could this structure be used to create a sort of approximation to correlated multinomial outcomes
Not an approximation, but exactly that. Gaussian copulas allow you to model the correlation between outcomes separately from the distributions of those outcomes, so you can explicitly model a correlated multinomial.
However, Gaussian copulas require the CDF for the outcome type, and we don't have a multinomial_cdf
function. If you have a Stan-language implementation then I could add that to brms
and the family would then be compatible.
Indeed: I noticed that in Stan there generally don't seem to be CDFs for discrete distributions other than binary outcomes. Perhaps there is a comparable implementation in scipy
or a similar resource we could borrow? Another option might be to borrow from the Poisson CDF given the connection between the Poisson and the multinomial, although I would need to think that through.
It's mostly just a time issue, as development tends to be driven by user demand and developer interest. We have an open issue in the Math library for missing distribution functions: https://github.com/stan-dev/math/issues/2657, I'd recommend commenting in there so it gets added to the radar.
Additionally, if there is a reference implementation or a summary of the math for the needed function that you can provide, then that also saves a great deal of time
Additionally, if there is a reference implementation or a summary of the math for the needed function that you can provide, then that also saves a great deal of time
Found a few and linked to them in that math issue.
Quick progress update for anyone that's interested.
I've just about finished configuring the implementation to automatically generate the Stan code for the specified marginal distributions, so that it can be automatically compatible with any distribution that has a CDF, rather than hard-coding for a handful of them.
This has identified some issues with numerical stability, as the CDF functions can underflow/overflow relatively easily (as I discovered with some binomial outcomes). The natural resolution is to instead call the log-CDF (noting that the Stan Math implementation might sometimes be just calling the CDF and taking the log, which is another issue). However, the estimation requires passing the result to the inv_Phi
function, so we need an implementation of the inv_Phi
that takes inputs on the log scale. I've implemented this in pure Stan here, with the aim to have it implemented in the Math library eventually
Cool! Is there a reference for that implementation?
Its actually just the Stan Math implementation but translated to the log scale, and I can see from the history that you've already been involved in that one, so you probably have a better idea than me on the reference!
Note that this will also be much faster once it's in c++ - since the autodiff is currently calculating gradients for each of the intermediate log_sum_exp
calls
This implementation looks great, great work @andrjohns!
Is there anything we can do to help get the branch implementing this merged into brms
? I would love to be using and building on this feature, even with the issues identified in https://github.com/paul-buerkner/brms/issues/1317#issuecomment-1072986536.
Hey Paul,
I'm currently working on adding gaussian copulas to
brms
using @spinkney's implementations over in https://github.com/spinkney/helpful_stan_functions/blob/main/functions/copula/centered_gaussian_copula.stanfunctionsThe Stan code in that repository is licensed under BSD-3, which I believe is compatible with the
brms
GPLv2 license (according to the FSF), as long as the copyright notice is included.For the
brms
integration, I'm taking the approach of including the copyright notice when the copula code is used. For example:Would you have any issues with this approach? You can see it in action on my branch over here: https://github.com/andrjohns/brms/tree/feature/gaussian-copula/inst/chunks/copula