matrix normal distribution

Implement one or more forms of the matrix normal distribution. The general distribution is of the form

matrix_normal(matrix[N, P] y | matrix[N, P] mu, cov_matrix[N, N] SigmaRow, cov_matrix[P, P] SigmaCol),

where SigmaRow and SigmaCol are positive definite covariance matrices. The covariance matrix SigmaRow controls the covariance of the rows of y whereas SigmaRow controls the covariances of the columns.

There are three design decisions about how SigmaRow and SigmaCol are parameterized, which are whether to use

diagonal or dense matrices,
covariance or precision (inverse covariance), and
Cholesky factors or full matrices.

There is also the issue of how to resolve the non-identifiability of scale between the two matrices, because

matrix_normal(y | mu, SigmaRow, SigmaCol) == matrix_normal(y | mu, c * SigmaRow, (1 / c) * SigmaCol).

The log density can be expressed as follows, but this is not how we'd compute it.

log matrix_normal(y | mu, SigmaRow, SigmaCol)
= - 1 / 2 * tr(inv(SigmaCol) * (y - mu)' * inv(SigmaRow) * (y - mu))
   - N / 2 * log det(SigmaRow) 
   - P / 2 * log det(SigmaCol)
   - N * P / 2 * log(2 * pi())

If LRow and LCol are Cholesky factors of SigmaRow and SigmaCol so that, e.g., SigmaRow = LRow * LRow', and z ~ normal(0, I), then

y = mu + LRow * z * LCol ~ matrix_normal(mu, SigmaRow, SigmaCol)

As shown on the Wikipedia page for matrix normal, we can evaluate that trace term as

tr(inv(SigmaCol) * (y - mu)' * inv(SigmaRow) * (y - mu))
= (vec(y) - vec(mu))' * inv(SigmaCol kronecker SigmaRow) * (vec(y) - vec(mu))

where vec(y) is the flattening of y. We then want to use the property of Kronecker products that

inv(SigmaCol kronecker SigmaRow) = inv(SigmaCol) kronecker inv(SigmaRow).

@bgoodri suggested going further with Cholesky factorizations LRow and LCol using the rules for inverse of the product of square matrixes inv(A * B) = inv(B) * inv(A) and the trace of a product rule tr(A' * B) = vec(A)' * vec(B), to derive

tr(inv(SigmaCol) * (y - mu)' * inv(SigmaRow) * (y - mu))
= tr(inv(LCol * LCol') * (y - mu)' * inv(LRow * LRow') * (y - mu))
= tr(inv(LCol') * inv(LCol) * (y - mu)' * inv(LRow') * inv(LRow) * (y - mu))
= tr((LCol' \ LCol \ (y - mu)') * (LRow' \ LRow \ (y - mu)))
= dot_product(vec(LCol' \ LCol \ (y - mu')), vec(LRow' \ LRow \ (y - mu)))

Identifiability will be left up to the user. One way to achieve it is to define one of the matrices to be a correlation matrix with a simplex of scales. That's most efficient when done with a simple product of a Cholesky factor rather than a quadratic form for the full correlation matrix. A second, more symmetric way to achieve a proper posterior is to decompose both covariance matrices into scales time Cholesky factors and put priors on both of the scales.

There is also a matrix Student-t distribution, which should just follow whatever matrix normal does.

stan-dev / math

matrix normal distribution #86