stan-dev / math

The Stan Math Library is a C++ template library for automatic differentiation of any order using forward, reverse, and mixed modes. It includes a range of built-in functions for probabilistic modeling, linear algebra, and equation solving.
https://mc-stan.org
BSD 3-Clause "New" or "Revised" License

add Givens-based orthonormal matrix transform and inverse plus Jacobian #2590

Open bob-carpenter opened 2 years ago

bob-carpenter commented 2 years ago

Description

We would like to have an orthonormal matrix type for constrained variables in Stan. That requires defining the transform and its inverse plus log Jacobian determinant in the math library first.

Among the applications are factor models and parameterizations of positive definite matrices in terms of eigenvectors.

This would amount to a data type orthonormal[P, Q] in Stan and three functions in the math library, which have roughly these signatures (see, e.g., corr_matrix_free.hpp and corr_matrix_constrain.hpp in the directory stan/math/prim/fun). This amounts to two files with a total of three function signatures.

orthonormal_free.hpp

Eigen::Matrix<T, 1, -1> orthonormal_free(const Eigen::Matrix<T, -1, -1>& y);

orthonormal_constrain.hpp

// w/o Jacobian adjustment
Eigen::Matrix<T, -1, -1> orthonormal_constrain(const Eigen::Matrix<T, -1, 1>& x, int P, int Q);

// w/ Jacobian adjustment
Eigen::Matrix<T, -1, -1> orthonormal_constrain(const Eigen::Matrix<T, -1, 1>& x, int P, int Q, T& lp);

Example
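A hedged sketch of the intended round trip (pseudocode only; orthonormal_free and orthonormal_constrain are proposed here, not yet implemented):

```cpp
// Proposed round-trip identity (pseudocode; these functions do not exist yet):
//
//   Eigen::MatrixXd y = ...;                          // P x Q orthonormal matrix
//   Eigen::VectorXd x = orthonormal_free(y);          // unconstraining transform
//   Eigen::MatrixXd y2 = orthonormal_constrain(x, P, Q);
//   // y2 should equal y up to floating-point error
```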

Reference

We can follow the code and transforms laid out here:

Arya A. Pourzanjani, Richard M. Jiang, Brian Mitchell, Paul J. Atzberger, Linda R. Petzold. 2021. Bayesian Inference over the Stiefel Manifold via the Givens Representation. Bayesian Analysis.

Derivative reduction

Algorithm 1 has the constraining transform, where angles are defined by

theta[i, j] = atan2(y[i, j], y[i, i]).

Equation (4.5) lists the Jacobian adjustment, which involves terms that on the log scale are equal to

(j - i - 1) * log(cos(theta[i, j]))

We can exploit the fact that

d/du log cos(u) = -tan(u)

and the definition of theta[i, j], which gives tan(theta[i, j]) = y[i, j] / y[i, i], to reduce the derivative to

d/d.theta[i, j] (j - i - 1) * log(cos(theta[i, j])) = -(j - i - 1) * y[i, j] / y[i, i].

Auxiliary variable approach

What I don't see how to do immediately is deal with the overparameterization and unit normal "prior" that makes Pourzanjani et al.'s approach feel very much like Marsaglia's approach to unit vectors that we currently employ.

Unconstraining transform

We also need the transform to go from the constrained to the unconstrained spaces (for initialization). I couldn't find it in the paper.

Original GitHub repo

More details can probably be extracted from Pourzanjani's GitHub repo, TfRotationPca.

Relation to unit vector coding

We should be able to code unit vectors as (p, 1) orthonormal matrices. The two approaches with auxiliary variables are closely related, and may be identical.

Other repos

We also need to add the docs for the transform to the reference manual (docs repo) and the actual data type to the language and code generator (stanc3 repo).

Unit and negative unit determinants

@bgoodri and Pourzanjani were discussing the fact that the Givens approach leads to solutions which can have positive or negative unit determinants. Maybe we could constrain to one in the prior, but in higher dimensions, this turns into combinatorial multimodality.

Alternatives

There is more than one way to do this. Pourzanjani's paper considers the geodesic Monte Carlo (GMC) approach of Byrne and Girolami. It's also possible to use Cayley transforms.

Expected Output

Invertible transforms with Jacobians so we can sample uniformly over Stiefel manifolds.

Current Version:

v4.1.0

spinkney commented 2 years ago

Unconstraining transform We also need the transform to go from the constrained to the unconstrained spaces (for initialization). I couldn't find it in the paper.

I don't think this is technically invertible on the entire real line. Just like the unit_vector approach isn't really a bijection since the origin is left out (see https://math.stackexchange.com/a/2629480).

@betanalpha commented on the Pourzanjani paper on discourse:

This method does not fully parameterize the space of orthogonal matrices. It instead defines a chart over a Stiefel manifold that will necessarily always have a slice of parameters that it cannot describe (i.e. the chart is not isomorphic to the entire space). All of the issues with reflections and reorientations are consequences of this.

from https://discourse.mc-stan.org/t/new-transform-for-orthonormal-matrices-in-stan/2981/2

Is there a clever way to ignore these spaces for invertibility defined only on the subset?

bob-carpenter commented 2 years ago

What we do with the unit vector coding is to make the transform f (constrained to unconstrained) one to one and the inverse transform f_inv many to one, making sure that f_inv(f(x)) = x for constrained x.

bgoodri commented 2 years ago

Upon further review, can this method generate square orthogonal matrices with negative determinants? I can't seem to generate one when doing this in R:

rstan::expose_stan_functions("https://raw.githubusercontent.com/pourzanj/TfRotationPca/master/Stan/UniformStiefel.stan")
sign_det <- 1
counter <- 1
while (sign_det == 1) {
  theta <- runif(3, min = -pi, max = pi)
  W <- area_form_lp(theta, n = 3L, p = 3L)
  sign_det <- sign(det(W))
  counter <- counter + 1
}
spinkney commented 2 years ago

What we do with the unit vector coding is to make the transform f (constrained to unconstrained) one to one and the inverse transform f_inv many to one, making sure that f_inv(f(x)) = x for constrained x.

Yea, my point is more semantics rather than practical. I don't think it's called a bijection since it cannot be one-to-one. There's more at https://discourse.mc-stan.org/t/divergence-treedepth-issues-with-unit-vector/8059. In fact that thread suggests using a normal(0, 0.1) rather than standard normal for the unit vector transform.

I think this rather subtle point should be made more clear in the docs for unit vector. Similarly, I'm curious when this Givens-based orthonormal transform shows divergences or treedepth issues, which we can document as well.

Here's stan code for the Cayley method https://github.com/michaeljauch/cayley/blob/master/stiefel_unif.stan. I'm not sure if it's correct. The corresponding paper is https://arxiv.org/abs/1810.02881.

betanalpha commented 2 years ago

Yea, my point is more semantics rather than practical. I don't think it's called a bijection since it cannot be one-to-one.

Correct, the “embedding methods” are not bijections and we really shouldn’t be using the same naming for them, including in the code.

The embedding space to embedded surface is a projection/surjection, and the inverse “lifting” map isn’t uniquely well-defined.

I think this rather subtle point should be made more clear in the docs for unit vector. Similarly, I'm curious when this Givens-based orthonormal transform shows divergences or treedepth issues, which we can document as well.

If I remember correctly someone experimented with a proper embedding approach in the Forums (not the Givens or Cayley local charts but an actual embedding like the unit vector implementation) but was plagued by intense divergences and other fitting problems.

The problem in all of these approaches is that the data will inform posterior concentration on only the embedded surface, with no constraints on the behavior transverse to the embedding surface. Only the default prior that is implicitly added affects those directions, and the intersection of these two constraints can lead to a lot of cone-like geometries that diverge at the tip.

bob-carpenter commented 2 years ago

Thanks, @betanalpha, that's all really helpful.

my point is more semantics rather than practical.

I agree. We shouldn't call a non-bijective function a bijection! I believe TensorFlow Probability introduced the "bijector" data structure to encapsulate a transform, an inverse defined over the range of the transform (and possibly extended to a larger domain, hence the non-bijectiveness), and the log absolute Jacobian determinant of the inverse.

In fact that thread suggests using a normal(0, 0.1) rather than standard normal for the unit vector transform.

I take it that makes the actual parameters more unit scale?

I think this rather subtle point should be made more clear in the docs for unit vector

@spinkney If you could open a docs issue with a suggestion on which docs to fix (reference manual or user's guide or both) and what you'd like to see, I can write it.

And thanks for the reference to the other embeddings.

This kind of thing is why we're thinking about adding user-defined transforms to the language. The only question there is how to make the result clearer than spreading transform logic over three blocks as we currently do.

betanalpha commented 2 years ago

This kind of thing is why we're thinking about adding user-defined transforms to the language. The only question there is how to make the result clearer than spreading transform logic over three blocks as we currently do.

In my opinion we should have users implement the transformation themselves, which they can already do:

parameters {
  unconstrained_type unconstrained_name;
}

transformed parameters {
  constrained_type constrained_name = constraining_map(unconstrained_name);
}

Zero ambiguity, with the constraint implemented wherever it's most natural in the Stan program.

We can consider implementing and exposing maps with more direct names, for example "pos_real_to_real" instead of "log", but that introduces the problem that there are many maps from the positive reals to the reals other than the log. Users have to know why they're using log instead of any of the other infinitely many possibilities, although this is already what the language largely forces users to do.

Personally I think much of the problem here is doing too much to abstract important model structure from users, which just facilitates unconsidered defaults.