mjhajharia / transforms


mathematically defining transforms #36

Open bob-carpenter opened 2 years ago

bob-carpenter commented 2 years ago

The paper's getting a bit out of hand with everyone writing in their own style. I'm trying to work with Meenal now to strip out most of what's there and then build up a draft with coherent notation. As such, the first thing we need to do is define transforms. To me, the goal of the whole business is that when we sample on the unconstrained scale and transform, we get a uniform distribution on the constrained scale. So informally, we want something like this.

Suppose $\mathcal{Y}$ is a domain of unconstrained parameters and $\mathcal{X}$ is a domain of constrained parameters. For now, $\mathcal{Y} = \mathbb{R}^N$ for some $N$ and $\mathcal{X} \subseteq \mathbb{R}^M$ for some $M$. We further suppose a (perhaps improper) density function $p_Y: \mathcal{Y} \rightarrow (0, \infty)$ with support over all elements of $\mathcal{Y}$, and a surjective (onto) map $g:\mathcal{Y} \rightarrow \mathcal{X}$. Informally, we want to ensure that if $Y \sim p_Y(\cdot)$, then $g(Y) \sim \textrm{uniform}(\mathcal{X})$.

The reason this definition doesn't work is that $p_Y$ may be improper, so it doesn't really make sense to talk about $Y \sim p_Y(\cdot)$. So instead, I think we want this:

Definition (Constraining Transform): Suppose $\mathcal{Y}$ is a domain of unconstrained parameters and $\mathcal{X}$ is a domain of constrained parameters. Further suppose a (perhaps improper) density function $p_Y: \mathcal{Y} \rightarrow (0, \infty)$ with support over all elements of $\mathcal{Y}$ and a surjective map $g:\mathcal{Y} \rightarrow \mathcal{X}$. The pair $(p_Y, g)$ is a constraining transform from $\mathcal{Y}$ to $\mathcal{X}$ if for any proper density $\pi_X(x)$ defined over $\mathcal{X}$, the density defined by $\pi_Y(y) = p_Y(y) \cdot \pi_X(g(y))$ is proper and if $Y \sim \pi_Y$, then $X = g(Y) \sim \pi_X$.
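As a concrete sanity check of the definition, here is a sketch (the specific choices $g(y) = e^y$, $p_Y(y) = e^y$, and $\pi_X = \textrm{Exponential}(1)$ are illustrative, not from the paper) that verifies numerically that the tilted density is proper and pushes forward to $\pi_X$:

```python
import numpy as np

# Sanity check of the definition with g(y) = exp(y) mapping R -> (0, inf).
g = np.exp
p_Y = np.exp                          # improper base density on R (the Jacobian of g)

def pi_X(x):
    return np.exp(-x)                 # proper target density on (0, inf): Exponential(1)

def pi_Y(y):
    # tilted density from the definition: pi_Y(y) = p_Y(y) * pi_X(g(y))
    return p_Y(y) * pi_X(g(y))

# 1) pi_Y is proper: it integrates to ~1 over the real line (Riemann sum).
ys = np.linspace(-30.0, 30.0, 600001)
total = np.sum(pi_Y(ys)) * (ys[1] - ys[0])
print(round(total, 6))                # 1.0

# 2) The pushforward of pi_Y under g is pi_X: the density of X = g(Y) is
#    pi_Y(g^{-1}(x)) * |d/dx g^{-1}(x)| = pi_Y(log x) / x.
xs = np.array([0.1, 0.5, 1.0, 2.0, 5.0])
print(np.allclose(pi_Y(np.log(xs)) / xs, pi_X(xs)))   # True
```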

We can then go on and define the special case of one-one changes of variables.

Lemma (Change of Variables): Suppose $\mathcal{Y} = \mathbb{R}^N$, $\mathcal{X} \subseteq \mathbb{R}^N$, and $f:\mathcal{X} \rightarrow \mathcal{Y}$ is a bijection with differentiable inverse. If we let $p_Y(y) = \left| \det \frac{\partial}{\partial y} f^{-1}(y) \right|$, then the pair $(p_Y, f^{-1})$ is a constraining transform. Proof: For any proper density $\pi_X$ on $\mathcal{X}$, the tilted density is $\pi_Y(y) = p_Y(y) \cdot \pi_X(f^{-1}(y)) = \pi_X(f^{-1}(y)) \left| \det \frac{\partial}{\partial y} f^{-1}(y) \right|,$ which by the change of variables formula is exactly the density of $Y = f(X)$ when $X \sim \pi_X$; hence $\pi_Y$ is proper and $f^{-1}(Y) \sim \pi_X$.
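A quick finite-difference sketch of the lemma's Jacobian term for the scalar positivity transform (our choice of example, not one fixed by the paper):

```python
import numpy as np

# Check the lemma's Jacobian for f(x) = log(x) on (0, inf), so
# f^{-1}(y) = exp(y) and |det d/dy f^{-1}(y)| = exp(y), i.e. the
# log-Jacobian is just y.
f_inv = np.exp

def log_det_jac(y):
    return y  # log |d/dy exp(y)| = y

ys = np.array([-2.0, 0.0, 1.5])
h = 1e-6
fd = (f_inv(ys + h) - f_inv(ys - h)) / (2.0 * h)   # central difference
print(np.allclose(np.log(np.abs(fd)), log_det_jac(ys), atol=1e-8))  # True
```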

We can then move on to the example of simplexes, where we are defining a transform from $\mathbb{R}^N$ to $\Delta^N \subseteq \mathbb{R}^{N+1}$ by setting element $N + 1$ to $1$ minus the sum of the first $N$ elements. Another example is the Cholesky factor of covariance matrix transform, which goes from $\binom{M}{2} + M$ dimensions to an $M \times M$ matrix by filling the lower-triangular Cholesky factor and multiplying it by its own transpose.
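For the simplex case, one concrete construction is the stick-breaking map (this sketch follows the version described in the Stan reference manual; the shift term is from memory and worth double-checking):

```python
import numpy as np

def expit(t):
    return 1.0 / (1.0 + np.exp(-t))

def simplex_transform(y):
    """Map y in R^N to a point on the simplex Delta^N in R^{N+1} via
    stick-breaking; the last coordinate is 1 minus the sum of the others."""
    N = len(y)
    x = np.empty(N + 1)
    remaining = 1.0
    for k in range(N):
        # the -log(N - k) shift makes y = 0 map to the uniform simplex point
        z = expit(y[k] - np.log(N - k))
        x[k] = remaining * z
        remaining -= x[k]
    x[N] = remaining
    return x

x = simplex_transform(np.zeros(3))
print(np.allclose(x, 0.25))        # True: y = 0 gives the uniform point
print(np.isclose(x.sum(), 1.0))    # True: lands on the simplex
```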

sethaxen commented 2 years ago

In general I like this formulation, but I had some thoughts (sorry this is long):

bob-carpenter commented 2 years ago

> There may be multiple competing notions of uniformity
>
> I suggest we should be explicit that we assume our densities are defined with respect to a uniform measure for some notion of uniformity, so the desired constraining transform is one that takes some measure on the unconstrained domain and turns it into the desired uniform measure on the constrained domain.

I would love to make this explicit in a way that I can understand. My measure theory is weak past the basic definitions of sigma algebras and Lebesgue integration. The uniformity we want is for all $x \in \mathcal{X}$,

$$p_X(x) = \textrm{const}$$

I don't know how to say that when we're dealing with unconstrained spaces that lead to improper "densities".

"support over all elements of Y" : I think technically it is okay if there are elements not supported, so long as the set of such elements has measure 0.

Thanks. We should be as mathematically precise as possible without turning this into an impenetrable measure theory paper.

In practice, we run into numerical problems even in simple positive-constraining transforms like exp(), which quickly overflows to +infinity and underflows to zero.
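A minimal illustration of that failure mode, and of staying on the log scale to avoid the round trip:

```python
import numpy as np

# Double-precision exp() overflows just above 709.78 and underflows to
# zero near -745, so extreme unconstrained values saturate.
with np.errstate(over='ignore'):
    print(np.isinf(np.exp(710.0)))      # True: overflow to +inf
print(np.exp(-746.0) == 0.0)            # True: underflow to zero

# Working on the log scale sidesteps the problem, e.g. computing
# log(e^a + e^b) stably instead of exponentiating first:
print(np.logaddexp(0.0, 710.0))         # 710.0, no overflow
```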

> Rightly so, a reader may be confused how such a map that apparently has a larger-dimensional output than its input could be bijective.

Isn't it only bijective with the constrained space? The Jacobian is from the Cholesky factor with $\binom{N}{2} + N$ elements to the lower-triangular portion of the covariance matrix, which has the same number of elements. The above-diagonal elements are just copied. So we'll go through and say that as we do it. Same with the simplex---the transform is just to the first $N - 1$ dimensions of the output with the last element being defined as one minus the sum of the other elements.
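A sketch of that point in code (the exp-on-the-diagonal parameterization is one common choice for positivity, not necessarily the paper's): the parameters fill only the lower triangle, and the above-diagonal elements of the result are determined by symmetry, so the Jacobian that matters is square.

```python
import numpy as np

def chol_cov_transform(y, M):
    """Map an unconstrained vector y of length binom(M, 2) + M = M*(M+1)/2
    to an M x M covariance matrix. Only the lower triangle carries
    parameters; the upper triangle of Sigma = L @ L.T is copied by
    symmetry rather than parameterized."""
    assert len(y) == M * (M + 1) // 2
    L = np.zeros((M, M))
    L[np.tril_indices(M)] = y                    # fill lower triangle row by row
    L[np.diag_indices(M)] = np.exp(np.diag(L))   # positive diagonal
    return L @ L.T

Sigma = chol_cov_transform(np.array([0.3, -0.5, 0.1, 0.7, 0.2, -0.4]), 3)
print(np.allclose(Sigma, Sigma.T))                   # True: symmetric
print(bool(np.all(np.linalg.eigvalsh(Sigma) > 0)))   # True: positive definite
```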

> But if we clarify that the notion of uniformity is over a super-diagonal, while the parameterization is the dense symmetric matrix, then it's clear both which map they need to compute the transform and which map they need the Jacobian of.

What does super-diagonal mean?

> This is personal preference, but I think it's cleaner notationally if we work primarily with forward transforms

I was trying to lay out the general approach that way. Do you think we should just skip the usual direction of transform and inverse when talking about changes of variables? We can do that, but we probably want a footnote or something to connect the dots, because every stats presentation I've ever seen does it the way I wrote it in the Stan reference manual.

mjhajharia commented 2 years ago

> Separate from the notion of uniformity is the notion of parameterization, e.g. a transform g:Y→X outputs a certain parameterization of x. For PD matrices, for example, densities are defined with respect to the Lebesgue measure on a triangle of the matrix, so the map from the unconstrained space to the triangle is the one we need the log-det-jac of. But we expect PD matrices to be parameterized not by just a triangle but rather by a dense symmetric matrix. So the transform we write down, and that we might implement in code, often isn't the one we need the Jacobian of.

Similar thing with ALR too, right? We don't actually take the Nth dimension in the Jacobian. I'm not sure how one would go about communicating that idea in a general way.
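A sketch of the ALR point (the pinned-last-logit convention and helper names here are our own): the Jacobian we take is of the map onto the first $N$ simplex coordinates only, which is square; the $(N+1)$th coordinate is pinned by the sum-to-one constraint. For this map the log-det works out to $\sum_k \log x_k$ over all $N+1$ coordinates, which the finite-difference check below confirms.

```python
import numpy as np

def alr_inverse(y):
    """Inverse additive-log-ratio: map y in R^N to the simplex in R^{N+1},
    with the last logit pinned to 0."""
    e = np.exp(np.append(y, 0.0))
    return e / e.sum()

def jacobian_first_N(y, h=1e-6):
    """Finite-difference Jacobian of only the first N simplex coordinates
    with respect to y: the square matrix whose log-det we actually need."""
    N = len(y)
    J = np.zeros((N, N))
    for j in range(N):
        d = np.zeros(N)
        d[j] = h
        J[:, j] = (alr_inverse(y + d)[:N] - alr_inverse(y - d)[:N]) / (2.0 * h)
    return J

y = np.array([0.2, -0.3, 0.5])
J = jacobian_first_N(y)
sign, logdet = np.linalg.slogdet(J)
print(J.shape)       # (3, 3): square, so the determinant is well-defined
print(sign > 0)      # True
# closed form: log det = sum over all N+1 coordinates of log x_k
print(np.allclose(logdet, np.sum(np.log(alr_inverse(y))), atol=1e-6))  # True
```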